Datawise

Concatenation operator in BigQuery

Constantin Lungu — Sun, 28 Apr 2024 21:43:07 GMT

You might have encountered the slightly odd-looking || in SQL before, whether in BigQuery or another database system.

If you haven't, it's called the 'concatenation operator' and, well, it concatenates things.

In fact, it's the ANSI SQL standard concatenation operator, so in theory it should work across database engines (but it doesn't: SQL Server, for example, uses + for concatenating strings).

In BigQuery, it does the same thing as CONCAT() for STRINGs and ARRAY_CONCAT() for ARRAYs.
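
Here's a minimal sketch of the equivalence, using literal values only (the column aliases are made up for illustration):

```sql
SELECT
  'Big' || 'Query' AS string_with_operator,           -- 'BigQuery'
  CONCAT('Big', 'Query') AS string_with_function,     -- 'BigQuery'
  [1, 2] || [3] AS array_with_operator,               -- [1, 2, 3]
  ARRAY_CONCAT([1, 2], [3]) AS array_with_function;   -- [1, 2, 3]
```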

Found it useful? Subscribe to my Analytics newsletter at notjustsql.com.

Using FORMAT_DATE in BigQuery

Constantin Lungu — Sun, 28 Apr 2024 21:29:54 GMT

The code we write daily as Data Engineers is not necessarily complicated.

We're solving a lot of problems like the following:

Given a schedule per day of the week (Monday hours are 10:00 - 22:00 / 10 am - 10 pm), find out what was the schedule for a list of calendar days.

Here's what a BigQuery solution could look like:
- using FORMAT_DATE we can extract the abbreviated week day (%a in the list of format elements for date and time parts, attached in comments)
- transform that to match the casing of the joined column
- (INNER) JOIN
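
The steps above could be sketched roughly like this. The schedule and calendar_days CTEs, their columns and the opening hours are made-up sample data, and I'm assuming the schedule stores the week day as an upper-case abbreviation:

```sql
WITH schedule AS (
  -- hypothetical schedule per day of the week
  SELECT 'MON' AS week_day, '10:00 - 22:00' AS opening_hours UNION ALL
  SELECT 'TUE', '10:00 - 22:00' UNION ALL
  SELECT 'SAT', '12:00 - 18:00'
),
calendar_days AS (
  -- 2024-04-27 is a Saturday, 2024-04-29 a Monday, 2024-04-30 a Tuesday
  SELECT calendar_day
  FROM UNNEST([DATE '2024-04-27', DATE '2024-04-29', DATE '2024-04-30']) AS calendar_day
)
SELECT
  c.calendar_day,
  s.opening_hours
FROM calendar_days AS c
INNER JOIN schedule AS s
  -- %a yields 'Sat', 'Mon', ... so we upper-case it to match the schedule
  ON UPPER(FORMAT_DATE('%a', c.calendar_day)) = s.week_day
ORDER BY c.calendar_day;
```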

What would your solution to such a problem look like?


Raising ERRORS in BigQuery

Constantin Lungu — Wed, 24 Apr 2024 21:30:45 GMT

Does anyone have interesting use cases for the ERROR function in BigQuery?

If you like BQ errors so much that you've decided to create your own, or if you're debugging with dirty data, maybe check it out.

It will raise the error you specify whenever it is executed. Plus, you can combine it with FORMAT to see which value caused the issue.
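
A minimal sketch of how this could be used to catch dirty data. The input_data CTE and the list of accepted statuses are assumptions for illustration; as written, the query fails on purpose with the message "Unexpected status value: actve":

```sql
WITH input_data AS (
  SELECT 'active' AS status UNION ALL
  SELECT 'inactive' UNION ALL
  SELECT 'actve'  -- the dirty value we want to surface
)
SELECT
  CASE
    WHEN status IN ('active', 'inactive') THEN status
    -- FORMAT embeds the offending value into the error message
    ELSE ERROR(FORMAT('Unexpected status value: %s', status))
  END AS validated_status
FROM input_data;
```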


Cleaning up STRINGS in BigQuery

Constantin Lungu — Wed, 24 Apr 2024 21:27:43 GMT

Data is collected and processed in a number of ways, and it should come as no surprise that it's not always perfect.

Perhaps the most important thing you need to do before analyzing data is have a look at how it's presented and check for irregularities.

Before any sound analysis a great deal of attention needs to be paid to cleaning the data.

Take string columns for instance. In BigQuery, as with other engines, there is a wealth of functions to help you process strings, including:

- TRIM/RTRIM/LTRIM for getting rid of the whitespace
- REPLACE to replace a substring with another one
- UPPER/LOWER etc. to control casing, and NORMALIZE to bring equivalent Unicode strings to the same form
- SUBSTR/SUBSTRING to cut strings and so on.
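
As a rough sketch, a few of these can be chained to bring differently written values to a common form. The raw_data CTE is made-up sample data, and NULLIF is used here (as an extra, not from the list above) to mark empty strings as missing:

```sql
WITH raw_data AS (
  SELECT '  new york ' AS city UNION ALL
  SELECT 'NEW-YORK' UNION ALL
  SELECT ''
)
SELECT
  city AS raw_city,
  -- trim whitespace, unify the separator, unify casing, treat '' as missing
  NULLIF(UPPER(REPLACE(TRIM(city), '-', ' ')), '') AS clean_city
FROM raw_data;
-- clean_city: 'NEW YORK', 'NEW YORK', NULL
```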

The main goal here is to bring everything to a common denominator, being able to tell which observations belong together and which data can be considered "missing".


Lessons from Chernobyl and how they relate to software engineering

Constantin Lungu — Mon, 22 Apr 2024 20:53:48 GMT

Growing up in post-Soviet Moldova, I was repeatedly reminded of the Chernobyl disaster. It was a topic at school, we had liquidators in our small town, and I recall the television coverage each April as the anniversary approached.

The Chernobyl disaster stands out as one of the most horrific technological failures in history. I was struck by how a technology meant for good could cause such devastation when things went wrong. This sparked my interest, leading me to delve deeper into the details. There's even an exceptional TV miniseries that captures the event.


Like many catastrophic events, Chernobyl was the result of a dark combination of factors and coincidences: design flaws, human errors, negligence, technical oversights, flawed processes, and conflicting motivations.

Conflicting motivations? Yes! Despite knowing about the reactor's design flaws, people remained silent for fear of reprisal. Construction shortcuts were taken to meet deadlines and secure bonuses.

On that fateful night of April 26th, 1986, greed likely drove some to rush safety tests for bonuses, while fear paralyzed others from challenging reckless orders.

You might wonder, why does this matter now, nearly 40 years later? And how does it relate to software?

Consider your team or company culture. Are you incentivized to prioritize doing the right thing over meeting deadlines or earning bonuses?

Is there an environment that fosters speaking up when things are amiss? Can you take responsibility for mistakes, even if it means facing criticism?

Developing quality software is not an easy task, and aligning people's interests with the greater good is crucial.

Establishing a culture of openness and transparency, alongside appropriate incentives, should be a priority for anyone leading a team.

In life and software, mishaps occur. However, when people's motivations are aligned, we can collectively work better towards the same goal.


Using INSTR in BigQuery

Constantin Lungu — Mon, 22 Apr 2024 20:50:42 GMT

If you ever need to do something different based on the existence of a particular substring in BigQuery, take a look at the INSTR function.

It returns the 1-based index of the first occurrence of a substring (1 or more characters) in another STRING. The function returns 0 if the substring was not found.

You can also specify the position at which to start the search (like in good old Excel) and which occurrence to return.
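
A small sketch with literal values, so you can see the optional position and occurrence arguments in action (the aliases are just for illustration):

```sql
SELECT
  INSTR('banana', 'an') AS first_occurrence,          -- 2
  INSTR('banana', 'an', 3) AS search_from_position_3, -- 4
  INSTR('banana', 'an', 1, 2) AS second_occurrence,   -- 4
  INSTR('banana', 'xy') AS not_found,                 -- 0
  -- doing something different based on the existence of a substring
  IF(INSTR('banana', 'an') > 0, 'has an', 'no an') AS branch;  -- 'has an'
```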


Cross-dataset Foreign Key referencing in BigQuery

Constantin Lungu — Fri, 19 Apr 2024 14:51:22 GMT

In one of my previous posts I've written about the then-newly-added Primary Key/Foreign Key constraints in BigQuery.

While they are not enforced like in a traditional RDBMS, they can still improve query performance.

They have one important catch though - tables with a primary key and foreign key relationships must be in the same dataset - see error in (1)

How do you get around that?

Well, you could just do a regular copy of the referenced table into your dataset, but it would incur additional storage costs. Maybe not worth it if the table is big.

But there's another BQ feature we can use - table clones.

We can create a table clone of the table we want to reference in our desired dataset (2).

Then, we can reference the table clone when defining the Foreign Key constraints. (3)
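
Steps (2) and (3) could look roughly like the sketch below. All dataset, table, column and constraint names are hypothetical, and I'm assuming the clone does not yet have a primary key declared on it:

```sql
-- (2) clone the referenced table into the dataset of the referencing table
CREATE TABLE sales_dataset.customers_clone
CLONE reference_dataset.customers;

-- assumption: the clone has no primary key yet; skip this if it already has one
ALTER TABLE sales_dataset.customers_clone
  ADD PRIMARY KEY (customer_id) NOT ENFORCED;

-- (3) reference the clone in the foreign key constraint
ALTER TABLE sales_dataset.orders
  ADD CONSTRAINT fk_orders_customer
  FOREIGN KEY (customer_id)
  REFERENCES sales_dataset.customers_clone (customer_id) NOT ENFORCED;
```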

We should keep in mind that identical data in the source table and the clone table is charged only once - so you'd only pay for the storage of the data that differs, if there is any.


Pay attention to cardinality & grain when UNNESTING in BigQuery!

Constantin Lungu — Fri, 19 Apr 2024 14:39:41 GMT

Whenever you're UNNESTing an ARRAY, you're getting a Cartesian product between the row and the array contents. If you unnest another array, you'll get another Cartesian product between the output of the previous unnest and the elements in the current array.

Let's look at an example. A student has their grades stored in an ARRAY as well as their food allergies in another ARRAY.

If we UNNEST both arrays we'll end up having count_of_grades x count_of_allergies rows for each student, 4x3 in this case.

Why does this happen? Well, the allergies and grades have no relationship with each other, they just refer to the same student row.
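
Here's a minimal sketch of that 4x3 situation. The students CTE and its values are made-up sample data:

```sql
WITH students AS (
  SELECT
    'Alice' AS student,
    [9, 8, 10, 7] AS grades,                          -- 4 grades
    ['peanuts', 'gluten', 'lactose'] AS allergies     -- 3 allergies
)
SELECT
  student,
  grade,
  allergy
FROM students
CROSS JOIN UNNEST(grades) AS grade
CROSS JOIN UNNEST(allergies) AS allergy;
-- returns 4 x 3 = 12 rows for the single student row
```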

Take this into account when you're working with nested data.


Constructing STRUCTS in BigQuery

Constantin Lungu — Fri, 19 Apr 2024 14:36:26 GMT

After my previous STRUCTS in BigQuery post, I couldn't skip mentioning the number of options you have when it comes to constructing one.

You can choose whether to provide field type, field name or both.

There are 3 main ways:
- via tuple ('a', 1) - BQ creates a STRUCT and infers the field types from provided values
- untyped STRUCT('a', 1) - untyped here means you don't declare the type, but it is rather inferred from the value literal or the column provided as input
- typed STRUCT - you declare the types of the fields in the structs

There are a couple of things to keep in mind:

- If you don't provide a field name, it will be an anonymous field, meaning you won't be able to access it by field name.
- If you want explicit types + field names, you need to declare the field names together with the types.
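
A short sketch of the construction options side by side (the field names letter and number are picked just for illustration):

```sql
SELECT
  ('a', 1) AS from_tuple,                                     -- anonymous fields, inferred types
  STRUCT('a', 1) AS untyped_anonymous,                        -- anonymous fields, inferred types
  STRUCT('a' AS letter, 1 AS number) AS untyped_named,        -- named fields, inferred types
  STRUCT<STRING, INT64>('a', 1) AS typed_anonymous,           -- declared types, anonymous fields
  STRUCT<letter STRING, number INT64>('a', 1) AS typed_named; -- declared types and names
```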

Watch out: ordering of fields matters in a STRUCT, so for example

SELECT STRUCT(1 AS a, 2 AS b)
UNION ALL
SELECT STRUCT(2 AS b, 1 AS a)

won't match fields according to names!




Using STRUCTS for quick analysis in BigQuery
Constantin Lungu — Fri, 19 Apr 2024 13:59:14 GMT
I've posted earlier about STRUCTS in BigQuery; here's how I use them from time to time to help me debug and analyze data a bit faster.
Since changing filter values for the different test cases / observations you are interested in can be a headache (especially if you have a lot of columns), you could put them in a tuple of STRUCTS and check the matching records at once.
Not a game changer but makes life a bit easier 😁
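
A minimal sketch of the idea, assuming a made-up orders table where we want to inspect two specific combinations of country, channel and year at once:

```sql
WITH orders AS (
  SELECT 'US' AS country, 'web' AS channel, 2023 AS order_year, 100 AS amount UNION ALL
  SELECT 'DE', 'store', 2024, 250 UNION ALL
  SELECT 'FR', 'web', 2024, 75
)
SELECT *
FROM orders
-- each tuple is one test case; add or remove cases without touching the rest of the query
WHERE (country, channel, order_year) IN (
  ('US', 'web', 2023),
  ('DE', 'store', 2024)
);
-- returns the US and DE rows
```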


Understanding STRUCTS in BigQuery
Constantin Lungu — Fri, 19 Apr 2024 13:56:31 GMT
I previously wrote a short intro post about ARRAYS in BigQuery, but from time to time I see people who are just getting started get confused about how they differ from STRUCTS and when each should be used.
Let's clarify this.
STRUCT = a bundle of "columns" inside a single column. STRUCT fields have a mandatory data type (inferred or declared by you) and an optional name. You still have one instance of each field, but this field can be an ARRAY of multiple things.
ARRAY = a bundle of "rows", a list inside a single row. The elements NEED to be of the same type - INT64, STRING, STRUCT etc. An ARRAY cannot have another ARRAY directly under it, but you can get around that with an ARRAY(STRUCT([array_inside_struct])).
You can still combine the two as you want, nest them in multiple layers, as long as you don't have an ARRAY directly under another ARRAY.
STRUCTS can be compared for equality (ARRAYS cannot be compared directly), but you cannot ORDER BY or GROUP BY either of them as a whole.
For STRUCT, you can totally ORDER BY or GROUP BY one of the fields in the STRUCT (as long as it's not a STRUCT or ARRAY itself).
When to use each? Let's look at an example.
STRUCT = a bundle of fields that relate to the same "thing" - say your current address - city, street name, postal_code etc. Helps with a cleaner, more intuitive schema. TYPE = RECORD, MODE = NULLABLE
ARRAY = a list of things (0, 1 or more) that are related to this observation, for example a list of instruments a person plays. TYPE = your_data_type, MODE = REPEATED
ARRAY of STRUCTS = you have a list of "things" that you know multiple things about and want to keep them together, i.e. certifications => (name, from_date).
It would also be a good way to store all the addresses a person has ever had. TYPE = RECORD, MODE = REPEATED
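
Here's how the three shapes could look in a single row. The people CTE and all its values are made up for illustration:

```sql
WITH people AS (
  SELECT
    'Ann' AS person,
    -- STRUCT: one bundle of related fields
    STRUCT('Chisinau' AS city, 'Main Street' AS street_name, 'MD-2001' AS postal_code) AS current_address,
    -- ARRAY: a list of values of the same type
    ['piano', 'guitar'] AS instruments,
    -- ARRAY of STRUCTS: a list of "things" with several attributes each
    [
      STRUCT('SQL Basics' AS name, DATE '2020-01-15' AS from_date),
      STRUCT('Cloud Data Engineer', DATE '2022-06-01')
    ] AS certifications
)
SELECT
  person,
  current_address.city AS city,                          -- 'Chisinau'
  instruments[OFFSET(0)] AS first_instrument,            -- 'piano'
  certifications[OFFSET(0)].name AS first_certification  -- 'SQL Basics'
FROM people;
```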


Sharded tables in BigQuery
Constantin Lungu — Mon, 15 Apr 2024 21:46:36 GMT
Have you ever worked with sharded tables in BigQuery?
I encountered them in a project a long time ago and haven't seen them around much since.
Does the name not ring a bell?
Well, think of it as pseudo-partitioning, a way to store data split between different tables, each having a different suffix in the name dataset.table_name_{your_suffix}.
We'll be getting different tables, but they can be queried as one by using a wildcard *, retrieving data from all the tables matching the wildcard, like having an invisible UNION ALL behind the scenes.
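
As a sketch, querying a group of date-sharded tables with a wildcard could look like this. The project, dataset and events_ prefix are hypothetical; _TABLE_SUFFIX is the pseudo-column holding whatever the * matched:

```sql
SELECT
  _TABLE_SUFFIX AS shard_date,
  COUNT(*) AS row_count
-- matches events_20240101, events_20240102, ... (hypothetical shards)
FROM `my_project.my_dataset.events_*`
-- restrict the scan to the shards for January 2024
WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240131'
GROUP BY shard_date;
```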
In practice, I've seen these suffixes most of the time being dates like YYYYMMDD. So, in this case, BigQuery docs discourage this usage of sharding, citing the overhead in terms of storing a separate schema and metadata + permission checks as compared to just using a date-partitioned table.
So you're better off just using a partitioned table in this case.
They even offer a quick way to convert a group of date-sharded tables to a regular date-partitioned table.
Have you ever encountered any interesting use cases for sharding?


Why does your Data Warehouse need to look more like a pharmacy than a retail store?
Constantin Lungu — Mon, 15 Apr 2024 21:06:33 GMT
I was following a discussion about the Self-Service paradox. Then I saw another LinkedIn post about the random 'one-off' requests.
And then memories just poured in.
When I just started working in Analytics a decade ago, I was pretty hyped up with this idea of data democratization. As an analyst with too much time on my hands, I thought: with the right infra in place, everyone interested can access all data in the org (minus sensitives of course) and uncover some insights that will help improve how things work. Data is fair game for everyone.
Time flies. I picked up SQL and, with a couple of years of experience, I was working as a Business Intelligence Developer across multiple business areas and industries. Then reality struck me.
Even for apparently (to my unprepared mind) straightforward things like revenue, there was no consensus on how they should be calculated. One person might count revenue once the order has been placed, another once it was paid for, and yet another once it was handed over for delivery. Or delivered? You get the idea.
Then, I would get odd one-off requests asking for this or that piece of data. Aggregated at this level, filtering out X and Y. They're all the same but equally different. People would build entire siloed ecosystems of their own, report packs, presentation decks, all sourced from raw data provided by engineers. It became clear that this was not a sustainable track.
Why was this happening? Maybe the analyst wasn't comfortable with SQL or with the self-service tools of the time like SSAS Cubes + Pivots in Excel. Perhaps there was no data dictionary in place so people didn't know what these columns represent. And what are these huuuge integer keys for?
Does this amount include tax or not? How about shipping? What about returns?
Or maybe this data hasn't landed yet in the Data Warehouse and we are due to integrate it sometime in Q3, two years from now. Or maybe they don't trust our transformations?
Without proper controls, this free-for-all leads to conflicting metrics, a skewed understanding of how well a company is doing and, overall, different yardsticks doing the measurements.
With regards to the analogy I've used in the title, I believe that we should treat our Data Warehouse more like a pharmacy and less like a retail store.
Surely, there are over-the-counter products, MDs and treatments. But every one of those comes with instructions and a number of dedicated professionals that can guide and help you when needed. And for some you need a prescription.
I'm very happy when someone uses the data that I help deliver. We have a duty to act as stewards and custodians for the data we are entrusted to process, including the Analytical products it is feeding. Not restrict and keep exclusive to our silos, but inform, educate, collaborate and build for better business outcomes.


Comparing tables with FULL OUTER JOIN
Constantin Lungu — Thu, 11 Apr 2024 21:25:10 GMT
Does your Data Engineering project use a data-diffing tool?
Say you're preparing to deploy a change to a prod table. You've changed the way some metrics are calculated and tweaked some filters. How do you find out what's different between the two tables? How do you tell expected from unexpected differences?
If you lack a specialized tool for data-diffing (like Datafold or Recce), perhaps the simplest validation you can do when comparing two tables (dev and prod versions for example) leverages the FULL OUTER JOIN (or FULL JOIN in some RDBMS).
Start with the grain. Is there any way you can bring the tables to the same grain?
Once you have aligned them to the same grain, you can now join on the respective keys and COUNT the occurrences you care about - what's missing from A, what's missing from B, totals overall. Depending on attributes, you can use other aggregate functions to assess differences - for example, does the SUM of sales amounts match in prod vs dev?
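
A minimal sketch of such a comparison, assuming both tables are already at the same grain (one row per order_id). The prod and dev CTEs are made-up stand-ins for the two table versions:

```sql
WITH prod AS (
  SELECT 1 AS order_id, 100 AS sales UNION ALL
  SELECT 2, 200 UNION ALL
  SELECT 3, 300
),
dev AS (
  SELECT 2 AS order_id, 200 AS sales UNION ALL
  SELECT 3, 350 UNION ALL
  SELECT 4, 400
)
SELECT
  COUNT(*) AS total_keys,                                 -- 4
  COUNTIF(dev.order_id IS NULL) AS missing_from_dev,      -- 1 (order 1)
  COUNTIF(prod.order_id IS NULL) AS missing_from_prod,    -- 1 (order 4)
  COUNTIF(prod.sales != dev.sales) AS different_sales,    -- 1 (order 3)
  SUM(prod.sales) AS prod_sales,                          -- 600
  SUM(dev.sales) AS dev_sales                             -- 950
FROM prod
FULL OUTER JOIN dev
  ON prod.order_id = dev.order_id;
```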
You could also leverage a hashing function + TO_JSON_STRING (check my previous post) to see which rows are different in the two tables.


Splitting a STRING in BigQuery
Constantin Lungu — Thu, 11 Apr 2024 21:15:36 GMT
Splitting a string in BigQuery works pretty much the same as in Excel.
SPLIT works in a similar way to its Excel cousin TEXTSPLIT - it takes a string to be split and a delimiter (which can be multiple characters), and returns an array of elements.
You can access them using the 0-based index or check out my previous post on more options for accessing array elements in BigQuery.
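
A quick sketch with literal values (aliases are just for illustration):

```sql
SELECT
  SPLIT('a,b,c', ',') AS parts,                           -- ['a', 'b', 'c']
  SPLIT('a,b,c', ',')[OFFSET(0)] AS first_part,           -- 'a'
  -- multi-character delimiter
  SPLIT('2024--04--11', '--')[OFFSET(1)] AS month_part,   -- '04'
  -- SAFE_OFFSET returns NULL instead of failing when out of bounds
  SPLIT('a,b,c', ',')[SAFE_OFFSET(5)] AS out_of_bounds;   -- NULL
```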


Search Indexes in BigQuery
Constantin Lungu — Sat, 06 Apr 2024 22:29:41 GMT
Here's something that might be interesting if you analyze large volumes of STRING or JSON data in BigQuery.
Let's look at SEARCH indexes and what you need to know to get started.
So, what are they used for?
You have a big table (>10 GB) with STRING or JSON columns which you perform text analysis on.
Search indexes can help retrieve data more efficiently from unstructured/semi-structured data - columns of type STRING, JSON, ARRAY of STRING, STRUCTS with STRING or JSON columns.
They can help optimize the usage of the SEARCH function as well as other operators you use with string fields like 'STARTS_WITH', 'IN', '=' or 'LIKE'.
How to create one?
CREATE SEARCH INDEX term_search_index ON dataset.table_name(ALL COLUMNS/1 or more columns);
Based on what columns you've indexed, you can leverage the search index to search the entire table (columns with the compatible datatypes) or just a subset of columns of interest.
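
For illustration, here's a sketch on a hypothetical my_dataset.logs table with a STRING column called message (both names are assumptions):

```sql
-- index every compatible column of the table
CREATE SEARCH INDEX logs_search_index
ON my_dataset.logs (ALL COLUMNS);

-- search across all indexed columns by passing the table alias
SELECT *
FROM my_dataset.logs AS l
WHERE SEARCH(l, 'timeout');

-- or search just one column of interest
SELECT *
FROM my_dataset.logs AS l
WHERE SEARCH(l.message, 'timeout');
```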
How to know if a search index is used?
Check the 'Job Information' of your BigQuery job. This will tell you if an index was used, and if not, what the reason was.
Further reading
Check out the text analyzer options to see which different use cases you can cover better. Maybe a future post about this 😁
Words of caution
- works best when you have a lot of distinct values (high query selectivity)
- if you've indexed all the columns any new compatible (STRING, JSON) column in that table will be indexed as well


A portable Data Analytics stack using Docker, Mage, dbt-core, DuckDB and Superset
Constantin Lungu — Fri, 05 Apr 2024 08:55:24 GMT
Just wanted to share a small learning-by-doing project of mine. It's a containerized Data Analytics suite, covering end-to-end analytics process for a small (imaginary) company.
We're talking about:
- generating example data in parquet files using Python
- ingesting data into DuckDB
- modeling data using dbt-core
- loading a DuckDB datamart
- orchestrating using MageAI
- displaying it all in a Superset dashboard.
Each of the components is in a separate Docker container, tied all together with docker-compose.
I've previously set up similar projects with Airflow and Dagster.
It's pretty bare bones (somewhat as intended) and has some rough edges, but it should be a good starting point for a demo, a template or for learning how all these components work together.
I would of course appreciate any feedback or suggestions on how to make it better.


Extract all pattern occurrences in BigQuery
Constantin Lungu — Thu, 04 Apr 2024 20:08:12 GMT
If you ever need to extract information based on a pattern in a BigQuery string, check out the REGEXP_EXTRACT_ALL function.
This will return an array of all the occurrences matching the specified regular expression.
With regards to the pattern itself, I typically test it on a representative example with a regex debugger like regex101.
Worth noting that it has a limitation - the regular expression can have at most one capturing group, so you can't extract multiple patterns at the same time.
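
A small sketch with a literal input string. With a single capturing group, only the captured part of each match ends up in the array:

```sql
SELECT
  REGEXP_EXTRACT_ALL(
    'Order #123 shipped, order #456 pending',
    r'#(\d+)'  -- one capturing group: the digits after the '#'
  ) AS order_numbers;
-- ['123', '456']
```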


Why you should use UNION DISTINCT sparingly
Constantin Lungu — Tue, 02 Apr 2024 23:12:39 GMT
Let's help BigQuery do less unneeded work!
If you're UNIONING two sources known to have distinct values (and they don't have duplicates), go for UNION ALL instead of UNION DISTINCT (plain UNION in some other SQL dialects) to avoid redundant de-duplication.
In the example below, I've unioned two Google Trends tables - one that is only for US terms and another one for the rest of the world. Since one table only contains US and the other everything except the US, we know the union of the two tables to be distinct from the start, thus not needing the UNION DISTINCT.
Indeed, there's no difference for on-demand pricing (same amount of data scanned), but quite a difference for capacity pricing users (1/2 of the slot usage).
So use UNION DISTINCT (and any other DISTINCT) sparingly and when you actually need it.
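
A sketch of the situation with made-up sample data standing in for the two tables. The sources cannot overlap because one holds only US rows and the other everything except the US:

```sql
WITH us_terms AS (
  SELECT 'US' AS country_code, 'eclipse' AS term UNION ALL
  SELECT 'US', 'taxes'
),
international_terms AS (
  SELECT 'DE' AS country_code, 'eclipse' AS term UNION ALL
  SELECT 'FR', 'olympics'
)
SELECT country_code, term FROM us_terms
UNION ALL  -- no de-duplication step needed, the sources are disjoint by country_code
SELECT country_code, term FROM international_terms;
```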


ORDER BY expressions in SQL
Constantin Lungu — Mon, 01 Apr 2024 13:56:56 GMT
Friendly reminder: when you ORDER BY something in SQL, that something does not necessarily need to be a column, but could be an expression, the output of which can be ordered.
In the example below, we'd like to ORDER BY sales in descending order, but show the 'direct' sales first.
This is achieved by using a CASE WHEN that will rank direct sales above other types of sales, then sorting by the sales in descending order.
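
A minimal sketch of this, with a made-up sales_data CTE:

```sql
WITH sales_data AS (
  SELECT 'partner' AS sales_type, 500 AS sales UNION ALL
  SELECT 'direct', 200 UNION ALL
  SELECT 'online', 300 UNION ALL
  SELECT 'direct', 400
)
SELECT sales_type, sales
FROM sales_data
ORDER BY
  CASE WHEN sales_type = 'direct' THEN 0 ELSE 1 END,  -- direct sales first
  sales DESC;                                         -- then biggest sales first
-- direct 400, direct 200, partner 500, online 300
```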


Accessing ARRAY elements in BigQuery
Constantin Lungu — Sun, 31 Mar 2024 15:41:50 GMT
Here are 3 ways we can access elements in a BigQuery array.
- by index: array[index], starting at 0
- using OFFSET(index): array[OFFSET(index)], also starting at 0
- using ORDINAL(index): array[ORDINAL(index)], starting at 1
The above will return an "index out of range" error if the index is out of bounds, so to get around that you can use SAFE_OFFSET and SAFE_ORDINAL.
If you'd like to see what position each element resides at in the array, check WITH OFFSET.
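
Side by side, on a literal array (the data CTE is just sample data):

```sql
WITH data AS (
  SELECT ['a', 'b', 'c'] AS letters
)
SELECT
  letters[0] AS by_index,                       -- 'a'
  letters[OFFSET(0)] AS by_offset,              -- 'a'
  letters[ORDINAL(1)] AS by_ordinal,            -- 'a'
  letters[SAFE_OFFSET(10)] AS by_safe_offset,   -- NULL instead of an error
  letters[SAFE_ORDINAL(10)] AS by_safe_ordinal  -- NULL instead of an error
FROM data;
```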


Enumerating ARRAY elements in BigQuery using WITH OFFSET
Constantin Lungu — Sun, 31 Mar 2024 08:00:26 GMT
In a previous post we've covered what ARRAYS are in BigQuery, their use cases and how to flatten them with UNNEST.
Quite important to mention, ARRAYS are ordered collections (like lists in Python) - you set up that order when creating them. Once you UNNEST them, the order is no longer guaranteed.
In order to retrieve the order in which an element was in an array before UNNESTING (apart from ordering again by something in the array like a timestamp) you can use WITH OFFSET, which will yield an additional column, showing the 0-based index of the element in the original array.
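
A short sketch on a literal array; step and step_position are names picked for illustration:

```sql
SELECT
  step,
  step_position  -- 0-based position in the original array
FROM UNNEST(['wake up', 'coffee', 'work']) AS step WITH OFFSET AS step_position
ORDER BY step_position;
-- ('wake up', 0), ('coffee', 1), ('work', 2)
```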


UNNESTING ARRAYS in BigQuery
Constantin Lungu — Sat, 30 Mar 2024 22:29:05 GMT
Here's perhaps my favorite feature in BigQuery and another one I discovered when switching from SQL Server. It's one of its most powerful features - the support for ARRAYS.
Although a bit intimidating when you see them for the first time, they allow for efficient storage and modelling of one-to-many relationships, and have deep performance implications.
In other database systems that don't support arrays, you'd have to store them in a separate table or resort to workarounds like storing in list-like strings or JSON.
Working with them is quite straightforward once you get the hang of it. Probably the most common operation to do with them is to UNNEST them.
The ARRAY is basically a set of rows inside one of the columns. In the example below each of the members has a list of activities they're signed up for, together with the date they enrolled.
While we only have 4 members (and consequently 4 rows), each member can have 0, 1 or multiple activities they're subscribed to.
If we want to perform operations (say find out what was the earliest enrollment date, or count the distinct activities that the members are enrolled in), we'd need to unpack these rows by UNNESTing the ARRAY where the activities are stored.
This will bring us from 1 (table) row per member to 1 row per each activity a member is enrolled in.
Pay attention here to how we're joining the UNNEST - this will determine whether or not we keep the members that don't have any activities.
LEFT JOIN = keep them
CROSS JOIN / , (also a cross join) / INNER JOIN = exclude them
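
Here's a minimal sketch with two made-up members, one of them without any activities:

```sql
WITH members AS (
  SELECT
    'Ana' AS member_name,
    ARRAY<STRUCT<activity_name STRING, enrolled_on DATE>>[
      ('yoga', DATE '2024-01-10'),
      ('swimming', DATE '2024-02-01')
    ] AS activities
  UNION ALL
  SELECT
    'Ben',
    ARRAY<STRUCT<activity_name STRING, enrolled_on DATE>>[]  -- no activities
)
SELECT
  m.member_name,
  activity.activity_name,
  activity.enrolled_on
FROM members AS m
-- LEFT JOIN keeps Ben (with NULLs); a CROSS JOIN here would drop him
LEFT JOIN UNNEST(m.activities) AS activity;
-- 3 rows: 2 for Ana, 1 for Ben
```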
Do remember to give the UNNESTed items a proper logical name i.e. if you're UNNESTING activities, call it activity for better readability.


Boolean data type in BigQuery
Constantin Lungu — Wed, 27 Mar 2024 16:32:03 GMT
I remember that one of the things that struck me some years ago when switching from SQL Server to BigQuery was the existence of the bool data type in the latter, which I didn't have before.
I still see from time to time BigQuery code that does not leverage it to the fullest.
For example, this means you can just say val_b > val_a AS is_val_b_higher for a boolean flag instead of comparing and determining TRUE/FALSE. Also, just WHERE is_val_b_higher instead of WHERE is_val_b_higher IS/= TRUE.
This, in my opinion, makes the query much more readable when working with boolean flags.
If you properly name the flags, reading the query feels much closer to natural language.
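
A small sketch of both points, using made-up input values:

```sql
WITH input_data AS (
  SELECT 1 AS val_a, 2 AS val_b UNION ALL
  SELECT 5, 3
),
flagged AS (
  SELECT
    val_a,
    val_b,
    val_b > val_a AS is_val_b_higher  -- the comparison already yields a BOOL
  FROM input_data
)
SELECT val_a, val_b
FROM flagged
WHERE is_val_b_higher;  -- no need for "= TRUE"
-- returns the (1, 2) row
```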


COALESCE vs IFNULL vs NULLIF in BigQuery
Constantin Lungu — Mon, 25 Mar 2024 16:34:59 GMT
What are they and when to use them?
- IFNULL tests a column for the NULL value, returning the original value if it is NOT NULL and the second value we provide otherwise. The two values need to be coercible to the same datatype. It works like ISNULL in SQL Server.
- COALESCE works like IFNULL, but for multiple values. The first non-NULL value is returned; if all of them are NULL, the result is NULL.
- NULLIF allows you to replace a given value with a NULL, essentially saying "treat this value as if it were a missing value". An empty string, for example.
See below a representative example.
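
A sketch with literal values (the aliases are just for illustration):

```sql
SELECT
  IFNULL(CAST(NULL AS STRING), 'fallback') AS ifnull_replaces_null,   -- 'fallback'
  IFNULL('original', 'fallback') AS ifnull_keeps_value,               -- 'original'
  COALESCE(CAST(NULL AS STRING), NULL, 'third') AS first_non_null,    -- 'third'
  NULLIF('', '') AS empty_string_as_null,                             -- NULL
  NULLIF('value', '') AS nullif_keeps_value;                          -- 'value'
```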


GREATEST & LEAST in BigQuery
Constantin Lungu — Mon, 25 Mar 2024 16:26:45 GMT
Have you ever had to compute the biggest or smallest value across multiple columns?
If so, note that in addition to using CASE WHEN or IF, we have GREATEST and LEAST, which will do exactly that.
They also remind me of how MIN / MAX work in Excel.
They also come with the advantage that you can compare multiple values at once.
As usual, pay attention to the NULLs - if one of values is NULL, the result would be as well.
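
A quick sketch with literal values, including the NULL behaviour and one possible way around it (replacing the NULL with a default of 0 is an assumption that may not fit your data):

```sql
SELECT
  GREATEST(3, 7, 5) AS biggest,                                    -- 7
  LEAST(3, 7, 5) AS smallest,                                      -- 3
  GREATEST(3, NULL, 5) AS with_null,                               -- NULL
  GREATEST(3, IFNULL(CAST(NULL AS INT64), 0), 5) AS null_handled;  -- 5
```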


RANGE data type in BigQuery
Constantin Lungu — Thu, 21 Mar 2024 14:52:10 GMT
I work quite a lot with temporal/SCD2 type tables, so the new (still in preview) RANGE data type in BigQuery (and its supporting functions) is a welcome addition.
What does it do?
So instead of storing valid_from & valid_to in separate columns, we now have a datatype to store the time segment in a [valid_from, valid_to) interval, of the form:
SELECT RANGE(DATE '2021-01-01', DATE '2023-01-01')
Note that the interval is left closed, right open (so the left bound is included while the right one is not).
This new semantic comes with a set of compatible functions :
- constructors for RANGE and arrays of RANGEs
- RANGE_START and RANGE_END to determine start and end of a segment
- RANGE_OVERLAPS, RANGE_INTERSECT and RANGE_CONTAINS to test the existence of an overlap, obtain the segment that overlaps and test the inclusion of a RANGE in another RANGE, respectively
While perhaps not a game changer, I still find value in this upcoming feature.
Again, since this is still in preview it is not yet ready to be used in production.
See below an illustration of how it is used.
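
A minimal sketch with two overlapping, made-up validity periods (r1 and r2 are names picked for illustration):

```sql
SELECT
  RANGE_START(r1) AS valid_from,                            -- 2021-01-01
  RANGE_END(r1) AS valid_to,                                -- 2023-01-01
  RANGE_OVERLAPS(r1, r2) AS has_overlap,                    -- TRUE
  RANGE_INTERSECT(r1, r2) AS overlap,                       -- [2022-01-01, 2023-01-01)
  RANGE_CONTAINS(r1, DATE '2022-06-01') AS contains_date    -- TRUE
FROM (
  SELECT
    RANGE(DATE '2021-01-01', DATE '2023-01-01') AS r1,
    RANGE(DATE '2022-01-01', DATE '2024-01-01') AS r2
);
```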


DELETE + INSERT vs MERGE in BigQuery
Constantin Lungu — Mon, 18 Mar 2024 21:58:44 GMT
How do you merge changes from staging tables into target tables in BigQuery?
I've previously covered swapping out partitions using the bq command and using a constant false predicate ("MERGE on FALSE"), but I've learned that you can now DELETE entire partitions from tables for free (provided a filter on the partitioning column is used).
That means that instead of merging your changes the old-fashioned way, it might be well worth DELETING the days you would like to update and INSERTING the entire days' data back, sourced from the staging table.
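
The DELETE + INSERT approach could be sketched like this. The table and column names are hypothetical, and I'm assuming target_table is partitioned by event_date:

```sql
-- remove the days we are about to reload, filtering on the partitioning column
DELETE FROM my_dataset.target_table
WHERE event_date BETWEEN DATE '2024-03-17' AND DATE '2024-03-18';

-- load the same days back in full from staging
INSERT INTO my_dataset.target_table (event_date, customer_id, amount)
SELECT event_date, customer_id, amount
FROM my_dataset.staging_table
WHERE event_date BETWEEN DATE '2024-03-17' AND DATE '2024-03-18';
```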
Here's a comparison of the two approaches for the same source and destination tables. As you can see the amount of processed data can be wildly different between the two.
Happy querying!


Using GROUP BY ALL in BigQuery
Constantin Lungu — Tue, 05 Mar 2024 16:13:09 GMT
Already available in other database systems, GROUP BY ALL has been announced in preview for BigQuery as well.
This will allow us to not enumerate all the non-aggregated columns when performing aggregates.
It's definitely better than GROUP BY 1,2,3 which would fail once we'd change the list of columns we'd like to group by. Overall, I find it a useful shorthand when exploring or debugging.
Here's an example of how it looks.
SELECT country, sell_date, SUM(sales) AS total_sales
FROM input_data
-- new
GROUP BY ALL
-- instead of
-- GROUP BY country, sell_date


Using zip in Python
Constantin Lungu — Fri, 23 Feb 2024 15:27:11 GMT
If you haven't encountered it already, note that zip is quite handy in Python.
It allows you to go through multiple iterables (such as lists, sets, tuples etc) at the same time, effectively "zipping" them into an iterator that will produce tuples of elements from each initial collection.
Worth keeping in mind that if the provided iterables' length differs, zip will stop when the shortest of them has reached its end.
Happy coding!
integers = [1, 2, 3, 4, 5]
letters = ['a', 'b', 'c', 'd']
for pair in zip(integers, letters):
    print(pair)
# Prints:
# (1, 'a')
# (2, 'b')
# (3, 'c')
# (4, 'd')


Why you should care about partition pruning in BigQuery
Constantin Lungu — Thu, 22 Feb 2024 10:05:10 GMT
When it comes to performance improvements and cost savings, handling only as much data as we need is very important. And partitioning is a cornerstone here.
Now, when working with tables that are partitioned, BigQuery tries to exclude the partitions it does not need (akin to pruning a tree) based on filters (WHERE clause) and JOINS, thus saving you time and processing power (and money).
But it's not always that simple. If you perform operations on the partitioning field (say the date field in a date-partitioned table), BigQuery might not be able to prune the table accordingly. So you'll end up processing the entire massive table, even though you were only after one single day.
There's an example below with this happening when converting the date to a different timezone, but I've seen it happen with other operations. For instance, pruning would not work in a MERGE statement sourced from two UNIONed partitioned tables.
Check the number of rows read from the table in the examples below.


Calculating the MODE in BigQuery
Constantin Lungu — Wed, 21 Feb 2024 22:15:27 GMT
How do you compute the MODE (most frequent value) in BigQuery?
For the other measures of central tendency, like MEAN and MEDIAN, there are straightforward ways to compute results (the functions AVG and PERCENTILE_CONT/PERCENTILE_DISC respectively), but there's no dedicated function for MODE.
By the way, if you have a huge dataset and can bear some lack of precision, take a look at APPROX_TOP_COUNT.
Say we have the following input data:
Now here's how we can compute it ourselves:
- filter out NULLS (if we want to ignore them) or do nothing if we want to keep them
- compute value counts for our desired grain
- take the most frequent one per our grain using QUALIFY + RANK
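The three steps above translate fairly directly to plain Python. The sketch below is only an analogue, not BigQuery code: the sample readings are made up, None stands in for NULL, and returning every tied value mirrors what filtering on RANK() = 1 would keep.

```python
from collections import Counter


def modes(values, ignore_nulls=True):
    """Return all values tied for the highest frequency."""
    if ignore_nulls:
        values = [v for v in values if v is not None]  # step 1: filter NULLs
    counts = Counter(values)                           # step 2: value counts
    if not counts:
        return []
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]  # step 3: keep rank 1


# Hypothetical sample: one list of readings per grain (here, per sensor).
readings = {
    "sensor_a": [3, 3, 5, None, None, None],
    "sensor_b": [1, 2, 2, 1],
}
for grain, values in readings.items():
    print(grain, modes(values), modes(values, ignore_nulls=False))
```

For sensor_b both 1 and 2 come back, which is the tie behaviour to expect from RANK; a ROW_NUMBER-style pick would keep only one of them.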
Here's how the output would look with NULLS excluded.
And with them included:


Using INCLUDE NULLS with UNPIVOT in BigQuery
Constantin Lungu — Sat, 17 Feb 2024 13:51:00 GMT
While solving a bug I was reminded again that, when UNPIVOTing, rows with NULL values are excluded. Fine.
But it turns out we have the option to specify INCLUDE NULLS with UNPIVOT, which lets us keep those rows in the result set.
Let's look at an example.
This is how it would look if UNPIVOTed as usual:
SELECT
  measurement_date,
  value,
  measurement
FROM input
UNPIVOT (value FOR measurement IN (water_level, temperature, pressure))
As you can see, we don't have the rows where the measurement values are NULL.
How can we fix it? Let's use UNPIVOT in conjunction with INCLUDE NULLS.
SELECT
  measurement_date,
  value,
  measurement
FROM input
UNPIVOT INCLUDE NULLS (value FOR measurement IN (water_level, temperature, pressure))
Voila! The NULL entries are here now.
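To make the difference concrete outside BigQuery, here is a rough Python analogue of the two behaviours. The unpivot helper and the sample measurements are hypothetical, and None stands in for NULL.

```python
def unpivot(rows, id_column, value_columns, include_nulls=False):
    """Turn one wide row per date into one (id, value, name) row per column."""
    result = []
    for row in rows:
        for name in value_columns:
            value = row[name]
            if value is None and not include_nulls:
                continue  # default behaviour: rows with NULL values are dropped
            result.append((row[id_column], value, name))
    return result


# Hypothetical measurements in wide format.
wide = [
    {"measurement_date": "2024-02-01", "water_level": 1.2, "temperature": None, "pressure": 101.3},
    {"measurement_date": "2024-02-02", "water_level": None, "temperature": 4.5, "pressure": None},
]
columns = ["water_level", "temperature", "pressure"]

print(len(unpivot(wide, "measurement_date", columns)))                      # 3
print(len(unpivot(wide, "measurement_date", columns, include_nulls=True)))  # 6
```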


Watch out when using SAFE_CAST in BigQuery
Constantin Lungu — Fri, 16 Feb 2024 14:59:06 GMT
Here's an interesting situation I've seen with BigQuery.
Say a source system provides JSON events with timestamps at microsecond grain (6 decimals, so something like 2024-01-01 14:00:00.123456).
This is cast using SAFE_CAST into a proper TIMESTAMP. All works just fine.
Until one day the source system sends the JSON with timestamps at NANOsecond grain.
Since TIMESTAMP has only MICROsecond precision, the cast quietly fails: no error is raised, and a NULL is returned for a seemingly valid timestamp.
Without proper monitoring, this issue can go unnoticed for quite a while. So watch out 🤔
It just drives home how important it is to have proper monitoring in place and to enforce a robust data contract with data sources.
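Python happens to have a close analogue of this trap, which makes it easy to experiment with. In the sketch below, safe_parse_timestamp and null_rate are hypothetical helpers and the nanosecond value is made up; it shows the same silent-NULL pattern and a very cheap monitoring signal, not BigQuery's own behaviour.

```python
from datetime import datetime


def safe_parse_timestamp(text):
    """Return a datetime, or None when parsing fails (in the spirit of SAFE_CAST)."""
    try:
        # %f accepts at most six fractional digits (microseconds).
        return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")
    except ValueError:
        return None


def null_rate(values):
    """Share of None results; a simple number to alert on."""
    return sum(v is None for v in values) / len(values)


micro = safe_parse_timestamp("2024-01-01 14:00:00.123456")
nano = safe_parse_timestamp("2024-01-01 14:00:00.123456789")

print(micro)                     # 2024-01-01 14:00:00.123456
print(nano)                      # None
print(null_rate([micro, nano]))  # 0.5
```

Tracking the share of NULLs produced by a "safe" conversion is one way to notice the day the source format changes.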


Rolling period calculation in BigQuery
Constantin Lungu — Wed, 17 Jan 2024 14:23:23 GMT
How do you compute a rolling period calculation in BigQuery?
In one of my previous posts, I showcased using RANGE inside window function declarations in the context of computing a cumulative sum.
Today we're going to look at another example often found in the wild - computing a rolling period calculation. Let's look at an example.
We have customer order information and would like to compute a per-customer rolling sum of the previous 60 days' worth of purchases.
How would this look in SQL?
SUM(order_total) OVER (PARTITION BY customer_id ORDER BY UNIX_DATE(order_date) RANGE BETWEEN 59 PRECEDING AND CURRENT ROW) AS rolling_60_days_sum
Let's explain it:
- We start by taking a SUM of order_total with a window declaration, partitioning by customer_id.
- Next, we order by order_date. Since a RANGE frame with numeric offsets needs a single numeric ORDER BY expression, we convert the date with UNIX_DATE, which turns '2021-01-01' into 18628 (days since 1970-01-01).
- We can now use RANGE, setting the frame between the 59 previous days and the current row. This way, if we have any gaps or duplicates in our order data (which is very likely), the calculation still works, as opposed to an approach based on ROWS.
See below an illustration of how it all works.
SELECT
  customer_id,
  order_id,
  order_total,
  order_date,
  SUM(order_total) OVER (
    PARTITION BY customer_id
    ORDER BY UNIX_DATE(order_date)
    RANGE BETWEEN 59 PRECEDING AND CURRENT ROW
  ) AS rolling_60_days_sum
FROM input_data

+-------------+----------+-------------+------------+---------------------+
| customer_id | order_id | order_total | order_date | rolling_60_days_sum |
+-------------+----------+-------------+------------+---------------------+
| Customer-1  | 10001    | 100         | 2021-01-01 | 100                 |
| Customer-1  | 10003    | 75          | 2021-02-15 | 175                 |
| Customer-1  | 10005    | 90          | 2021-03-12 | 165                 |
| Customer-1  | 10001    | 100         | 2021-04-21 | 190                 |
| Customer-1  | 10003    | 75          | 2021-05-12 | 175                 |
| Customer-1  | 10005    | 90          | 2021-06-23 | 165                 |
| Customer-2  | 10002    | 80          | 2021-01-01 | 80                  |
| Customer-2  | 10004    | 120         | 2021-02-04 | 200                 |
| Customer-2  | 10006    | 50          | 2021-03-05 | 170                 |
| Customer-2  | 10002    | 80          | 2021-04-11 | 130                 |
| Customer-2  | 10004    | 120         | 2021-05-12 | 200                 |
| Customer-2  | 10006    | 50          | 2021-06-30 | 170                 |
+-------------+----------+-------------+------------+---------------------+
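To sanity-check the numbers by hand, here is a small Python sketch of the same logic applied to Customer-1's rows from the table. The helper names unix_date and rolling_sum are mine, and the quadratic loop is for clarity only, not something to run on big data.

```python
from datetime import date

# Customer-1's orders from the table: (order_date, order_total).
orders = [
    (date(2021, 1, 1), 100),
    (date(2021, 2, 15), 75),
    (date(2021, 3, 12), 90),
    (date(2021, 4, 21), 100),
    (date(2021, 5, 12), 75),
    (date(2021, 6, 23), 90),
]


def unix_date(d):
    """Days since 1970-01-01, the same number UNIX_DATE produces."""
    return (d - date(1970, 1, 1)).days


def rolling_sum(rows, days=60):
    """For each row, sum the totals whose date falls in the last `days` days."""
    result = []
    for current_date, _ in rows:
        upper = unix_date(current_date)
        lower = upper - (days - 1)  # 59 PRECEDING gives a 60-day window
        result.append(sum(t for d, t in rows if lower <= unix_date(d) <= upper))
    return result


print(unix_date(date(2021, 1, 1)))  # 18628
print(rolling_sum(orders))          # [100, 175, 165, 190, 175, 165]
```

Because the window is defined on day numbers rather than on row positions, gaps between orders need no special handling.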


Generating date intervals in BigQuery
Constantin Lungu — Tue, 16 Jan 2024 11:01:59 GMT
Ever had to generate a date interval in BigQuery?
Take a look at the GENERATE_DATE_ARRAY function.
It takes up to three arguments:
- start_date
- end_date
- an optional step, written as an INTERVAL with a date part of DAY, WEEK, MONTH, QUARTER or YEAR (the default is one day)
Since it generates an ARRAY, we would need to UNNEST it to get one date per row.
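If you ever need the same thing on the Python side of a pipeline, a minimal sketch could look like this. The function name generate_date_list is hypothetical, and it deliberately supports only positive steps expressed in days (so DAY and WEEK, but not MONTH, QUARTER or YEAR).

```python
from datetime import date, timedelta


def generate_date_list(start_date, end_date, step_days=1):
    """Return every date from start_date to end_date (inclusive), step_days apart."""
    if step_days <= 0:
        raise ValueError("step_days must be positive")
    dates = []
    current = start_date
    while current <= end_date:
        dates.append(current)
        current += timedelta(days=step_days)
    return dates


daily = generate_date_list(date(2024, 1, 1), date(2024, 1, 7))
weekly = generate_date_list(date(2024, 1, 1), date(2024, 1, 31), step_days=7)

print(len(daily))   # 7
print(weekly[-1])   # 2024-01-29
```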
If you need something more granular, there is the very similar GENERATE_TIMESTAMP_ARRAY, which supports steps from MICROSECOND up to DAY.
Friendly reminder to not mix and match DATETIME and TIMESTAMP without properly converting between them beforehand - see my previous post.


DATETIME vs TIMESTAMP in BigQuery
Constantin Lungu — Tue, 16 Jan 2024 10:58:21 GMT
DATETIME and TIMESTAMP in BigQuery are not the same and should not be used interchangeably!
One thing I encounter from time to time is the mixing of DATETIME and TIMESTAMP types, or even a casual TIMESTAMP(DATETIME_COLUMN) conversion with no timezone provided.
Don't do this. The two types are different for a very good reason, and you will get a type mismatch error when you try, for example, to compare them.
What's the difference?
- DATETIME is a local (wall-clock) time with no timezone attached, so the same value occurs at different points in time across the globe - it's 17:00 on January 12 first in Tokyo, then Bangalore, London and finally Los Angeles.
- TIMESTAMP is an absolute point in time and uses UTC as a reference.
You can of course convert between the two, but you will NEED to provide a timezone context:
- if starting with a DATETIME, you need to provide a source timezone for the TIMESTAMP to be computed
- if you have a TIMESTAMP, you need to provide a target timezone for the DATETIME to be computed.
See below an illustration of how it's done.
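Python draws the same line between naive and timezone-aware datetimes, so it makes a handy playground. This is a sketch in Python, not BigQuery SQL; the year is arbitrary and the fixed UTC offsets are an assumption that holds for Tokyo and Los Angeles in January (real code would use proper timezone names).

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets keep the sketch dependency-free; both are valid in January.
TOKYO = timezone(timedelta(hours=9))
LOS_ANGELES = timezone(timedelta(hours=-8))

# A naive datetime plays the role of DATETIME: wall-clock time, no zone.
local_time = datetime(2024, 1, 12, 17, 0)

# DATETIME -> TIMESTAMP: a source timezone is needed to pin down an instant.
tokyo_instant = local_time.replace(tzinfo=TOKYO)
la_instant = local_time.replace(tzinfo=LOS_ANGELES)

# Same wall-clock reading, two different absolute instants.
gap = la_instant - tokyo_instant

# TIMESTAMP -> DATETIME: pick a target timezone, then drop the zone info.
tokyo_as_la_wall_clock = tokyo_instant.astimezone(LOS_ANGELES).replace(tzinfo=None)

print(tokyo_instant.astimezone(timezone.utc))  # 2024-01-12 08:00:00+00:00
print(gap)                                     # 17:00:00
print(tokyo_as_la_wall_clock)                  # 2024-01-12 00:00:00
```

The 17-hour gap is exactly why a DATETIME cannot be compared with a TIMESTAMP until a timezone is supplied.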


Using RANGE in Window Functions in BigQuery
Constantin Lungu — Thu, 11 Jan 2024 10:39:12 GMT
After my previous post about computing a cumulative sum in BigQuery, I got a question about RANGE in the window frame specification of a window function. I realized I had never used it before, so I decided to see what it's about.
So how does using RANGE inside an OVER() block differ from using ROWS?
First, note that ROWS works on physical rows (previous row, next row, etc.) within a window, whereas RANGE works on logical values (previous value, next value) of the ORDER BY expression, which makes it quite useful in some situations.
Let's imagine the following scenario:
We have a group of athletes that compete in a running contest. We would like to compare each athlete's time to their peers - but we'll define a peer as someone who is born anywhere between the year before and the year after the athlete was. So for someone born in 1992, we would like to compute the average of athletes born in '91, '92 and '93 for comparison.
How will this be achieved?
We'll use AVG as a window function, ORDER BY birth_year and set the frame to RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING. This way, for our athlete born in '92, the average will be computed by including all athletes born in 1991, 1992 and 1993.
To illustrate why ROWS would not work here, look at the results for 1991. Since there are two athletes born in 1991, the ROWS clause would include only the previous row (also born in 1991) and the next row (born in 1992), thus missing the mark. Check the result in neighbours_average vs neighbours_average_wrong.
RANGE comes with a limitation though - you can only order by a single numeric column.
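The difference between the two frame types is easy to reproduce in a few lines of Python. The athletes and times below are invented for illustration, and range_average and rows_average are hypothetical helpers that mimic RANGE and ROWS frames of one preceding and one following.

```python
# Hypothetical athletes, already sorted by birth year: (name, birth_year, minutes).
athletes = [
    ("Ana", 1990, 50.0),
    ("Ben", 1991, 48.0),
    ("Cal", 1991, 52.0),
    ("Dee", 1992, 47.0),
    ("Eli", 1993, 55.0),
]


def range_average(rows, index, preceding=1, following=1):
    """Average over rows whose birth_year lies within the value range (RANGE)."""
    year = rows[index][1]
    times = [t for _, y, t in rows if year - preceding <= y <= year + following]
    return sum(times) / len(times)


def rows_average(rows, index, preceding=1, following=1):
    """Average over the physically neighbouring rows (ROWS)."""
    window = rows[max(0, index - preceding): index + following + 1]
    return sum(t for _, _, t in window) / len(window)


for i, (name, year, _) in enumerate(athletes):
    print(name, year, range_average(athletes, i), rows_average(athletes, i))
```

For Ben (born 1991) the value-based frame averages four athletes born 1990 to 1992 and gives 49.25, while the row-based frame sees only three neighbouring rows and gives 50.0.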
Hope this was interesting!


Computing a cumulative sum in BigQuery
Constantin Lungu — Thu, 11 Jan 2024 10:24:33 GMT
How do you compute a cumulative SUM in BigQuery?
Today we're going to look at how to compute a cumulative sum in BigQuery, a scenario that pops up now and then and is quite easy to solve using window functions.
In the example below, we have a dataset representing customer orders. We'd like to find out the cumulative sum of orders for each individual customer.
For this we'll need:
- the SUM function used as a window function (with an OVER clause)
- PARTITION BY customer ID to perform the calculation at customer level
- ORDER BY order_date (ascending by default) so that the values are summed up chronologically
- a window frame clause: ROWS BETWEEN UNBOUNDED PRECEDING (starting with the first entry) AND CURRENT ROW (up to and including this row)
See below for an illustration of how it all works. Happy querying!
Bonus point: You can also use a named window declaration for cleaner code.
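For comparison, the same four ingredients can be sketched in plain Python. The order data is made up and cumulative_sums is a hypothetical helper; the comments map each step back to the SQL clause it imitates.

```python
from collections import defaultdict
from itertools import accumulate

# Hypothetical orders: (customer_id, order_date, order_total).
orders = [
    ("c1", "2024-01-05", 100),
    ("c2", "2024-01-06", 40),
    ("c1", "2024-01-09", 50),
    ("c1", "2024-01-02", 25),
    ("c2", "2024-01-10", 60),
]


def cumulative_sums(rows):
    """Return (customer_id, order_date, order_total, running_total) tuples."""
    by_customer = defaultdict(list)  # PARTITION BY customer_id
    for customer_id, order_date, total in rows:
        by_customer[customer_id].append((order_date, total))
    result = []
    for customer_id, items in sorted(by_customer.items()):
        items.sort()  # ORDER BY order_date (ISO date strings sort chronologically)
        running = accumulate(total for _, total in items)  # UNBOUNDED PRECEDING .. CURRENT ROW
        for (order_date, total), cumulative in zip(items, running):
            result.append((customer_id, order_date, total, cumulative))
    return result


for row in cumulative_sums(orders):
    print(row)
```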


Optimizing storage costs in BigQuery
Constantin Lungu — Fri, 22 Dec 2023 23:53:23 GMT
As promised in a previous post about optimizing compute costs, here's the post about storage, so let's dive in.
So how does BigQuery charge for storage?
There are two billing models, set at dataset level: logical and physical.
Logical storage is the default option and is billed on uncompressed size. It is cheaper per GiB (roughly half the price of physical) and you get time travel storage for free.
Physical storage is billed on the compressed, actual size of the bytes stored on disk. It's twice as expensive as logical per GiB and you need to pay for the time travel storage as well, BUT if you have data that compresses well (for example by using arrays), you might be able to save on storage costs.
Now, we can further break these down into:
- active storage: any table or partition (for partitioned tables) modified in the last 90 days
- long-term storage: a table or partition (for partitioned tables) that hasn't changed in the last 90 days; it is 50% cheaper than active storage.
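The logical vs physical trade-off boils down to simple arithmetic, sketched below. The function name and the sizes are hypothetical, prices are relative units (physical costing twice as much as logical, as described above) rather than real rates, and the active vs long-term split is ignored to keep it short.

```python
def cheaper_billing_model(logical_gib, physical_gib, time_travel_gib=0.0,
                          logical_price=1.0, physical_price=2.0):
    """Compare relative storage cost under the two billing models."""
    logical_cost = logical_gib * logical_price
    # Physical billing also charges for the time travel bytes.
    physical_cost = (physical_gib + time_travel_gib) * physical_price
    return "physical" if physical_cost < logical_cost else "logical"


# Compresses 5x, with some time travel data: physical billing wins.
print(cheaper_billing_model(1000, 200, time_travel_gib=50))  # physical
# Compresses less than 2x: logical billing stays cheaper.
print(cheaper_billing_model(1000, 600))                      # logical
```

Under these assumptions the break-even point is a compression ratio of about 2x, and time travel bytes push it a little higher.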
Here's what you can do to help bring these costs down:
- Identify what is taking up storage space: Look at information schema views such as INFORMATION_SCHEMA.TABLE_STORAGE to find out what is taking up the space that you pay for.
- Cleanup: Clean up unneeded data or archive it in cheap long-term storage. Think about storing an aggregated version of historical data to save on storage.
- Leverage table clones and snapshots in BigQuery to save on storing redundant data
- Shorten the time travel window (7 days by default): if you will not need to restore this data, or not for the entire 7 days, you can reduce the window (at dataset level) and store less data. A staging table that is recreated every time, for example, might not need the 7 days. Read more in the post about BigQuery time travel.
- Leverage long-term vs active storage: any table or partition that you don't modify for 90 days becomes 50% cheaper to store, so maybe don't rebuild massive historical tables every time you run the process.
- Use the table and partition expiration settings: you can set particular tables or partitions to expire after a given time, so you don't store data that you don't need
Hope these tips are useful!


Optimizing compute cost in BigQuery
Constantin Lungu — Thu, 21 Dec 2023 15:56:29 GMT
Let's talk money 💸. Now that I've got your attention: more precisely, let's talk about cost optimization as a regular user. As cloud data warehouse users, in addition to ticking off functional requirements and, of course, getting the results in a timely manner, we also care about the costs.
BigQuery costs can be split into compute (what you process) and storage (what you store). Let's look at compute (I promise a post about storage, too).
So it's important to know that BigQuery has two billing models for compute: on-demand and capacity-based.
On-demand is pretty straightforward - you pay per amount of data scanned, say $7.50 per TB, depending on region. Imagine paying for your 🍧 ice cream by weight.
Capacity-based means your company has purchased processing capacity, measured in slots, over a unit of time. Here you pay for the reservation and not for the amount of data you scan. Imagine paying for your ice cream per container that you fill. As long as it fits, regardless of weight, you pay the fixed amount.
So what does optimizing for compute look like for a regular user? When trying out different approaches (especially when working with very big datasets):
- If your company is billed on an on-demand basis, aim for processing less data (see the top-right corner info BEFORE running the query). This means you will get billed less.
- If your company has capacity-based billing, aim for a lower slot time consumed (see execution details AFTER you've run your query). This will ensure there is enough capacity for other workloads in your organization.
Now, the two typically correlate, but there might be cases where you have to pick between a lower slot time and a lower amount of data processed.


Leveraging ARRAYS in BigQuery for query performance
Constantin Lungu — Tue, 19 Dec 2023 22:46:25 GMT
Leveraging your platform functionality goes a long way. When developing data pipelines, besides the functional requirements, we try to optimize for some other important variables, such as cost, resources consumed or runtime. While working with big or complex datasets in BigQuery, I always try to test several approaches to see which one yields the best mix of the above.
Take ARRAYs, for example. They allow us to define one-to-many relationships inside tables while saving on storage and potentially processing power too.
I'll provide a short example. Say we have 100 customers and several thousand dates they visited a website. We could store this in two ways:
- classic approach with one row representing a unique id-date combination
100 ids x 3640 dates each = 364k rows
- leveraging ARRAYs and having one row = one id and its array of dates.
100 ids with an array of 3640 dates each = 100 rows
Let's run a quick query to test the performance of these two. In this particular case, the array example consumes a minuscule fraction of the slot time of the non-array example while also processing only half as many bytes.
I'm definitely not saying this is a "one size fits all" approach; it depends, of course, on data structure, size, querying patterns and other constraints. But whenever you have a challenge like that, it's good to know your options, try out different strategies and pick the one that suits your use case best.


Decorators in Python
Constantin Lungu — Tue, 19 Dec 2023 22:38:33 GMT
One of the important programming patterns is the decorator pattern.
What do decorators do? They modify a function's behavior, allowing us to add functionality without changing the function's own code. But how does it work in Python?
You might have noticed the syntactic sugar notation for the decorator - the so-called 'pie' notation: @.
@decorator
def my_function():
    ...  # do something
Let's look at a quick practical example. We're going to define a 'polite' decorator that will modify announcements issued in a train station.
We will then decorate a function issuing such an announcement.
Here's what it would look like:
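A minimal sketch of such a 'polite' decorator follows. The exact wording of the announcement, the function name announce_delay and the train number are my own assumptions, not the original example.

```python
import functools


def polite(func):
    """Wrap an announcement so that it opens and closes courteously."""
    @functools.wraps(func)  # keep the decorated function's name and docstring
    def wrapper(*args, **kwargs):
        message = func(*args, **kwargs)
        return f"Dear passengers, {message} Thank you for your attention."
    return wrapper


@polite
def announce_delay(train, minutes):
    return f"train {train} is delayed by {minutes} minutes."


print(announce_delay("IC 123", 10))
# Dear passengers, train IC 123 is delayed by 10 minutes. Thank you for your attention.
```

Writing @polite above the function is shorthand for announce_delay = polite(announce_delay).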
Thanks for reading!


Using COUNTIF() in BigQuery
Constantin Lungu — Mon, 18 Dec 2023 10:33:00 GMT
Happy weekend! Do you remember the good old COUNTIF from Excel?
When I was just starting out in Analytics and was working with spreadsheets, I would make heavy use of it.
Well, turns out we have something similar in BigQuery as well.
It's an aggregate function that counts only the rows that fulfill a given condition, for example COUNTIF(a > 10 AND b = 'text').
Of course, this would be pretty much the same as combining COUNT + CASE WHEN.
See a quick example below.
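For intuition, the same idea can be written in a few lines of Python. The rows are invented, count_if is a hypothetical helper, and None stands in for NULL (a NULL comparison does not count as a match).

```python
# Hypothetical rows with the columns a and b used in the condition above.
rows = [
    {"a": 12, "b": "text"},
    {"a": 5, "b": "text"},
    {"a": 30, "b": "other"},
    {"a": 11, "b": "text"},
    {"a": None, "b": "text"},
]


def count_if(items, predicate):
    """Count the items for which predicate(item) is true."""
    return sum(1 for item in items if predicate(item))


matches = count_if(
    rows,
    lambda r: r["a"] is not None and r["a"] > 10 and r["b"] == "text",
)
print(matches)  # 2
```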


Partial functions in Python
Constantin Lungu — Sun, 17 Dec 2023 09:55:45 GMT
Have you ever used partial functions? It's an interesting piece of functionality that can be found in the functools module.
It's pretty straightforward - you take a function that accepts multiple arguments and produce a new function that has one or more of those arguments already set, effectively tailoring your previous function to a specific need.
See below an example of it in action.
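A small sketch with functools.partial is shown here; the power function and the derived square and cube helpers are illustrative names of my own.

```python
from functools import partial


def power(base, exponent):
    """Raise base to the given exponent."""
    return base ** exponent


# Pre-set the exponent to get two single-purpose functions.
square = partial(power, exponent=2)
cube = partial(power, exponent=3)

print(square(4))        # 16
print(cube(2))          # 8
print(square.keywords)  # {'exponent': 2}
```

The partial object keeps a reference to the original function (square.func) and to the arguments that were fixed, which helps when debugging.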


Approximate Aggregate Functions in BigQuery
Constantin Lungu — Thu, 14 Dec 2023 08:32:46 GMT
Sometimes you don't need perfect, but just good enough. Take approximate aggregate functions in BigQuery, for example.
These are aggregate functions that produce approximate results instead of exact ones, with the upside of typically requiring fewer resources for the computation.
When would I use one? Whenever we can live with some uncertainty or a small difference, especially on huge tables, during a preliminary check or during data exploration.
Let's look at a practical example. Suppose we have the following data:
APPROX_TOP_COUNT computes the approximate top N elements and their value counts:
SELECT
  APPROX_TOP_COUNT(value, 5) AS top_value_counts
FROM `learning.data_source`
APPROX_COUNT_DISTINCT computes the approximate distinct count (it can also be used with GROUP BY):
SELECT
  APPROX_COUNT_DISTINCT(value) AS approx_distinct_value_count
FROM `learning.data_source`
You can discover more approximate aggregate functions in the documentation.
Thanks for reading!


Using ARRAY_CONCAT_AGG() in BigQuery
Constantin Lungu — Tue, 12 Dec 2023 16:40:42 GMT
If you're working with ARRAYS in BigQuery, you might need to combine two arrays into one at some point in time. That's why I want to showcase an interesting function - ARRAY_CONCAT_AGG.
What does it do? It's an aggregate function that lets us stitch together the arrays within a group, yielding a single concatenated array.
🔎 But watch out for:
- duplicates - it will not filter out duplicates for you
- a NULL ARRAY is OK, but a NULL element inside an ARRAY is not - you'll get an error
WITH example_table AS (
  SELECT [1, 2] AS array_column UNION ALL
  SELECT [3, 4] UNION ALL
  SELECT NULL UNION ALL
  SELECT [5, NULL, 6]
)
-- this will raise an ERROR because of the NULL in the last row
SELECT ARRAY_CONCAT_AGG(array_column) AS aggregated_array
FROM example_table
🥛+ 🍪 You can pair it with:
- ORDER BY to sort the INPUTS to the function (not the elements inside the array)
- LIMIT to keep only a specified number of input arrays
Let's see a practical example. Suppose our input data looks as follows.
We'd like to combine offices by country into a single array (they are currently stored in one array per region).
WITH input_data AS (
  SELECT
    'US' AS country,
    [
      STRUCT('New York' AS city_name, 1000 AS staff_count),
      STRUCT('Boston' AS city_name, 500 AS staff_count),
      STRUCT('Washington' AS city_name, 300 AS staff_count)
    ] AS offices,
    'US-East' AS region
  UNION ALL
  SELECT
    'US' AS country,
    [
      STRUCT('Los Angeles' AS city_name, 700 AS staff_count),
      STRUCT('Denver' AS city_name, 400 AS staff_count),
      STRUCT('San Francisco' AS city_name, 250 AS staff_count)
    ] AS offices,
    'US-West' AS region
  UNION ALL
  SELECT
    'CA' AS country,
    [
      STRUCT('Calgary' AS city_name, 400 AS staff_count),
      STRUCT('Vancouver' AS city_name, 1100 AS staff_count),
      STRUCT('Edmonton' AS city_name, 150 AS staff_count)
    ] AS offices,
    'CA-West' AS region
  UNION ALL
  SELECT
    'CA' AS country,
    [
      STRUCT('Quebec City' AS city_name, 200 AS staff_count),
      STRUCT('Montreal' AS city_name, 750 AS staff_count),
      STRUCT('Toronto' AS city_name, 800 AS staff_count)
    ] AS offices,
    'CA-East' AS region
)
SELECT
  country,
  ARRAY_CONCAT_AGG(offices) AS offices
FROM input_data
GROUP BY country
Here's what the result would look like:
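As a rough Python analogue (not BigQuery output), grouping the same offices by country and chaining their lists gives one combined list per country. The helper name concat_arrays_by_key is hypothetical, and unlike BigQuery without an ORDER BY, Python keeps the input order.

```python
from collections import defaultdict
from itertools import chain

# The same offices as in the query: (country, region, [(city_name, staff_count), ...]).
input_data = [
    ("US", "US-East", [("New York", 1000), ("Boston", 500), ("Washington", 300)]),
    ("US", "US-West", [("Los Angeles", 700), ("Denver", 400), ("San Francisco", 250)]),
    ("CA", "CA-West", [("Calgary", 400), ("Vancouver", 1100), ("Edmonton", 150)]),
    ("CA", "CA-East", [("Quebec City", 200), ("Montreal", 750), ("Toronto", 800)]),
]


def concat_arrays_by_key(rows):
    """Group rows by country and concatenate their office lists."""
    grouped = defaultdict(list)
    for country, _region, offices in rows:
        if offices is None:
            continue  # a missing (NULL) array is simply skipped
        grouped[country].append(offices)
    return {
        country: list(chain.from_iterable(arrays))
        for country, arrays in grouped.items()
    }


offices_by_country = concat_arrays_by_key(input_data)
for country, offices in offices_by_country.items():
    print(country, offices)
```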
Thanks for reading!


Using ANY_VALUE() in BigQuery
Constantin Lungu — Mon, 11 Dec 2023 23:48:53 GMT
Have you ever used ANY_VALUE in BigQuery?
It's an aggregate function like SUM or AVG, but it returns the value from some row in the group; which row is picked is non-deterministic (though not random). I've been using it in scenarios where there's only one value anyway, such as when PIVOTing.
WITH input_data AS (
  SELECT 'Europe' AS Region, 'Q1' AS quarter, 250000 AS sales
  UNION ALL
  SELECT 'Europe' AS Region, 'Q2' AS quarter, 225000 AS sales
  UNION ALL
  SELECT 'Europe' AS Region, 'Q3' AS quarter, 275000 AS sales
  UNION ALL
  SELECT 'Europe' AS Region, 'Q4' AS quarter, 290000 AS sales
  UNION ALL
  SELECT 'MEA' AS Region, 'Q1' AS quarter, 190000 AS sales
  UNION ALL
  SELECT 'MEA' AS Region, 'Q2' AS quarter, 210000 AS sales
  UNION ALL
  SELECT 'MEA' AS Region, 'Q3' AS quarter, 300000 AS sales
  UNION ALL
  SELECT 'MEA' AS Region, 'Q4' AS quarter, 220000 AS sales
)
SELECT *
FROM input_data
PIVOT(ANY_VALUE(sales) AS sales FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4'));
While researching this post, I found an interesting thing - it supports the HAVING clause, allowing us to restrict the rows this function is aggregating, either by a MIN or MAX of a given expression.
Let's look at how it works. Say we have the following data:
We're going to find the product with the highest sales by value and the product with the lowest sales by quantity in each of the countries.
WITH input_data AS (
  SELECT 'Germany' AS country, 'productA' AS product_id, 200 AS quantity, 5.00 AS price
  UNION ALL
  SELECT 'Germany' AS country, 'productB' AS product_id, 75 AS quantity, 100.00 AS price
  UNION ALL
  SELECT 'Germany' AS country, 'productC' AS product_id, 100 AS quantity, 120.00 AS price
  UNION ALL
  SELECT 'Spain' AS country, 'productA' AS product_id, 300 AS quantity, 5.00 AS price
  UNION ALL
  SELECT 'Spain' AS country, 'productD' AS product_id, 250 AS quantity, 20.00 AS price
  UNION ALL
  SELECT 'Spain' AS country, 'productE' AS product_id, 100 AS quantity, 15.00 AS price
)
SELECT
  country,
  ANY_VALUE(product_id HAVING MAX quantity*price) AS highest_selling_by_value,
  ANY_VALUE(product_id HAVING MIN quantity) AS lowest_selling_by_quantity
FROM input_data
GROUP BY country
Here's what the results would look like:
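The same picks can be reproduced in Python with max and min and a key function, which is a handy way to double-check the logic. This is an analogue rather than BigQuery output; summarize is a hypothetical helper, and on ties Python returns the first match whereas ANY_VALUE may return any of them.

```python
# The same rows as in the query: (country, product_id, quantity, price).
input_data = [
    ("Germany", "productA", 200, 5.00),
    ("Germany", "productB", 75, 100.00),
    ("Germany", "productC", 100, 120.00),
    ("Spain", "productA", 300, 5.00),
    ("Spain", "productD", 250, 20.00),
    ("Spain", "productE", 100, 15.00),
]


def summarize(rows):
    """Per country: product with the top value and product with the lowest quantity."""
    by_country = {}
    for row in rows:
        by_country.setdefault(row[0], []).append(row)
    return {
        country: {
            "highest_selling_by_value": max(items, key=lambda r: r[2] * r[3])[1],
            "lowest_selling_by_quantity": min(items, key=lambda r: r[2])[1],
        }
        for country, items in by_country.items()
    }


for country, picks in summarize(input_data).items():
    print(country, picks)
```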
Thanks for reading!


Using LOGICAL_AND and LOGICAL_OR in BigQuery
Constantin Lungu — Sun, 10 Dec 2023 00:24:38 GMT
Today I wanted to share another BigQuery feature - maybe not the most breathtaking one - but definitely something to have in your toolbox. The occasion to use it might be around the corner.
So, have you ever encountered LOGICAL_AND() and LOGICAL_OR()? Think of them as aggregation functions but for boolean values. As the name implies:
- LOGICAL_AND() returns True if all values are True
- LOGICAL_OR() returns True if at least one value is True
A couple of things to know:
- They take a boolean expression, and the boolean does not need to exist as a column beforehand, i.e. you can totally say: LOGICAL_AND(status = 'ACTIVE')
- As with other aggregation functions, you need to use GROUP BY to compute results per group
Let's look at a quick example of how it all works. Based on the data below, we'd like to find out whether all of a particular customer's orders were paid for and whether the customer has any outstanding amounts for any of their orders.
+-------------+----------+---------+--------------------+
| customer_id | order_id | is_paid | outstanding_amount |
+-------------+----------+---------+--------------------+
| 1           | 1001     | true    | 0                  |
| 1           | 1002     | true    | 0                  |
| 2           | 2001     | true    | 0                  |
| 2           | 2002     | false   | 100                |
| 3           | 3001     | false   | 150                |
| 3           | 3002     | false   | 250                |
+-------------+----------+---------+--------------------+
We're going to use LOGICAL_AND to assess if all of the orders are paid and LOGICAL_OR to check if at least one order has an outstanding amount greater than 0. We also need to group by customer_id.
SELECT
  customer_id,
  LOGICAL_AND(is_paid) AS all_orders_paid,
  LOGICAL_OR(outstanding_amount > 0) AS has_outstanding_amounts
FROM input_data
GROUP BY customer_id
Here's the output of our query.
+-------------+-----------------+-------------------------+
| customer_id | all_orders_paid | has_outstanding_amounts |
+-------------+-----------------+-------------------------+
| 1           | true            | false                   |
| 2           | false           | true                    |
| 3           | false           | true                    |
+-------------+-----------------+-------------------------+
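Python's built-in all() and any() play the same role, so the result is easy to cross-check. The sketch below uses the rows from the input table; summarize_customers is a hypothetical helper, and NULL handling is left out.

```python
from collections import defaultdict

# Rows from the input table: (customer_id, order_id, is_paid, outstanding_amount).
input_data = [
    (1, 1001, True, 0),
    (1, 1002, True, 0),
    (2, 2001, True, 0),
    (2, 2002, False, 100),
    (3, 3001, False, 150),
    (3, 3002, False, 250),
]


def summarize_customers(rows):
    """Per customer: are all orders paid, and is any amount outstanding?"""
    groups = defaultdict(list)  # GROUP BY customer_id
    for customer_id, _order_id, is_paid, outstanding in rows:
        groups[customer_id].append((is_paid, outstanding))
    return {
        customer_id: {
            "all_orders_paid": all(paid for paid, _ in orders),                  # LOGICAL_AND
            "has_outstanding_amounts": any(amount > 0 for _, amount in orders),  # LOGICAL_OR
        }
        for customer_id, orders in groups.items()
    }


for customer_id, flags in summarize_customers(input_data).items():
    print(customer_id, flags)
```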
Thanks for reading!


Comprehensions in Python
Constantin Lungu — Wed, 06 Dec 2023 22:18:07 GMT
Do you work with Python comprehensions in your day-to-day?
Comprehensions are a straightforward way to create lists, sets, dictionaries and generators from existing iterables - such as lists or tuples.
They allow for short, concise notation as opposed to loops. As usual, it's important to think about code readability and not overuse them.
See below a quick worksheet with examples of common comprehensions in Python, applied in a simple scenario - computing number squares.
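A short sketch of the common comprehension forms, all computing squares of the same made-up list of numbers, with the equivalent loop at the end for comparison:

```python
numbers = [1, 2, 3, 4, 5]

squares_list = [n ** 2 for n in numbers]                # list comprehension
even_squares = [n ** 2 for n in numbers if n % 2 == 0]  # with a filter
squares_set = {n ** 2 for n in numbers}                 # set comprehension
squares_dict = {n: n ** 2 for n in numbers}             # dict: number -> square
squares_gen = (n ** 2 for n in numbers)                 # generator, evaluated lazily

# The loop equivalent of squares_list.
squares_loop = []
for n in numbers:
    squares_loop.append(n ** 2)

print(squares_list)      # [1, 4, 9, 16, 25]
print(even_squares)      # [4, 16]
print(squares_dict[3])   # 9
print(sum(squares_gen))  # 55
```

Note that the generator can be consumed only once; after the sum above it is exhausted.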
Thanks for reading!