<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Datawise — SQL, BigQuery & Python for Data Engineers]]></title><description><![CDATA[Practical SQL, BigQuery, and Python tutorials for data engineers. Real-world case studies, no fluff — written by a Staff Data Engineer with 12+ years in the field.]]></description><link>https://datawise.dev</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1698837854116/BOj6tvbdj.png</url><title>Datawise — SQL, BigQuery &amp; Python for Data Engineers</title><link>https://datawise.dev</link></image><generator>RSS for Node</generator><lastBuildDate>Wed, 13 May 2026 15:37:28 GMT</lastBuildDate><atom:link href="https://datawise.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[BigQuery BigLake Tables Explained: What They Are and When to Use Them]]></title><description><![CDATA[If you've worked with BigQuery external tables before, you know the basic idea: a thin wrapper around data that resides somewhere else, but queryable from BigQuery. 
Sources include Cloud Storage, Goog]]></description><link>https://datawise.dev/bigquery-biglake-tables-explained-what-they-are-and-when-to-use-them</link><guid isPermaLink="true">https://datawise.dev/bigquery-biglake-tables-explained-what-they-are-and-when-to-use-them</guid><category><![CDATA[bigquery]]></category><category><![CDATA[GCP]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[google cloud]]></category><category><![CDATA[SQL]]></category><dc:creator><![CDATA[Constantin Lungu]]></dc:creator><pubDate>Sun, 26 Apr 2026 21:13:46 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/641c1535429c76261884ecba/1bb27bbf-e1f3-4a05-928b-0d063216e85f.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you've worked with BigQuery <a href="https://cloud.google.com/bigquery/docs/external-tables">external tables</a> before, you know the basic idea: a thin wrapper around data that resides somewhere else, but queryable from BigQuery. Sources include Cloud Storage, <a href="https://datawise.dev/importing-google-sheets-into-bigquery">Google Sheets</a>, or Google Drive.</p>
<p>Today I'd like to talk about a special variety of external table: the <strong>BigLake table</strong>. It's built to bridge data lakes and data warehouses.</p>
<h3>BigQuery External Tables: Limitations and Pain Points</h3>
<p>With a regular external table, users need access both to the BigQuery table <em>and</em> to the underlying external data source.</p>
<p>If the data is in Cloud Storage, a user needs BigQuery permissions, but also permissions on the bucket and objects. This might be fine for a quick exercise, but the pain points are real:</p>
<ul>
<li><p>you manage permissions in multiple places, for different types of resources</p>
</li>
<li><p>bucket-level access can be too broad</p>
</li>
<li><p>it's harder to apply table-like governance on files</p>
</li>
</ul>
<h3>How BigLake Tables Work: Access Delegation Explained</h3>
<p>BigLake tables use a BigQuery Connection that accesses the files on behalf of the users. Users don't need direct access to the underlying buckets.</p>
<p>This enables table-level security on external data, including:</p>
<ul>
<li><p>row-level security</p>
</li>
<li><p>column-level security</p>
</li>
<li><p>dynamic data masking (for Cloud Storage BigLake tables)</p>
</li>
</ul>
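<p>As a rough illustration of table-level governance on external data, here is a minimal sketch of a row access policy. It assumes a BigLake table named <code>learning.orders_biglake</code> with a <code>channel</code> column; the policy name and the group address are hypothetical placeholders.</p>
<pre><code class="language-sql">-- Sketch only: policy name and grantee are made-up placeholders.
-- Members of the group see only rows where channel = 'online'.
CREATE ROW ACCESS POLICY online_channel_only
ON `learning.orders_biglake`
GRANT TO ('group:analysts@example.com')
FILTER USING (channel = 'online');
</code></pre>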
<p>Through BigQuery Omni, BigLake tables can also read data stored in Amazon S3 and Azure Blob Storage.</p>
<h3>Metadata Caching in BigLake Tables: Faster Queries, Better Plans</h3>
<p>When querying external data, BigQuery needs to inspect the files first: what files exist, how they are partitioned, and what metadata they contain. With classic external tables, every query triggers a listing operation against the underlying storage.</p>
<p>If you have a small number of files, this is barely noticeable. If you have thousands or millions of files, especially Hive-partitioned data, it becomes painful.</p>
<p>BigLake tables unlock metadata caching. With it enabled, BigQuery skips the listing on every query and prunes files and partitions faster — avoiding reading unneeded files altogether.</p>
<p>For Parquet BigLake tables, metadata caching also collects table statistics, which helps the optimizer produce better query plans.</p>
<p>The cache has a staleness window you control, and you choose between automatic or manual refreshes (the manual option runs a stored procedure, useful if you want to make it event-driven).</p>
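<p>For the manual option, the refresh is a call to the <code>BQ.REFRESH_EXTERNAL_METADATA_CACHE</code> system procedure. A minimal sketch, assuming the table was created with <code>metadata_cache_mode = 'MANUAL'</code> and using a placeholder project name:</p>
<pre><code class="language-sql">-- Assumption: the table uses metadata_cache_mode = 'MANUAL'.
-- 'your-gcp-project' is a placeholder; use your own project and dataset.
CALL BQ.REFRESH_EXTERNAL_METADATA_CACHE('your-gcp-project.learning.orders_biglake');
</code></pre>
<p>An event-driven setup could run this statement from whatever job reacts to new files landing in the bucket.</p>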
<h3>Creating a BigLake Table: Step-by-Step SQL Example</h3>
<p>Say we have some sales data in Google Cloud Storage, as follows:</p>
<img src="https://cdn.hashnode.com/uploads/covers/641c1535429c76261884ecba/60486a93-b5b8-44ba-a579-8304f73dcd42.png" alt="" style="display:block;margin:0 auto" />

<p>It is Hive-partitioned by date, then by country.</p>
<img src="https://cdn.hashnode.com/uploads/covers/641c1535429c76261884ecba/d755dd31-88e4-48e8-a5f9-ac9c667d1970.png" alt="" style="display:block;margin:0 auto" />

<p>In order to create a BigLake table we need to do the following:</p>
<p>Create a BQ Connection</p>
<img src="https://cdn.hashnode.com/uploads/covers/641c1535429c76261884ecba/33f54513-24a6-499a-89e7-3ee758ace597.png" alt="" style="display:block;margin:0 auto" />

<img src="https://cdn.hashnode.com/uploads/covers/641c1535429c76261884ecba/af156ccc-042a-4608-97f5-e1cf619d0039.png" alt="" style="display:block;margin:0 auto" />

<img src="https://cdn.hashnode.com/uploads/covers/641c1535429c76261884ecba/c4db9544-0deb-4897-b43e-70949a7ed853.png" alt="" style="display:block;margin:0 auto" />

<p>Now, we need to grant this service account access to the GCS bucket</p>
<img src="https://cdn.hashnode.com/uploads/covers/641c1535429c76261884ecba/559ecabd-1711-47ca-8700-bc7de5376cfb.png" alt="" style="display:block;margin:0 auto" />

<p>We can now create the BigLake table:</p>
<pre><code class="language-sql">CREATE EXTERNAL TABLE `learning.orders_biglake`
(
  order_id STRING,
  customer_id INT64,
  channel STRING,
  amount NUMERIC,
  discount_amount NUMERIC,
  created_at TIMESTAMP
)
WITH PARTITION COLUMNS
(
  order_date DATE,
  country STRING
)
WITH CONNECTION `projects/your-gcp-project/locations/eu/connections/demo-biglake-connection`
OPTIONS (
  format = 'CSV',
  skip_leading_rows = 1,
  field_delimiter = ',',
  hive_partition_uri_prefix = 'gs://datawise-biglake-hive-demo-bucket/orders',
  uris = ['gs://datawise-biglake-hive-demo-bucket/orders/*'],
  max_staleness = INTERVAL 1 DAY,
  metadata_cache_mode = 'AUTOMATIC'
);
</code></pre>
<p>A few things to note:</p>
<ul>
<li><p>max_staleness: how old the cached metadata can be before BigQuery goes back to storage. The documented range is 30 minutes to 7 days</p>
</li>
<li><p>metadata_cache_mode — AUTOMATIC refreshes on its own; MANUAL lets you trigger it via stored procedure, useful for event-driven pipelines</p>
</li>
<li><p>consider adding require_hive_partition_filter (the external-table counterpart of require_partition_filter) to force callers to filter on a partition key and avoid full file scans</p>
</li>
</ul>
<p>The table is now created and can be queried like any other BigQuery table.</p>
<img src="https://cdn.hashnode.com/uploads/covers/641c1535429c76261884ecba/ed0dcfb0-6de1-4996-ac7f-a6793414b042.png" alt="" style="display:block;margin:0 auto" />
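<p>A query that filters on the partition columns lets BigQuery prune files instead of scanning the whole prefix. A small sketch against the table defined above; the date and country values are illustrative and may not exist in your data:</p>
<pre><code class="language-sql">-- Filtering on order_date and country (the Hive partition columns)
-- limits the files BigQuery has to read.
SELECT
  channel,
  COUNT(*) AS orders,
  SUM(amount - discount_amount) AS net_amount
FROM `learning.orders_biglake`
WHERE order_date = DATE '2026-04-01'  -- illustrative value
  AND country = 'DE'                  -- illustrative value
GROUP BY channel;
</code></pre>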

<h3><strong>Before you go</strong></h3>
<p>A couple of things worth knowing:</p>
<ul>
<li><p>an existing classic external table can be upgraded to a BigLake table without recreating it from scratch</p>
</li>
<li><p>metadata cache refreshes incur processing costs — worth keeping in mind if you're dealing with a large number of files</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[BigQuery Saves Your Query Results — Here's How to Find Them]]></title><description><![CDATA[Ever run a heavy BigQuery SQL query, processed gigabytes of data — and then accidentally closed the tab or forgot to save the results? 😬
Don't re-run it. Your results are still there.
BigQuery automa]]></description><link>https://datawise.dev/bigquery-saves-your-query-results-here-s-how-to-find-them</link><guid isPermaLink="true">https://datawise.dev/bigquery-saves-your-query-results-here-s-how-to-find-them</guid><dc:creator><![CDATA[Constantin Lungu]]></dc:creator><pubDate>Thu, 05 Mar 2026 07:34:36 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/641c1535429c76261884ecba/568372e9-ef3d-486c-af8c-3112a8601ec3.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Ever run a heavy BigQuery SQL query, processed gigabytes of data — and then accidentally closed the tab or forgot to save the results? 😬</p>
<p><strong>Don't re-run it. Your results are still there.</strong></p>
<p>BigQuery automatically saves query results to an anonymous temporary table for <strong>24 hours</strong>. You can find it under <strong>Job Information → Temporary Table</strong>.</p>
<p>This is also the engine behind BigQuery's caching behavior: if you run the exact same query again within that window — with no changes — BigQuery serves results directly from cache, <strong>at no charge</strong>.</p>
<p><strong>One caveat worth knowing:</strong> if your result set exceeds <strong>10 GB</strong> (that's the output, not the data scanned), it won't be cached. So for very large result sets, you'll want to write results to a permanent table explicitly.</p>
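<p>If you prefer SQL over clicking through the job history, something along these lines should surface the temporary tables of your recent queries. This is a sketch: it assumes your jobs run in the <code>EU</code> multi-region, so adjust the region qualifier to match yours.</p>
<pre><code class="language-sql">-- Recent query jobs of the current user, with their (temporary) destination tables.
-- Assumption: jobs ran in the EU multi-region; change `region-eu` as needed.
SELECT
  job_id,
  creation_time,
  cache_hit,
  total_bytes_processed,
  destination_table.dataset_id AS result_dataset,
  destination_table.table_id AS result_table
FROM `region-eu`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE job_type = 'QUERY'
  AND creation_time &gt;= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY creation_time DESC;
</code></pre>
<p>For results you want to keep beyond the 24 hours, wrapping the query in a <code>CREATE TABLE ... AS SELECT</code> writes them to a permanent table instead.</p>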
<img src="https://cdn.hashnode.com/uploads/covers/641c1535429c76261884ecba/6951dc9e-636e-4687-bc09-c9a1fe5f59d4.png" alt="" style="display:block;margin:0 auto" />]]></content:encoded></item><item><title><![CDATA[Parameters in BigQuery]]></title><description><![CDATA[You can use query parameters in BigQuery hashtag#SQL (now in the console as well!) — but how are they different from variables, and when should you use each?
Both parameters and variables act as place]]></description><link>https://datawise.dev/parameters-in-bigquery</link><guid isPermaLink="true">https://datawise.dev/parameters-in-bigquery</guid><dc:creator><![CDATA[Constantin Lungu]]></dc:creator><pubDate>Sat, 07 Feb 2026 12:37:25 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/vzbXoo0CuAs/upload/1df97efa0e7dba248f0b2e697e4fe6bf.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You can use query parameters in BigQuery SQL (now in the console as well!). But how are they different from variables, and when should you use each?</p>
<p>Both parameters and variables act as placeholders and have a defined data type. The difference is where their value comes from and how they’re used.</p>
<p>Parameters (like @corpus)<br />👉 Are not computed inside the query<br />👉 Are passed from the outside (Python, UI, API, etc.)</p>
<p>Variables (DECLARE, SET)<br />👉 Are defined and computed inside a SQL script or stored procedure<br />👉 Let you store a value and reuse it later in the same script</p>
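<p>To make the contrast concrete, here is a small sketch using the public <code>bigquery-public-data.samples.shakespeare</code> table. In the first query the value of <code>@corpus</code> has to be supplied by the caller (client library, API, or the console's parameter settings):</p>
<pre><code class="language-sql">-- Parameter: @corpus is passed in from outside the query.
SELECT word, word_count
FROM `bigquery-public-data.samples.shakespeare`
WHERE corpus = @corpus
ORDER BY word_count DESC
LIMIT 10;
</code></pre>
<p>In the second one, the value is computed inside the script itself. The variable name is my own choice for illustration:</p>
<pre><code class="language-sql">-- Variable: declared and computed inside the script, then reused.
DECLARE largest_corpus STRING;

SET largest_corpus = (
  SELECT corpus
  FROM `bigquery-public-data.samples.shakespeare`
  GROUP BY corpus
  ORDER BY SUM(word_count) DESC
  LIMIT 1
);

SELECT word, word_count
FROM `bigquery-public-data.samples.shakespeare`
WHERE corpus = largest_corpus
ORDER BY word_count DESC
LIMIT 10;
</code></pre>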
<p>So what’s the real difference?<br />➡️ Variables are essential for Dynamic SQL (EXECUTE IMMEDIATE)<br />➡️ Parameters can filter data, but cannot control identifiers (e.g. table or column names)</p>
<p>🚨 Security<br />When values come from user input or external sources, parameters are the safer choice—they reduce the risk of SQL injection.</p>
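<p>The two can also be combined in dynamic SQL. In this sketch the identifier (the table name) is spliced in with <code>FORMAT</code>, while the filter value is still bound through <code>USING</code> rather than concatenated into the string. Variable names and the sample value are illustrative:</p>
<pre><code class="language-sql">-- Identifier comes from a variable; the filter value is bound, not concatenated.
DECLARE source_table STRING DEFAULT 'bigquery-public-data.samples.shakespeare';
DECLARE corpus_filter STRING DEFAULT 'hamlet';

EXECUTE IMMEDIATE FORMAT("""
  SELECT word, word_count
  FROM `%s`
  WHERE corpus = @corpus
  ORDER BY word_count DESC
  LIMIT 10
""", source_table)
USING corpus_filter AS corpus;
</code></pre>
<p>Only splice in identifiers that you control (for instance, from an allow-list), since <code>FORMAT</code> does no escaping for you.</p>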
<p>🚅 Performance<br />Parameters may allow the optimizer to reuse execution plans, while variables can sometimes prevent that.</p>
<img src="https://cdn.hashnode.com/uploads/covers/641c1535429c76261884ecba/587b5764-27e3-4ae9-b1b9-7d9491d2663e.png" alt="" style="display:block;margin:0 auto" />

<p><em>Found it useful? Check out my Analytics newsletter at</em> <a href="https://notjustsql.com"><em>notjustsql.com</em></a><em>.</em></p>
]]></content:encoded></item><item><title><![CDATA[Flattening JSON arrays in BigQuery]]></title><description><![CDATA[I've noticed that a new JSON function has been added (in Preview) in BigQuery SQL - JSON_FLATTEN().
It allows us to flatten JSON arrays and return a single flat ARRAY, no matter how many nested levels there are.
So where is this actually useful?➡️ Ha...]]></description><link>https://datawise.dev/flattening-json-arrays-in-bigquery</link><guid isPermaLink="true">https://datawise.dev/flattening-json-arrays-in-bigquery</guid><dc:creator><![CDATA[Constantin Lungu]]></dc:creator><pubDate>Sun, 07 Dec 2025 12:57:58 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/19tQv51x4-A/upload/cb5db8fd2460224984ae2de7ca2f1bf8.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I've noticed that a new JSON function has been added (in Preview) in BigQuery SQL - JSON_FLATTEN().</p>
<p>It allows us to flatten JSON arrays and return a single flat ARRAY, no matter how many nested levels there are.</p>
<p>So where is this actually useful?<br />➡️ Handling heterogeneous JSON where the nesting depth isn’t consistent<br />➡️ Cleaning up malformed or jagged arrays<br />➡️ Normalizing data before UNNEST so you don’t get arrays of arrays</p>
<p>Where would I not use it?</p>
<p>👉 Don’t use it when the hierarchy matters. Flattening removes structural context, so you lose information about where an element came from.</p>
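<p>A minimal sketch of the idea, assuming the Preview function behaves as described above and returns an array of JSON values. The input literal is made up for illustration:</p>
<pre><code class="language-sql">-- Nested arrays of inconsistent depth go in, a single flat array comes out.
SELECT
  JSON_FLATTEN(JSON '[[1, 2], [3, [4, 5]], 6]') AS flat_values;

-- Flattening first means UNNEST yields one row per scalar element.
SELECT element
FROM UNNEST(JSON_FLATTEN(JSON '[[1, 2], [3, [4, 5]], 6]')) AS element;
</code></pre>
<p>Since the function is in Preview, its behavior may still change, so check the current docs before relying on it.</p>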
<p><img src="https://miro.medium.com/v2/resize:fit:1236/0*1UoE8j3SpwNK3K-V" alt /></p>
<p><em>Found it useful? Check out my Analytics newsletter at</em> <a target="_blank" href="https://notjustsql.com"><em>notjustsql.com</em></a><em>.</em></p>
<hr />
<p><em>Enjoyed this? Here are some related articles you might find useful:</em></p>
<ul>
<li><a target="_blank" href="https://datawise.dev/the-json-datatype-in-bigquery">The JSON datatype in BigQuery</a></li>
<li><a target="_blank" href="https://datawise.dev/json-datatype-vs-json-like-string-in-bigquery">JSON datatype vs JSON-like STRING in BigQuery</a></li>
<li><a target="_blank" href="https://datawise.dev/extracting-keys-from-json-in-bigquery">Extracting keys from JSON in BigQuery</a></li>
<li><a target="_blank" href="https://datawise.dev/lax-json-conversion-functions-in-bigquery">LAX JSON conversion functions in BigQuery</a></li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Using LAST_VALUE with STRUCTS]]></title><description><![CDATA[Even an “empty” STRUCT is still technically something. Not the same as a standalone NULL value.
This is why, if you work with STRUCTs in SQL and try to find the latest non-empty struct using LAST_VALUE(...) IGNORE NULLS, you’ll notice it doesn’t help...]]></description><link>https://datawise.dev/using-lastvalue-with-structs</link><guid isPermaLink="true">https://datawise.dev/using-lastvalue-with-structs</guid><dc:creator><![CDATA[Constantin Lungu]]></dc:creator><pubDate>Sun, 05 Oct 2025 16:27:43 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/cWMhxNmQVq0/upload/eba4b6c159aeeab4c6938f0727e92a2a.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Even an “empty” STRUCT is still technically something. Not the same as a standalone NULL value.</p>
<p>This is why, if you work with STRUCTs in SQL and try to find the latest non-empty struct using LAST_VALUE(...) IGNORE NULLS, you’ll notice it doesn’t help — because the struct, even when all fields are null, is still considered non-null.</p>
<p>LAST_VALUE only skips rows where the entire expression itself is NULL.</p>
<p>To fix this, we can adjust the logic in one of the following ways:<br />➡️ Setting the value to NULL when all fields are NULL<br />➡️ Using TO_JSON_STRING + NULLIF to treat such entries as “null”<br />➡️ Using REGEXP_CONTAINS (thanks ChatGPT) for more dynamic checks</p>
<p>Alternatively, we can just apply LAST_VALUE separately to each individual field in the struct.</p>
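<p>Here is a sketch of the first option (turning an all-NULL struct into a real NULL before the window function sees it). The sample data and column names are invented for the example:</p>
<pre><code class="language-sql">WITH events AS (
  SELECT 1 AS id, 1 AS seq, STRUCT('a' AS code, 10 AS qty) AS payload UNION ALL
  SELECT 1, 2, STRUCT(CAST(NULL AS STRING) AS code, CAST(NULL AS INT64) AS qty) UNION ALL
  SELECT 1, 3, STRUCT(CAST(NULL AS STRING) AS code, CAST(NULL AS INT64) AS qty)
)
SELECT
  id,
  seq,
  -- An "empty" struct is not NULL, so map it to NULL explicitly;
  -- only then can IGNORE NULLS skip it.
  LAST_VALUE(
    IF(payload.code IS NULL AND payload.qty IS NULL, NULL, payload) IGNORE NULLS
  ) OVER (
    PARTITION BY id
    ORDER BY seq
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS last_non_empty_payload
FROM events
ORDER BY id, seq;
</code></pre>
<p>With many fields, listing every one in the <code>IF</code> gets tedious, which is where the TO_JSON_STRING-based variants become attractive.</p>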
<p>If you're new to STRUCTs, see <a target="_blank" href="https://datawise.dev/understanding-structs-in-bigquery">one of my previous posts</a>.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:700/0*pIPD5kXi-zhsJzwD" alt /></p>
<p><em>Found it useful? Check out my Analytics newsletter at</em> <a target="_blank" href="https://notjustsql.com"><em>notjustsql.com</em></a><em>.</em></p>
<hr />
<p><em>Enjoyed this? Here are some related articles you might find useful:</em></p>
<ul>
<li><a target="_blank" href="https://datawise.dev/beware-of-rownumber-without-order-by">Beware of ROW_NUMBER without ORDER BY</a></li>
<li><a target="_blank" href="https://datawise.dev/tidying-up-window-functions-in-bigquery-with-named-windows">Tidying up WINDOW functions in BigQuery with named windows</a></li>
<li><a target="_blank" href="https://datawise.dev/using-range-in-window-functions-in-bigquery">Using RANGE in Window Functions in BigQuery</a></li>
<li><a target="_blank" href="https://datawise.dev/computing-a-cumulative-sum-in-bigquery">Computing a cumulative sum in BigQuery</a></li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Table grain quick validation with SQL]]></title><description><![CDATA[I was doing some exploratory data analysis on a number of tables I didn’t have much information about and, unfortunately, didn’t know their grain.
I needed a quick way to validate my assumptions about the table grain, identify contradicting observati...]]></description><link>https://datawise.dev/table-grain-quick-validation-with-sql</link><guid isPermaLink="true">https://datawise.dev/table-grain-quick-validation-with-sql</guid><dc:creator><![CDATA[Constantin Lungu]]></dc:creator><pubDate>Sat, 04 Oct 2025 12:22:56 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/pcW5bR7gSJ4/upload/1348e2cccae0b01aab2ddf59f016dd1a.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I was doing some exploratory data analysis on a number of tables I didn’t have much information about and, unfortunately, didn’t know their grain.</p>
<p>I needed a quick way to validate my assumptions about the table grain, identify contradicting observations (rows), and check for duplicates at the same time.</p>
<p>Therefore I decided to use a combination of TO_JSON_STRING and FARM_FINGERPRINT. The first creates a JSON representation of the entire row (given a table alias), while the second converts the resulting string into an INT64 hash.</p>
<p>By comparing the total number of rows in a group against the distinct count of these fingerprints, we can determine whether the proposed grain is correct and whether there are duplicates in the data.</p>
<p>This was a quick exercise but use this with care. Depending on your SQL implementation, data volumes and context, results may vary.</p>
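<p>A minimal sketch of the check, with inline sample data standing in for the real table. The proposed grain here is (customer_id, order_date); all names and values are illustrative:</p>
<pre><code class="language-sql">WITH orders AS (
  SELECT 1 AS customer_id, DATE '2025-01-01' AS order_date, 10 AS amount UNION ALL
  SELECT 1, DATE '2025-01-01', 10 UNION ALL  -- exact duplicate of the row above
  SELECT 2, DATE '2025-01-01', 20 UNION ALL
  SELECT 2, DATE '2025-01-01', 25            -- same key, different content
)
SELECT
  t.customer_id,
  t.order_date,
  COUNT(*) AS row_count,
  -- Hash of the whole row, so identical rows collapse to one fingerprint.
  COUNT(DISTINCT FARM_FINGERPRINT(TO_JSON_STRING(t))) AS distinct_rows
FROM orders AS t
GROUP BY t.customer_id, t.order_date
HAVING COUNT(*) &gt; 1;
</code></pre>
<p>Groups where <code>distinct_rows</code> is 1 are pure duplicates; groups where it is greater than 1 contradict the assumed grain. No rows returned means the grain holds.</p>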
<p><img src="https://miro.medium.com/v2/resize:fit:700/0*r6BJhPYZrb7ivJg5" alt /></p>
<p><em>Found it useful? Subscribe to my Analytics newsletter at</em> <a target="_blank" href="http://notjustsql.com"><em>notjustsql.com</em></a><em>.</em></p>
<hr />
<p><em>Enjoyed this? Here are some related articles you might find useful:</em></p>
<ul>
<li><a target="_blank" href="https://datawise.dev/9-tips-on-writing-cleaner-sql">9 tips on writing cleaner SQL</a></li>
<li><a target="_blank" href="https://datawise.dev/order-of-precedence-in-sql-where-vs-having">Order of precedence in SQL: WHERE vs HAVING</a></li>
<li><a target="_blank" href="https://datawise.dev/easy-with-that-select-distinct">Easy with that SELECT DISTINCT!</a></li>
<li><a target="_blank" href="https://datawise.dev/why-you-should-use-parentheses-with-and-or-in-sql">Why you should use parentheses with AND &amp; OR in SQL</a></li>
</ul>
]]></content:encoded></item><item><title><![CDATA[WITH expressions in BigQuery]]></title><description><![CDATA[So I recently discovered the WITH expression in BigQuery SQL.
Not to be confused with the WITH clause, which we use to define common table expressions (CTEs).
👉 What does a WITH expression do?It lets you define a series of variables, scoped to a sin...]]></description><link>https://datawise.dev/with-expressions-in-bigquery</link><guid isPermaLink="true">https://datawise.dev/with-expressions-in-bigquery</guid><dc:creator><![CDATA[Constantin Lungu]]></dc:creator><pubDate>Tue, 30 Sep 2025 05:38:06 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/GkinCd2enIY/upload/d3a3e22243bdd1ba959be7269ef16ffb.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>So I recently discovered the WITH expression in BigQuery SQL.</p>
<p>Not to be confused with the WITH clause, which we use to define common table expressions (CTEs).</p>
<p>👉 What does a WITH expression do?<br />It lets you define a series of variables, scoped to a single expression. Each variable can reference previously defined ones (and table columns), and in the end, the whole expression returns a result.</p>
<p>📌 Where could this be useful?<br />Think back to the time before QUALIFY was supported. We often had to create an extra CTE just to filter with WHERE rn = 1 or for similar windowed calculations. When QUALIFY came, it saved us a bunch of boilerplate CTEs.<br />Well, WITH expressions have the potential to help in the same way — but for non-window calculations.</p>
<p>When working with complex formulas, you can’t reference (within the same SELECT) a column you just defined. The usual workaround is to push it into another CTE, which works but feels verbose. I still opted for it, since it's important that the code stays readable and maintainable.<br />Now WITH expressions give us a cleaner option and help avoid those 7-operand expressions. I, for one, plan on trying them out ASAP.</p>
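<p>A small sketch of what that can look like. The column names and the 20% VAT rate are assumptions made up for the example:</p>
<pre><code class="language-sql">WITH orders AS (
  SELECT 'o-1' AS order_id, NUMERIC '100' AS amount, NUMERIC '15' AS discount_amount
)
SELECT
  order_id,
  -- Each variable can use the ones defined before it;
  -- the last expression is what the WITH expression returns.
  WITH(
    net_amount AS amount - discount_amount,
    vat AS net_amount * NUMERIC '0.2',
    net_amount + vat
  ) AS gross_amount
FROM orders;
</code></pre>
<p>The intermediate names (<code>net_amount</code>, <code>vat</code>) exist only inside the expression, so they don't leak into the SELECT list.</p>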
<p><img src="https://miro.medium.com/v2/resize:fit:700/0*_xQceeRWtGCI3qkO" alt /></p>
<p>Has anyone here used them already? Any thoughts? Docs <a target="_blank" href="https://cloud.google.com/bigquery/docs/reference/standard-sql/operators#with_expression">here</a>.</p>
<p><em>Found it useful? Subscribe to my Analytics newsletter at</em> <a target="_blank" href="https://notjustsql.com/"><strong><em>notjustsql.com</em></strong></a><em>.</em></p>
]]></content:encoded></item><item><title><![CDATA[Beware of ROW_NUMBER without ORDER BY]]></title><description><![CDATA[Haven’t posted all summer, but this bug pulled me straight out of the shadows.
I recently faced a mystery that pushed me to the edge of despair.
It seemed like a simple issue at first glance. A report kept changing completely at random.
I spent a goo...]]></description><link>https://datawise.dev/beware-of-rownumber-without-order-by</link><guid isPermaLink="true">https://datawise.dev/beware-of-rownumber-without-order-by</guid><dc:creator><![CDATA[Constantin Lungu]]></dc:creator><pubDate>Tue, 30 Sep 2025 05:20:48 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/WLNdV3xC-fI/upload/bacd9dd2a036c9003d8f37000c8475a0.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Haven’t posted all summer, but this bug pulled me straight out of the shadows.</p>
<p>I recently faced a mystery that pushed me to the edge of despair.</p>
<p>It seemed like a simple issue at first glance. A report kept changing completely at random.</p>
<p>I spent a good few days chasing it across multiple weeks. Imagine juggling several tables with time travel, all joined together, trying to catch the bug.</p>
<p>I started to wonder if time travel even worked correctly. I wasn’t able to reproduce previous states, even when all the inputs had data from that exact point in time.</p>
<p>I began to question if I’d make it. The culprit?</p>
<p>Take this as a cautionary tale against using ROW_NUMBER() OVER(PARTITION BY …) without an accompanying ORDER BY.</p>
<p>Tucked into a table somewhere, it haunted me and wreaked havoc. I don’t know if there’s a real use case for it like that — but expect surprises.</p>
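<p>A sketch of the safer pattern: give the window an ORDER BY, and add a tie-breaker so the numbering stays stable even when the main sort key has duplicates. Table and column names are illustrative:</p>
<pre><code class="language-sql">WITH orders AS (
  SELECT 'o-1' AS order_id, 42 AS customer_id, TIMESTAMP '2025-09-01 10:00:00+00' AS created_at UNION ALL
  SELECT 'o-2', 42, TIMESTAMP '2025-09-01 10:00:00+00' UNION ALL  -- same timestamp as o-1
  SELECT 'o-3', 42, TIMESTAMP '2025-09-02 08:30:00+00'
)
SELECT
  order_id,
  customer_id,
  created_at,
  -- created_at alone would leave o-1 and o-2 tied; order_id breaks the tie.
  ROW_NUMBER() OVER (
    PARTITION BY customer_id
    ORDER BY created_at DESC, order_id DESC
  ) AS rn
FROM orders;
</code></pre>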
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759209508293/e382bdb5-55f0-4edd-a27f-0b9de55429f4.jpeg" alt class="image--center mx-auto" /></p>
<p><em>Found it useful? Subscribe to my Analytics newsletter at</em> <a target="_blank" href="https://notjustsql.com/"><strong><em>notjustsql.com</em></strong></a><em>.</em></p>
<hr />
<p><em>Enjoyed this? Here are some related articles you might find useful:</em></p>
<ul>
<li><a target="_blank" href="https://datawise.dev/tidying-up-window-functions-in-bigquery-with-named-windows">Tidying up WINDOW functions in BigQuery with named windows</a></li>
<li><a target="_blank" href="https://datawise.dev/using-range-in-window-functions-in-bigquery">Using RANGE in Window Functions in BigQuery</a></li>
<li><a target="_blank" href="https://datawise.dev/computing-a-cumulative-sum-in-bigquery">Computing a cumulative sum in BigQuery</a></li>
<li><a target="_blank" href="https://datawise.dev/rolling-period-calculation-in-bigquery">Rolling period calculation in BigQuery</a></li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Aggregating Multiple SCD-2 Attribute Timelines in BigQuery]]></title><description><![CDATA[Here’s another practical BigQuery SQL exercise 💡
Say you have an input SCD2-style table with [valid_from, valid_to) and key-value attributes. Now you want to determine which attributes were valid at the same time for a given grain (e.g. id).
To do t...]]></description><link>https://datawise.dev/aggregating-multiple-scd-2-attribute-timelines-in-bigquery</link><guid isPermaLink="true">https://datawise.dev/aggregating-multiple-scd-2-attribute-timelines-in-bigquery</guid><dc:creator><![CDATA[Constantin Lungu]]></dc:creator><pubDate>Wed, 11 Jun 2025 19:22:01 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/bnZ8_95Q8NE/upload/885d24ef5d81125139dbf105bc1a6b95.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Here’s another practical BigQuery SQL exercise 💡</p>
<p>Say you have an input SCD2-style table with [valid_from, valid_to) and key-value attributes. Now you want to determine which attributes were valid at the same time for a given grain (e.g. id).</p>
<p>To do this, we:</p>
<p>1️⃣ Build an anchor table of all change points (start and end dates), grouped by id.</p>
<p>2️⃣ Generate date ranges using LEAD() over the change points, so we know the next boundary.</p>
<p>3️⃣ Join back to the original table to find which rows were active within each [valid_from, valid_to) segment.</p>
<p>4️⃣ Aggregate the key-value pairs as ARRAY&lt;STRUCT&lt;key, value&gt;&gt; to preserve temporal context.</p>
<p>We can now see that, for example, for the period between [2023-01-05, 2023-01-08), for id = 2, B was true and A was false.</p>
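<p>The four steps above can be sketched roughly as follows. The sample rows are my own, chosen to reproduce the id = 2 example, and the column names (<code>attr_key</code>, <code>attr_value</code>) are assumptions:</p>
<pre><code class="language-sql">WITH scd AS (
  SELECT 2 AS id, 'A' AS attr_key, TRUE AS attr_value,
         DATE '2023-01-01' AS valid_from, DATE '2023-01-05' AS valid_to UNION ALL
  SELECT 2, 'A', FALSE, DATE '2023-01-05', DATE '2023-01-10' UNION ALL
  SELECT 2, 'B', TRUE,  DATE '2023-01-03', DATE '2023-01-08'
),
-- 1) All change points (starts and ends) per id
change_points AS (
  SELECT id, valid_from AS change_date FROM scd
  UNION DISTINCT
  SELECT id, valid_to FROM scd
),
-- 2) Turn consecutive change points into [segment_start, segment_end)
segments AS (
  SELECT
    id,
    change_date AS segment_start,
    LEAD(change_date) OVER (PARTITION BY id ORDER BY change_date) AS segment_end
  FROM change_points
)
-- 3) Join back to find rows covering each segment, 4) aggregate key-value pairs
SELECT
  s.id,
  s.segment_start,
  s.segment_end,
  ARRAY_AGG(STRUCT(t.attr_key, t.attr_value) ORDER BY t.attr_key) AS attributes
FROM segments AS s
JOIN scd AS t
  ON t.id = s.id
 AND t.valid_from &lt;= s.segment_start
 AND t.valid_to &gt;= s.segment_end
WHERE s.segment_end IS NOT NULL  -- the last change point opens no segment
GROUP BY s.id, s.segment_start, s.segment_end
ORDER BY s.id, s.segment_start;
</code></pre>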
<p><img src="https://miro.medium.com/v2/resize:fit:700/0*al6Q7c7B2GwZ3sZC" alt /></p>
<p><em>Found it useful? Subscribe to my Analytics newsletter at</em> <a target="_blank" href="https://notjustsql.com/"><strong><em>notjustsql.com</em></strong></a><em>.</em></p>
<hr />
<p><em>Enjoyed this? Here are some related articles you might find useful:</em></p>
<ul>
<li><a target="_blank" href="https://datawise.dev/practical-bigquery-joining-temporal-tables">Joining temporal tables in BigQuery</a></li>
<li><a target="_blank" href="https://datawise.dev/generating-a-compact-temporal-table-in-bigquery">Generating a compact temporal table in BigQuery</a></li>
<li><a target="_blank" href="https://datawise.dev/compacting-date-intervals-in-bigquery">Compacting date intervals in BigQuery</a></li>
<li><a target="_blank" href="https://datawise.dev/transforming-cumulative-sums-into-monthly-values">Transforming cumulative sums into monthly values</a></li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Why you should think twice before UNNESTing arrays or date intervals]]></title><description><![CDATA[If you ever work with terabyte-scale data, try to avoid unnecessary unnesting/unpacking arrays and date ranges. If you have no choice, materialize the unnested result and partition and cluster it accordingly in preparation for any further joins.
With...]]></description><link>https://datawise.dev/why-you-should-think-twice-before-unnesting-arrays-or-date-intervals</link><guid isPermaLink="true">https://datawise.dev/why-you-should-think-twice-before-unnesting-arrays-or-date-intervals</guid><dc:creator><![CDATA[Constantin Lungu]]></dc:creator><pubDate>Mon, 09 Jun 2025 21:17:43 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1749669518998/e83a7529-3787-4993-a114-c30602ad09d0.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you ever work with terabyte-scale data, try to avoid unnecessary unnesting/unpacking arrays and date ranges. If you have no choice, materialize the unnested result and partition and cluster it accordingly in preparation for any further joins.</p>
<p>With that bold statement out there, here’s a BigQuery lesson drawn from real-world experience.</p>
<p>Imagine an SCD-2 table with tens of millions of rows containing product data.</p>
<p>For particular subintervals (even individual days) of the [valid_from, valid_to) period, we need to set some attributes (think: marking that the product is on sale, or is getting ready to be discontinued).</p>
<p>The simple and straightforward solution here would be to unpack the intervals with something like UNNEST(GENERATE_DATE_ARRAY(valid_from, valid_to)). Joining with a date dimension would work in a similar fashion.</p>
<p>This means that it is easy to now look up these attributes at (grain + valid_date).</p>
<p>But now we could be talking about several billion rows. Which you might join with other big tables.</p>
<p>The issues with this approach are quite obvious. Say we have a two-year interval. Unpacking this period would mean having 730 rows instead of one, while a majority of them would just be repeated data — there’s no actual change, just some spot changes for individual days or subintervals.</p>
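<p>A quick way to see the blow-up for a single row, using an illustrative two-year validity period:</p>
<pre><code class="language-sql">-- One SCD-2 row valid for [2021-01-01, 2023-01-01) becomes 730 daily rows when unpacked.
SELECT
  ARRAY_LENGTH(GENERATE_DATE_ARRAY(DATE '2021-01-01', DATE '2022-12-31')) AS unpacked_rows;
</code></pre>
<p>Multiply that by tens of millions of source rows and the billions mentioned above follow directly.</p>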
<p>We can of course go ahead and compact this data back using an algorithm like the one I presented <a target="_blank" href="https://datawise.dev/compacting-date-intervals-in-bigquery">in one of my previous posts</a>, but regardless of whether we do that or not, the whole processing would be incredibly expensive and inefficient.</p>
<h2 id="heading-a-better-approach-work-with-intervals">A better approach: work with intervals</h2>
<p>So, is there a solution here? Sometimes there might be.</p>
<p>Say for the two-year interval [2021-01-01, 2023-01-01), we’d need to mark the subperiod [2021-04-15, 2021-05-15) as the period when the product is on sale.</p>
<p>This data can be accurately expressed as the following intervals:</p>
<p>[2021-01-01 → 2021-04-15),</p>
<p>[2021-04-15 → 2021-05-15),</p>
<p>[2021-05-15 → 2023-01-01).</p>
<p>So in fact, we’re splitting the intervals into smaller subintervals based on what was true at each moment. Instead of unpacking and getting hundreds of records, we need only 3. I’ve illustrated this approach <a target="_blank" href="https://datawise.dev/practical-bigquery-joining-temporal-tables">in another previous post</a>.</p>
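<p>A minimal sketch of that split, assuming half-open intervals and one sale period per product. Table and column names are made up for the example:</p>
<pre><code class="language-sql">WITH product AS (
  SELECT 1 AS product_id, DATE '2021-01-01' AS valid_from, DATE '2023-01-01' AS valid_to
),
sale AS (
  SELECT 1 AS product_id, DATE '2021-04-15' AS sale_from, DATE '2021-05-15' AS sale_to
),
-- Every date at which something starts or stops being true
boundaries AS (
  SELECT product_id, valid_from AS boundary FROM product
  UNION DISTINCT SELECT product_id, valid_to FROM product
  UNION DISTINCT SELECT product_id, sale_from FROM sale
  UNION DISTINCT SELECT product_id, sale_to FROM sale
),
segments AS (
  SELECT
    product_id,
    boundary AS valid_from,
    LEAD(boundary) OVER (PARTITION BY product_id ORDER BY boundary) AS valid_to
  FROM boundaries
)
SELECT
  seg.product_id,
  seg.valid_from,
  seg.valid_to,
  s.product_id IS NOT NULL AS is_on_sale
FROM segments AS seg
LEFT JOIN sale AS s
  ON s.product_id = seg.product_id
 AND s.sale_from &lt;= seg.valid_from
 AND s.sale_to &gt;= seg.valid_to
WHERE seg.valid_to IS NOT NULL  -- the last boundary opens no segment
ORDER BY seg.valid_from;
</code></pre>
<p>This yields the three subintervals listed above, with only the middle one flagged as on sale.</p>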
<p>Such a difference in cardinality has deep performance implications, allowing us to process data way more efficiently.</p>
<p>While this might be an unusual corner case, my initial opinion stands — if you can get away without UNNESTing your array or unpacking a date interval, you’ll manage to keep cardinality in check and save a lot of processing power.</p>
<p>If there’s no way around it for your particular case, at the very least, materialize the result (I know, it’s going to be a lot of storage) and tailor the partitioning and clustering according to the querying (especially joining with other big+ tables) downstream.</p>
<hr />
<p><em>Enjoyed this? Here are some related articles you might find useful:</em></p>
<ul>
<li><a target="_blank" href="https://datawise.dev/generating-a-compact-temporal-table-in-bigquery">Generating a compact temporal table in BigQuery</a></li>
<li><a target="_blank" href="https://datawise.dev/aggregating-multiple-scd-2-attribute-timelines-in-bigquery">Aggregating Multiple SCD-2 Attribute Timelines in BigQuery</a></li>
<li><a target="_blank" href="https://datawise.dev/using-array-agg-in-bigquery">Using ARRAY_AGG in BigQuery</a></li>
<li><a target="_blank" href="https://datawise.dev/short-almost-non-technical-guide-to-sql-query-tuning-as-a-data-engineer">Short, almost non-technical guide to SQL query tuning as a Data Engineer</a></li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Cross-dataset foreign key relationships in BigQuery]]></title><description><![CDATA[It turns out you can now (don't know since when though) create cross-dataset foreign key relationships in BigQuery hashtag#SQL. Previously this was only possible for tables that are in the same dataset (but there were workarounds).
While the performa...]]></description><link>https://datawise.dev/cross-dataset-foreign-key-relationships-in-bigquery</link><guid isPermaLink="true">https://datawise.dev/cross-dataset-foreign-key-relationships-in-bigquery</guid><dc:creator><![CDATA[Constantin Lungu]]></dc:creator><pubDate>Mon, 09 Jun 2025 21:12:58 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/Rm3nWQiDTzg/upload/1af0e7259e348bd4243faad29e2dd788.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It turns out you can now (I don't know since when) create cross-dataset foreign key relationships in BigQuery SQL. Previously this was only possible for tables in the same dataset (but there were workarounds).</p>
<p>While the performance gain when using these <em>unenforced</em> PK/FK constraints in general may be up for discussion, it's definitely nice to be able to see this table metadata there, including the table grain 👍</p>
<p>For a refresher on what these constraints are, see <a target="_blank" href="https://datawise.dev/bigquery-primary-key-foreign-key-constraints">my previous post</a>.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:700/0*6jgUdm3tK8Kmdvam" alt /></p>
<p><em>Found it useful? Subscribe to my Analytics newsletter at</em> <a target="_blank" href="https://notjustsql.com/"><strong><em>notjustsql.com</em></strong></a><em>.</em></p>
<hr />
<p><em>Enjoyed this? Here are some related articles you might find useful:</em></p>
<ul>
<li><a target="_blank" href="https://datawise.dev/row-level-access-security-in-bigquery">Row-level access security in BigQuery</a></li>
<li><a target="_blank" href="https://datawise.dev/using-subqueries-with-row-level-security-in-bigquery">Using subqueries with Row Level Security in BigQuery</a></li>
<li><a target="_blank" href="https://datawise.dev/why-basic-roles-in-bigquery-are-a-bad-idea">Why basic roles in BigQuery are a bad idea</a></li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Compacting date intervals in BigQuery]]></title><description><![CDATA[Here's a practical BigQuery SQL exercise that highlights some important concepts as well is an interesting algorithm imho. I've pair programmed this with LLMs, if that's a thing 😎
Problem statement: compacting a SCD-2 table, essentially finding inte...]]></description><link>https://datawise.dev/compacting-date-intervals-in-bigquery</link><guid isPermaLink="true">https://datawise.dev/compacting-date-intervals-in-bigquery</guid><dc:creator><![CDATA[Constantin Lungu]]></dc:creator><pubDate>Mon, 09 Jun 2025 20:38:02 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/3cTpEK08lwg/upload/a0447ed619c674a9ef0250a0242bdc10.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Here's a practical BigQuery SQL exercise that highlights some important concepts as well as an interesting algorithm, imho. I've pair programmed this with LLMs, if that's a thing 😎</p>
<p>Problem statement: compacting an SCD-2 table, essentially finding intervals that can be safely merged, turning two adjacent intervals with the same data into a single, bigger interval.</p>
<p>This particular input data guarantees these intervals cannot overlap (at the same grain), but there can be gaps. We're also talking about [left-inclusive, right-exclusive) intervals.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:700/0*xL2thW2sDi8R1WPf" alt /></p>
<p>Here's a breakdown of how it all works:</p>
<p>1️⃣ we start by computing a hash of all the columns of interest, excluding the grain (in my example: flag_a, flag_b)<br />2️⃣ then we use LAG() over the grain window to detect whether the current row starts right after the previous one and whether the hashes (so the 'payload' of the two rows) match<br />3️⃣ we mark the start of a new "segment" when either:<br />- attributes have changed (flags differ), so the hashes are different<br />- intervals are not adjacent (there are gaps)</p>
<p>4️⃣ use a cumulative SUM() over the grain window to group rows into segment IDs</p>
<p>5️⃣ collapse each segment using MIN(valid_from) and MAX(valid_to)</p>
<p>We can now see that, in our example, several intervals were merged into bigger ones.</p>
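<p>For readers who think more easily in procedural code, the same steps can be sketched in plain Python. This is an illustration of the logic rather than the BigQuery query itself: the tuple layout and the name <code>compact</code> are assumptions, and the payload is compared directly instead of through a hash.</p>

```python
from datetime import date
from itertools import groupby

def compact(rows):
    """Merge adjacent [valid_from, valid_to) rows sharing the same key and payload.

    Each row is a (key, valid_from, valid_to, payload) tuple; this layout is hypothetical.
    """
    merged = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for key, group in groupby(ordered, key=lambda r: r[0]):
        segment = None  # the segment being built; stands in for the LAG() lookback
        for _, valid_from, valid_to, payload in group:
            starts_new_segment = (
                segment is None
                or segment["payload"] != payload      # attributes changed
                or segment["valid_to"] != valid_from  # gap between intervals
            )
            if starts_new_segment:
                segment = {"key": key, "valid_from": valid_from,
                           "valid_to": valid_to, "payload": payload}
                merged.append(segment)
            else:
                # Same segment: extend it, the equivalent of MAX(valid_to).
                segment["valid_to"] = valid_to
    return merged

rows = [
    ("A", date(2024, 1, 1), date(2024, 2, 1), (True, False)),
    ("A", date(2024, 2, 1), date(2024, 3, 1), (True, False)),   # adjacent, same flags
    ("A", date(2024, 3, 1), date(2024, 4, 1), (False, False)),  # flags changed
    ("A", date(2024, 5, 1), date(2024, 6, 1), (False, False)),  # gap before this row
]
compacted = compact(rows)
```

<p>In this made-up input, only the first two rows are merged: the third has different flags and the fourth follows a gap, so each of them starts its own segment.</p>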
<hr />
<p><em>Enjoyed this? Here are some related articles you might find useful:</em></p>
<ul>
<li><a target="_blank" href="https://datawise.dev/practical-bigquery-joining-temporal-tables">Joining temporal tables in BigQuery</a></li>
<li><a target="_blank" href="https://datawise.dev/generating-a-compact-temporal-table-in-bigquery">Generating a compact temporal table in BigQuery</a></li>
<li><a target="_blank" href="https://datawise.dev/aggregating-multiple-scd-2-attribute-timelines-in-bigquery">Aggregating Multiple SCD-2 Attribute Timelines in BigQuery</a></li>
<li><a target="_blank" href="https://datawise.dev/transforming-cumulative-sums-into-monthly-values">Transforming cumulative sums into monthly values</a></li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Null-safe comparison: IS DISTINCT/NOT DISTINCT FROM]]></title><description><![CDATA[I've been working for surprisingly long with SQL to have found this only a few days ago. Not long enough I guess 🤓.
I'm talking about the NULL-safe operators IS DISTINCT FROM and IS NOT DISTINCT FROM. I found about their existence from a Linkedin po...]]></description><link>https://datawise.dev/null-safe-comparison-is-distinctnot-distinct-from</link><guid isPermaLink="true">https://datawise.dev/null-safe-comparison-is-distinctnot-distinct-from</guid><dc:creator><![CDATA[Constantin Lungu]]></dc:creator><pubDate>Tue, 06 May 2025 07:32:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/MJAoiige14E/upload/55425a149a5b1659f0fd86d54b53737e.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I've been working with SQL for a surprisingly long time to have discovered this only a few days ago. Not long enough, I guess 🤓.</p>
<p>I'm talking about the NULL-safe operators IS DISTINCT FROM and IS NOT DISTINCT FROM. I found out about their existence from a <a target="_blank" href="https://www.linkedin.com/posts/sebastian-flak_the-sql-comparison-operator-you-should-activity-7322288997945212928-3hOZ?utm_source=share&amp;utm_medium=member_desktop&amp;rcm=ACoAAAvrnvABKPsQ1CE0m9jhBpQ-Vr-YZbN9dqg">LinkedIn post</a>.</p>
<p>They work in BigQuery too, so there's less need to wrap comparisons in IFNULL / COALESCE for safety.</p>
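<p>If the difference from a plain <code>=</code> is not obvious, this small Python sketch mimics the semantics, with <code>None</code> standing in for NULL. The helper names are invented for illustration and are not part of any SQL engine or library.</p>

```python
def sql_equals(a, b):
    """Mimic SQL's `a = b`: the result is unknown (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a == b

def is_not_distinct_from(a, b):
    """Mimic `a IS NOT DISTINCT FROM b`: always True or False; two NULLs compare equal."""
    if a is None or b is None:
        return a is None and b is None
    return a == b

def is_distinct_from(a, b):
    """Mimic `a IS DISTINCT FROM b`: the negation of the null-safe equality."""
    return not is_not_distinct_from(a, b)

for a, b in [(1, 1), (1, 2), (1, None), (None, None)]:
    print(a, b, sql_equals(a, b), is_not_distinct_from(a, b))
```

<p>The rows with a <code>None</code> are the interesting ones: the plain comparison yields an unknown result, while the null-safe version always gives a definite True or False.</p>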
<p><img src="https://miro.medium.com/v2/resize:fit:700/0*Wb4zGkbHhG9oEoVC" alt /></p>
<p>PS This choice of keyword "FROM", together with the one in EXTRACT(HOUR FROM DATETIME '2021-01-01'), feels pretty weird.</p>
<hr />
<p><em>Enjoyed this? Here are some related articles you might find useful:</em></p>
<ul>
<li><a target="_blank" href="https://datawise.dev/a-couple-of-fun-things-about-null-in-sql">A couple of fun things about NULL in SQL</a></li>
<li><a target="_blank" href="https://datawise.dev/not-all-nulls-are-the-same">Not all NULLS are the same</a></li>
<li><a target="_blank" href="https://datawise.dev/coalesce-vs-ifnull-vs-nullif-in-bigquery">COALESCE vs IFNULL vs NULLIF in BigQuery</a></li>
<li><a target="_blank" href="https://datawise.dev/controlling-ordering-of-null-values-in-the-order-by-clause">Controlling ordering of NULL values in the ORDER BY clause</a></li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Transforming cumulative sums into monthly values]]></title><description><![CDATA[Here’s a quick BigQuery SQL exercise. I often work with cumulative aggregations, but it’s not every day that I need to reverse them—converting cumulative values back into monthly figures.
Let's look at an example.
The dataset provides cumulative sale...]]></description><link>https://datawise.dev/transforming-cumulative-sums-into-monthly-values</link><guid isPermaLink="true">https://datawise.dev/transforming-cumulative-sums-into-monthly-values</guid><dc:creator><![CDATA[Constantin Lungu]]></dc:creator><pubDate>Thu, 27 Mar 2025 07:34:24 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/eDPAGdaJ-GQ/upload/017e3c48422f8e65dbed45f8e6bfe784.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Here’s a quick BigQuery SQL exercise. I often work with cumulative aggregations, but it’s not every day that I need to reverse them—converting cumulative values back into monthly figures.</p>
<p>Let's look at an example.</p>
<p>The dataset provides cumulative sales per fiscal year (July 1st - June 30th in this case). Our goal is to determine the actual sales for each month.</p>
<p>How do we do it?</p>
<ol>
<li><p>Identify the fiscal year each period belongs to. We can use a UDF (as shown) or retrieve this from a date dimension table.</p>
</li>
<li><p>Use the LAG window function to retrieve the previous cumulative value (partitioned by our grain + fiscal year and ordered by period).</p>
</li>
<li><p>Subtract the previous cumulative value from the current one to derive the actual monthly sales.</p>
</li>
</ol>
<p>• For the first month of a fiscal year, there’s no previous value, so we default to 0 in case of a NULL there.</p>
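<p>The three steps can also be sketched in Python for a single grain. This is an illustration under assumptions, not the query from the screenshot: the names are hypothetical, and labelling a fiscal year by the calendar year in which it ends is just one possible convention.</p>

```python
from datetime import date

def fiscal_year(period):
    """Label a July 1st - June 30th fiscal year by the calendar year it ends in."""
    return period.year + 1 if period.month >= 7 else period.year

def monthly_from_cumulative(rows):
    """rows: (period, cumulative_sales) tuples for one grain, e.g. a single store."""
    result = []
    previous = {}  # fiscal year -> last cumulative value seen (the LAG)
    for period, cumulative in sorted(rows):
        fy = fiscal_year(period)
        # No previous value within this fiscal year yet, so default to 0.
        monthly = cumulative - previous.get(fy, 0)
        previous[fy] = cumulative
        result.append((period, fy, monthly))
    return result

rows = [
    (date(2024, 5, 1), 1100),
    (date(2024, 6, 1), 1200),
    (date(2024, 7, 1), 90),   # new fiscal year: the cumulative value resets
    (date(2024, 8, 1), 210),
]
monthly = monthly_from_cumulative(rows)
```

<p>Note how the first row gets the full 1100: the sample starts mid-year, so everything accumulated before May lands on that month.</p>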
<p>Things to watch out for:<br />➡️ Gaps in the data: How do they impact the calculation? Are we okay with that?<br />➡️ Grain considerations: Do we need to do this per department? Per country? If so, adjust the PARTITION BY accordingly.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:700/0*Wuw8he37UZSOsKiT" alt /></p>
<hr />
<p><em>Enjoyed this? Here are some related articles you might find useful:</em></p>
<ul>
<li><a target="_blank" href="https://datawise.dev/practical-bigquery-joining-temporal-tables">Joining temporal tables in BigQuery</a></li>
<li><a target="_blank" href="https://datawise.dev/generating-a-compact-temporal-table-in-bigquery">Generating a compact temporal table in BigQuery</a></li>
<li><a target="_blank" href="https://datawise.dev/aggregating-multiple-scd-2-attribute-timelines-in-bigquery">Aggregating Multiple SCD-2 Attribute Timelines in BigQuery</a></li>
<li><a target="_blank" href="https://datawise.dev/compacting-date-intervals-in-bigquery">Compacting date intervals in BigQuery</a></li>
</ul>
]]></content:encoded></item><item><title><![CDATA[A quick walkthrough BigQuery Remote Functions]]></title><description><![CDATA[In a previous post, I mentioned Remote Functions—a powerful way to send data from BigQuery to an external service for processing, including a Cloud Run function.
This is especially useful when SQL lacks built-in support for your specific needs, and w...]]></description><link>https://datawise.dev/a-quick-walkthrough-bigquery-remote-functions</link><guid isPermaLink="true">https://datawise.dev/a-quick-walkthrough-bigquery-remote-functions</guid><dc:creator><![CDATA[Constantin Lungu]]></dc:creator><pubDate>Sat, 08 Mar 2025 22:16:09 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/vUxSIkqveu8/upload/9ddbbdc1ce0a99d8ab9b0c1e1a933e01.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In <a target="_blank" href="https://datawise.dev/a-quick-overview-of-bigquery-functions">a previous post</a>, I mentioned Remote Functions—a powerful way to send data from BigQuery to an external service for processing, including a Cloud Run function.</p>
<p>This is especially useful when SQL lacks built-in support for your specific needs, and writing a UDF isn’t an option (for example, if you need a highly specialized Python function).</p>
<p>Before building your own, check out <a target="_blank" href="https://unytics.io/bigfunctions/">bigfunctions</a>—many common use cases have already been solved by others!</p>
<h2 id="heading-what-are-remote-functions-in-bigquery">What are remote functions in BigQuery?</h2>
<p>These are a special type of function that delegates processing of input to an external resource, allowing us to:</p>
<ul>
<li><p><strong>send</strong> <strong>data</strong> from BigQuery to Google Cloud Functions or other external services</p>
</li>
<li><p><strong>process</strong> <strong>it</strong> using a programming language</p>
</li>
<li><p><strong>return results</strong> to our query</p>
</li>
</ul>
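<p>On the receiving side, the service gets a JSON body with a <code>calls</code> array (one argument list per input row) and is expected to answer with a <code>replies</code> array of the same length. Below is a minimal sketch of that round trip in plain Python. The function name <code>handle_request</code> and the title-casing logic are placeholders, and the HTTP wiring (Cloud Run, a web framework) is deliberately left out.</p>

```python
import json

def handle_request(body: str) -> str:
    """Turn a remote-function request body into a response body.

    The request carries "calls": one argument list per input row.
    The response carries "replies": one result per call, in the same order.
    """
    payload = json.loads(body)
    replies = []
    for args in payload["calls"]:
        text = args[0]
        # Hypothetical processing step: any Python logic or library call could go here.
        replies.append(None if text is None else text.strip().title())
    return json.dumps({"replies": replies})

# Simulate a batch of three rows sent by BigQuery.
request_body = json.dumps({"calls": [["  ada lovelace "], [None], ["alan turing"]]})
response = json.loads(handle_request(request_body))
print(response["replies"])
```

<p>Rows arrive in batches, so the handler loops over <code>calls</code> rather than handling one value per request, and NULL inputs arrive as JSON <code>null</code>.</p>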
<h2 id="heading-why-is-that-important">Why is that important?</h2>
<p>A remote function can encapsulate any kind of logic, written in any major programming language.</p>
<p>This opens the door to a vast ecosystem of libraries, such as Python packages.</p>
<p>Let’s look at a step-by-step example of creating a Remote Function.</p>
<h2 id="heading-step-1-create-the-cloud-run-function">Step 1: Create the Cloud Run Function</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1741471414177/6d8b849d-737f-4efe-a3ab-a39f1dd24e96.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-step-2-create-a-connection">Step 2: Create a connection</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1741471757275/f7419114-e5fc-4feb-b080-184828d20a13.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1741471864886/3b677c91-3f43-4831-9374-e23ab82cb978.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-step-3-set-up-permissions">Step 3: Set up permissions</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1741471928975/1ef3beac-2918-4766-b21e-68949299d7b5.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-step-4-bind-the-connection-with-cloud-run-function">Step 4: Bind the connection with Cloud Run function</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1741471980691/6b5ce09d-a1b1-4603-9d23-48d7556013bd.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-test-run">Test run</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1741472093181/dd455fb7-9653-41b5-a54a-87460190feee.png" alt class="image--center mx-auto" /></p>
]]></content:encoded></item><item><title><![CDATA[Change history in BigQuery]]></title><description><![CDATA[Ever needed to track what changed in a table and when? In data engineering, this is known as Change Data Capture (CDC)—a fundamental challenge when dealing with evolving datasets.
Now, the Change History features in BigQuery sound pretty interesting....]]></description><link>https://datawise.dev/change-history-in-bigquery</link><guid isPermaLink="true">https://datawise.dev/change-history-in-bigquery</guid><dc:creator><![CDATA[Constantin Lungu]]></dc:creator><pubDate>Thu, 06 Mar 2025 15:44:40 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/5IHz5WhosQE/upload/8e51a5949b7b388bd6a3ae6487e90811.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Ever needed to track what changed in a table and when? In data engineering, this is known as Change Data Capture (CDC)—a fundamental challenge when dealing with evolving datasets.</p>
<p>Now, the Change History features in BigQuery sound pretty interesting.</p>
<p>BigQuery SQL has had the APPENDS table-valued function (TVF) for some time now, which works well for append-only scenarios. But it didn’t capture updates or deletes.</p>
<p>A few months ago a CHANGES TVF was added, which provides visibility into UPDATE and DELETE operations.</p>
<p>Unlike APPENDS (which works right out of the box), you need to enable change history tracking manually either at table creation or with an <code>ALTER TABLE ... SET OPTIONS()</code> command.</p>
<p>To illustrate how it all works I've:<br />1️⃣ Created a table 2️⃣ Inserted a row 3️⃣ Updated a row</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1741275797472/f1c699d1-6bc0-470c-afb3-e785d3fa9e46.jpeg" alt class="image--center mx-auto" /></p>
<p>As you will be able to see:<br />✅ APPENDS captures new rows only.<br />✅ CHANGES logs updates too (as a DELETE + INSERT).</p>
<p>Key things to note:</p>
<p>⚠️ Both features are still in preview, so not production-ready.<br />💰 Querying this data still incurs processing costs.<br />⏳ CHANGES only tracks modifications older than 10 minutes.<br />📦 Enabling Change History means extra storage costs for metadata.</p>
<p>Has anyone tried using these in real-life scenarios?</p>
]]></content:encoded></item><item><title><![CDATA[A quick overview of BigQuery functions]]></title><description><![CDATA[When we talk about functions in BigQuery, we're referring to several distinct capabilities.
Beyond the standard built-in functions like CURRENT_TIMESTAMP() or LENGTH(), BigQuery helps users to define custom functions that extend SQL capabilities. The...]]></description><link>https://datawise.dev/a-quick-overview-of-bigquery-functions</link><guid isPermaLink="true">https://datawise.dev/a-quick-overview-of-bigquery-functions</guid><dc:creator><![CDATA[Constantin Lungu]]></dc:creator><pubDate>Sun, 02 Mar 2025 20:57:27 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/cHhbULJbPwM/upload/aff47b5d4d7d6ac7457150c035267f20.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When we talk about functions in <strong>BigQuery</strong>, we're referring to several distinct capabilities.</p>
<p>Beyond the standard built-in functions like CURRENT_TIMESTAMP() or LENGTH(), BigQuery lets users define custom functions that extend SQL capabilities. These user-defined functions belong to the larger category of routines (alongside stored procedures), enabling logic reuse.</p>
<p>🔹 Types of Functions in BigQuery</p>
<p>By Duration</p>
<p>➡️ Persistent functions – Stored in your dataset and reusable across all sessions</p>
<p>➡️ Temporary functions – Available only within your current session (created with TEMP keyword)</p>
<p>By Return Type</p>
<p>➡️ Scalar functions – Return a single value per input row (which can be complex types like structs or arrays) → typically used in SELECT or WHERE clauses</p>
<p>➡️ Table-Valued Functions (TVFs) – Return entire tables, requiring you to SELECT FROM the function</p>
<p>BigQuery functions can be written in either SQL or JavaScript.</p>
<p>Based on their processing nature:</p>
<p>➡️ Regular UDFs – Process individual rows, transforming inputs into a single output value</p>
<p>➡️ User-Defined Aggregate Functions (UDAFs) – Combine multiple rows into a single result using custom logic (currently in preview)</p>
<p>🔹 Beyond BigQuery: Remote Functions</p>
<p>For complex processing requirements, BigQuery offers remote functions, which allow us to:</p>
<p>➡️ Send data to Google Cloud Functions or other external services</p>
<p>➡️ Process it using a programming language</p>
<p>➡️ Return results to our query</p>
<p>This opens access to the vast ecosystem of libraries in languages like Python.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740949017916/8a2445da-f34f-4a89-8dbc-966270b5d657.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-their-place-in-modern-sql">Their Place in Modern <strong>SQL</strong></h3>
<p>Back in the day when I just started with <strong>SQL Server</strong>, I used scalar functions sparingly (as a junior I was always warned about performance 🤓) and occasionally employed TVFs for small reusable datasets.</p>
<p>Today, with modern transformation frameworks like dbt and Dataform, I find myself hardly using BigQuery functions at all: the same reusable logic is now defined as macros or custom JS functions within these frameworks.</p>
<p>💡 I'm curious:</p>
<p>➡️ How often do you use UDFs or TVFs in your SQL environment?</p>
<p>➡️ Do you prefer handling reusable logic in your SQL code or in external frameworks?</p>
<p>➡️ Any interesting use cases you've seen for remote functions for unusual/specialized processing needs?</p>
]]></content:encoded></item><item><title><![CDATA[Expressing multiple repeated joins as a correlated subquery]]></title><description><![CDATA[In yesterday’s post, we looked at retrieving information from a table by joining it multiple times—each with different join criteria. This raises a natural question: are there better alternatives to this approach?
I initially experimented with a CASE...]]></description><link>https://datawise.dev/expressing-multiple-repeated-joins-as-a-correlated-subquery</link><guid isPermaLink="true">https://datawise.dev/expressing-multiple-repeated-joins-as-a-correlated-subquery</guid><dc:creator><![CDATA[Constantin Lungu]]></dc:creator><pubDate>Sun, 23 Feb 2025 15:13:22 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/GsRaiFdcTY4/upload/53033ddb91c2a6155eb0ade4fda8c8d8.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In <a target="_blank" href="https://datawise.dev/revisiting-group-by-rollup-with-a-more-realistic-example">yesterday’s post</a>, we looked at retrieving information from a table by joining it multiple times—each with different join criteria. This raises a natural question: are there better alternatives to this approach?</p>
<p>I initially experimented with a CASE WHEN in the join condition, hoping it would short-circuit, picking the first matching condition—just like in a SELECT clause. However, in a join, it evaluates all scenarios, so that didn’t work as expected.</p>
<p>But remember correlated subqueries? A correlated subquery runs once per row and can be embedded in the SELECT or WHERE clause. Essentially, it lets you create a dynamic query within a single data cell, based on the current row’s context. Check out <a target="_blank" href="https://datawise.dev/using-correlated-subqueries-in-bigquery">this quick intro</a>.</p>
<p>To avoid multiple joins, you can use a correlated subquery to fetch all possible combinations (previously handled by join conditions) and apply the same logic with ORDER BY and LIMIT to return exactly one value.</p>
<p>A word of caution: correlated subqueries execute once per row, which can impact performance, especially with large datasets. However, they’re a valuable tool in your SQL tool belt, particularly when other elegant solutions aren’t available.</p>
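<p>Conceptually, the subquery behaves like a small function evaluated for each outer row: filter the candidates by the current row, sort them by priority, keep the first. Here is a rough Python analogy; the column names, the <code>priority</code> field and the sample data are all assumptions made for illustration.</p>

```python
def pick_match(row, candidates):
    """Emulate a correlated subquery: filter by the current row, ORDER BY priority, LIMIT 1."""
    matching = [
        c for c in candidates
        # A None in the candidate acts as a wildcard for that column.
        if c["product_id"] in (row["product_id"], None)
        and c["category"] in (row["category"], None)
    ]
    # Lower priority = more specific match; sort, then take the first (ORDER BY ... LIMIT 1).
    matching.sort(key=lambda c: c["priority"])
    return matching[0]["value"] if matching else None

candidates = [
    {"product_id": "apples", "category": "Food", "priority": 1, "value": 6.0},
    {"product_id": None, "category": "Food", "priority": 2, "value": 4.5},
    {"product_id": None, "category": None, "priority": 3, "value": 5.0},
]
rows = [
    {"product_id": "apples", "category": "Food"},
    {"product_id": "mangoes", "category": "Food"},
    {"product_id": "washer", "category": "Appliances"},
]
results = [pick_match(r, candidates) for r in rows]
```

<p>The per-row call to <code>pick_match</code> is exactly where the performance caveat comes from: the lookup work is repeated for every outer row.</p>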
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740323268891/8607984d-3445-4040-bfab-d5b29cd62f6a.jpeg" alt class="image--center mx-auto" /></p>
<hr />
<p><em>Enjoyed this? Here are some related articles you might find useful:</em></p>
<ul>
<li><a target="_blank" href="https://datawise.dev/self-joins-in-sql">Self-joins in SQL</a></li>
<li><a target="_blank" href="https://datawise.dev/anti-joins-in-sql">Anti-joins in SQL</a></li>
<li><a target="_blank" href="https://datawise.dev/semi-joins-in-sql">SEMI-JOINS in SQL</a></li>
<li><a target="_blank" href="https://datawise.dev/non-equi-joins-in-sql">NON-EQUI joins in SQL</a></li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Revisiting GROUP BY ROLLUP with a more realistic example]]></title><description><![CDATA[Ever had a random piece of knowledge from school suddenly click in a real-world scenario?
It felt like that for me remembering about ROLLUP a few days ago.
I wrote about GROUP BY ROLLUP roughly 1.5 years ago—one of my first posts here. At the time, i...]]></description><link>https://datawise.dev/revisiting-group-by-rollup-with-a-more-realistic-example</link><guid isPermaLink="true">https://datawise.dev/revisiting-group-by-rollup-with-a-more-realistic-example</guid><dc:creator><![CDATA[Constantin Lungu]]></dc:creator><pubDate>Sat, 22 Feb 2025 14:56:13 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/PB80D_B4g7c/upload/3acf41a5d990604edcf90a65ed72979e.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Ever had a random piece of knowledge from school suddenly click in a real-world scenario?</p>
<p>That's how it felt when I remembered ROLLUP a few days ago.</p>
<p>I wrote about GROUP BY ROLLUP roughly 1.5 years ago—<a target="_blank" href="https://datawise.dev/using-rollup-in-bigquery">one of my first posts here</a>. At the time, it was unfamiliar to me, and I had no idea I’d ever need it. But this week, I finally encountered a real use case.</p>
<p><strong>The Problem</strong></p>
<p>Imagine we have sales data for a retail store, where each product belongs to a subcategory and a category (e.g., Apples → Fruits → Food).</p>
<p>We want to compute the average ordered quantity per product, but with a hierarchical fallback:  </p>
<ol>
<li>If there’s no product-level data, use the subcategory average.  </li>
<li>If that’s missing, use the category average.  </li>
<li>If still unavailable, fall back to the overall average across all products.</li>
</ol>
<p><strong>How ROLLUP Helps</strong></p>
<p>When we GROUP BY ROLLUP (category, subcategory, product_id), we get multiple aggregation levels in one query:<br />✅ Per product<br />✅ Per subcategory<br />✅ Per category<br />✅ Across all rows</p>
<p>This allows us to build a lookup table, which we can use with multiple LEFT JOINs to apply the fallback logic.</p>
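<p>To show the shape of that lookup table and the fallback, here is a small Python sketch. It only imitates what <code>GROUP BY ROLLUP</code> and the chained LEFT JOINs do; the function names and the sample sales rows are invented and do not match the data in the screenshot.</p>

```python
from collections import defaultdict

def rollup_averages(sales):
    """Average quantity at every ROLLUP(category, subcategory, product_id) level.

    Keys are (category, subcategory, product_id), with None for rolled-up columns.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [sum of quantity, row count]
    for category, subcategory, product_id, quantity in sales:
        levels = [
            (category, subcategory, product_id),  # per product
            (category, subcategory, None),        # per subcategory
            (category, None, None),               # per category
            (None, None, None),                   # across all rows
        ]
        for key in levels:
            totals[key][0] += quantity
            totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

def average_with_fallback(lookup, category, subcategory, product_id):
    """Try the most specific level first, like COALESCE over several LEFT JOINs."""
    for key in [
        (category, subcategory, product_id),
        (category, subcategory, None),
        (category, None, None),
        (None, None, None),
    ]:
        if key in lookup:
            return lookup[key]
    return None

sales = [
    ("Food", "Fruits", "apples", 5),
    ("Food", "Fruits", "apples", 7),
    ("Food", "Fruits", "pears", 3),
    ("Electronics", "Phones", "phone_x", 9),
]
lookup = rollup_averages(sales)
print(average_with_fallback(lookup, "Food", "Fruits", "mangoes"))
```

<p>One thing to keep in mind with the real query: a rolled-up level is represented by NULLs in the grouping columns, so genuine NULLs in the data need separate handling.</p>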
<p><strong>Let's test it</strong></p>
<p>Here’s how it works in practice:<br />• Apples → Direct sales data → AVG(quantity) = 6<br />• Mangoes → No past sales → Uses Fruits subcategory → AVG(quantity) = 4.67<br />• Cucumbers → No past sales, no Vegetables subcategory data → Uses Food category → AVG(quantity) = 4.67<br />• Washing Machine → No sales data, no relevant category → Uses overall average → AVG(quantity) = 6</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740236002221/32bcd67e-bad3-4d76-845a-1b5602877a50.jpeg" alt class="image--center mx-auto" /></p>
<p><strong>In lieu of a conclusion</strong></p>
<p>This was a fun experiment, but let’s be honest—this could also be done with window functions!</p>
<p>Still, ROLLUP provides an alternative perspective, and I’m on the lookout for an even better use case.</p>
<p>Have you ever had an SQL feature suddenly “click” for you?</p>
]]></content:encoded></item><item><title><![CDATA[Revisiting Why SQL’s Order of Execution Matters]]></title><description><![CDATA[A few days ago I thought that the following SQL query would not work— I expected the window function result would be summed multiple times.
🚨 Turns out, I was wrong.
This was a great reminder of why understanding SQL’s order of execution is crucial!...]]></description><link>https://datawise.dev/revisiting-why-sqls-order-of-execution-matters</link><guid isPermaLink="true">https://datawise.dev/revisiting-why-sqls-order-of-execution-matters</guid><dc:creator><![CDATA[Constantin Lungu]]></dc:creator><pubDate>Fri, 21 Feb 2025 10:52:16 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/hOwcob_3dpc/upload/35a2410e2a39505e1432cb8a7777364f.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A few days ago I thought that the following SQL query would not work— I expected the window function result would be summed multiple times.</p>
<p>🚨 Turns out, I was wrong.</p>
<p>This was a great reminder of why understanding SQL’s order of execution is crucial!</p>
<p>I expected SUM(SUM(val)) OVER (PARTITION BY id) to accumulate incorrectly, but SQL’s execution order ensures that:</p>
<p>1️⃣ The GROUP BY clause first aggregates SUM(val) at the id grain.</p>
<p>2️⃣ Then, the window function is applied to the grouped result—not the raw data. Since there’s only one row per id, the window function correctly returns the expected value.</p>
<p>SQL doesn’t “re-sum” the window function like I feared. Instead, it partitions over the already-aggregated values—exactly as it should.</p>
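<p>The two stages can be imitated in Python to see why nothing is double counted. This is a toy model of the logical order, not how any engine actually executes the query, and the sample rows are made up.</p>

```python
from collections import defaultdict

rows = [("a", 1), ("a", 2), ("b", 10), ("b", 20), ("b", 30)]  # (id, val)

# Stage 1: GROUP BY id, producing one row per id with SUM(val).
grouped = defaultdict(int)
for id_, val in rows:
    grouped[id_] += val

# Stage 2: the window SUM(...) OVER (PARTITION BY id) runs over the grouped rows.
# Each partition holds exactly one row, so the window sum equals the group sum.
partition_sums = defaultdict(int)
for id_, group_sum in grouped.items():
    partition_sums[id_] += group_sum

result = [(id_, grouped[id_], partition_sums[id_]) for id_ in grouped]
print(result)
```

<p>The inner sum and the windowed sum come out identical for every id, because the window only ever sees the single already-aggregated row of its partition.</p>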
<p>🔍 Have you ever misjudged a query’s behavior?</p>
<p><img src="https://media.licdn.com/dms/image/v2/D4E22AQFbSlouY2_eng/feedshare-shrink_800/B4EZUoFw_jGYAg-/0/1740134355635?e=1743033600&amp;v=beta&amp;t=o7CElWZmK91L8ZF_O7APDU3TjUctBrwXGeikRN8htU8" alt="No alt text provided for this image" /></p>
<hr />
<p><em>Enjoyed this? Here are some related articles you might find useful:</em></p>
<ul>
<li><a target="_blank" href="https://datawise.dev/9-tips-on-writing-cleaner-sql">9 tips on writing cleaner SQL</a></li>
<li><a target="_blank" href="https://datawise.dev/order-of-precedence-in-sql-where-vs-having">Order of precedence in SQL: WHERE vs HAVING</a></li>
<li><a target="_blank" href="https://datawise.dev/easy-with-that-select-distinct">Easy with that SELECT DISTINCT!</a></li>
<li><a target="_blank" href="https://datawise.dev/why-you-should-use-parentheses-with-and-or-in-sql">Why you should use parentheses with AND &amp; OR in SQL</a></li>
</ul>
]]></content:encoded></item></channel></rss>