Joining temporal tables in BigQuery

Photo by Waldemar on Unsplash

Joining temporal tables in BigQuery

In today’s practical BigQuery exercise, we’re going to look at how we can join multiple temporal tables in BigQuery. If the name temporal tables does not sound familiar, you might have heard about transaction tables or even a Slowly-Changing Type-2 table (typically dimension, but could also be a fact). In any case, chances are you might recognize it below:

+----+------------+------------+--------+
| id | valid_from | valid_to   | valueB |
+----+------------+------------+--------+
| 1  | 2022-04-01 | 2022-07-01 | b1     |
| 1  | 2022-07-01 | 2022-08-01 | b2     |
| 1  | 2022-08-01 | 9999-01-01 | b3     |
+----+------------+------------+--------+

The above table defines an interval that includes the right but not the left bounds, as follows:

[valid_from, valid_to)

Now, if we want to find out what was the valueB for ID = 1 for a particular moment in time, say - today - the query would be pretty straightforward:

SELECT
id,
valid_from,
valid_to,
valueB
FROM
table_b
WHERE
CURRENT_DATE() > valid_from AND valid_to > CURRENT_DATE()

which evaluates today to the following:

+----+------------+------------+--------+
| id | valid_from | valid_to   | valueB |
+----+------------+------------+--------+
| 1  | 2022-08-01 | 9999-01-01 | b3     |
+----+------------+------------+--------+

But what if we have multiple temporal tables like the above and we’d like to look at them together, displaying the data for ID = 1 (be that a store_id or a product_id). How can we join multiple temporal tables into one?

Problem statement

Given several temporal tables, table_a, table_b and table_c, presenting different time-bound contexts about the same entity (id):

-- table_a
+----+------------+------------+--------+
| id | valid_from | valid_to   | valueA |
+----+------------+------------+--------+
| 1  | 2022-01-01 | 2022-06-01 | a1     |
| 1  | 2022-06-01 | 2022-08-01 | a2     |
+----+------------+------------+--------+

-- table_b
+----+------------+------------+--------+
| id | valid_from | valid_to   | valueB |
+----+------------+------------+--------+
| 1  | 2022-04-01 | 2022-07-01 | b1     |
| 1  | 2022-07-01 | 2022-08-01 | b2     |
| 1  | 2022-08-01 | 9999-01-01 | b3     |
+----+------------+------------+--------+

-- table_c
+----+------------+------------+--------+
| id | valid_from | valid_to   | valueC |
+----+------------+------------+--------+
| 1  | 2022-02-01 | 2022-05-01 | c1     |
| 1  | 2022-05-01 | 2022-09-01 | c2     |
| 1  | 2022-09-01 | 9999-01-01 | c3     |
+----+------------+------------+--------+

compute a joined temporal table that looks as follows:

+------------+------------+----+--------+--------+--------+
| valid_from | valid_to   | id | valueA | valueB | valueC |
+------------+------------+----+--------+--------+--------+
| 2022-01-01 | 2022-02-01 | 1  | a1     |        |        |
| 2022-02-01 | 2022-04-01 | 1  | a1     |        | c1     |
| 2022-04-01 | 2022-05-01 | 1  | a1     | b1     | c1     |
| 2022-05-01 | 2022-06-01 | 1  | a1     | b1     | c2     |
| 2022-06-01 | 2022-07-01 | 1  | a2     | b1     | c2     |
| 2022-07-01 | 2022-08-01 | 1  | a2     | b2     | c2     |
| 2022-08-01 | 2022-09-01 | 1  |        | b3     | c2     |
| 2022-09-01 | 9999-01-01 | 1  |        | b3     | c3     |
+------------+------------+----+--------+--------+--------+

Trying it out

Let’s try and decompose this problem. In the below representation, I’ve marked with red the occurrence of an event — change of state for one of the three attributes we’re interested. We can deduce that an event for one attribute must become an event for all the other attributes, as it basically slices the other intervals into smaller intervals.

For instance, the value a1 has full overlap with value c1 , but limited overlap with values b1 and c2.

Graphical representation of the issue

We’re going to start our query by creating this list of events (validity start or ending) across all the attributes. Notice we’re interested in a unique list, therefore using distinct. The list needs to be created at our desired grain, in our case — by id .

WITH dates AS (

SELECT id, valid_from AS event_date FROM table_a

UNION DISTINCT

SELECT id, valid_to AS event_date FROM table_a

UNION DISTINCT

SELECT id, valid_from AS event_date FROM table_b

UNION DISTINCT

SELECT id, valid_to AS event_date FROM table_b

UNION DISTINCT

SELECT id, valid_from AS event_date FROM table_c

UNION DISTINCT

SELECT id, valid_to AS event_date FROM table_c

)

We will now join each table to our dates common table expression.

We’ll join on our grain (in our case: id ) and the date condition, the event_date needs greater or equal to valid_from but strictly less than valid_to.

In the SELECT part, we’ll use the event_date as the valid_from of the new raw. We’ll then use the LEAD window function to fetch the next valid_from entry (while partitioning by our grain and ordering by the event_date) which will become our valid_to. We include the attributes we need from each table — valueA, valueB and valueC.

We’ll also order the output by valid_from to get a nice chronological view of events.

SELECT
d.event_date AS valid_from,
LEAD(d.event_date, 1) OVER (PARTITION BY d.id ORDER BY d.event_date) AS valid_to,
d.id,
a.valueA,
b.valueB,
c.valueC

FROM dates d

LEFT JOIN table_a a ON d.id = a.id AND d.event_date >= a.valid_from AND d.event_date < a.valid_to

LEFT JOIN table_b b ON d.id = b.id AND d.event_date >= b.valid_from AND d.event_date < b.valid_to

LEFT JOIN table_c c ON d.id = c.id AND d.event_date >= c.valid_from AND d.event_date < c.valid_to

ORDER BY valid_from

The above query produces the following results:

+------------+------------+----+--------+--------+--------+
| valid_from | valid_to   | id | valueA | valueB | valueC |
+------------+------------+----+--------+--------+--------+
| 2022-01-01 | 2022-02-01 | 1  | a1     |        |        |
| 2022-02-01 | 2022-04-01 | 1  | a1     |        | c1     |
| 2022-04-01 | 2022-05-01 | 1  | a1     | b1     | c1     |
| 2022-05-01 | 2022-06-01 | 1  | a1     | b1     | c2     |
| 2022-06-01 | 2022-07-01 | 1  | a2     | b1     | c2     |
| 2022-07-01 | 2022-08-01 | 1  | a2     | b2     | c2     |
| 2022-08-01 | 2022-09-01 | 1  |        | b3     | c2     |
| 2022-09-01 | 9999-01-01 | 1  |        | b3     | c3     |
| 9999-01-01 |            | 1  |        |        |        |
+------------+------------+----+--------+--------+--------+

In the output, you’ll notice that there is an extra row for valid_from = 9999–01–01. This one can be excluded in the query above using QUALIFY, the query becoming:

SELECT
d.event_date AS valid_from,
LEAD(d.event_date, 1) OVER (PARTITION BY d.id ORDER BY d.event_date) AS valid_to,
d.id,
a.valueA,
b.valueB,
c.valueC

FROM dates d

LEFT JOIN table_a a ON d.id = a.id AND d.event_date >= a.valid_from AND d.event_date < a.valid_to

LEFT JOIN table_b b ON d.id = b.id AND d.event_date >= b.valid_from AND d.event_date < b.valid_to

LEFT JOIN table_c c ON d.id = c.id AND d.event_date >= c.valid_from AND d.event_date < c.valid_to

QUALIFY valid_to IS NOT NULL

ORDER BY valid_from
+------------+------------+----+--------+--------+--------+
| valid_from | valid_to   | id | valueA | valueB | valueC |
+------------+------------+----+--------+--------+--------+
| 2022-01-01 | 2022-02-01 | 1  | a1     |        |        |
| 2022-02-01 | 2022-04-01 | 1  | a1     |        | c1     |
| 2022-04-01 | 2022-05-01 | 1  | a1     | b1     | c1     |
| 2022-05-01 | 2022-06-01 | 1  | a1     | b1     | c2     |
| 2022-06-01 | 2022-07-01 | 1  | a2     | b1     | c2     |
| 2022-07-01 | 2022-08-01 | 1  | a2     | b2     | c2     |
| 2022-08-01 | 2022-09-01 | 1  |        | b3     | c2     |
| 2022-09-01 | 9999-01-01 | 1  |        | b3     | c3     |
| 9999-01-01 |            | 1  |        |        |        |
+------------+------------+----+--------+--------+--------+

We can now find out what was the value for all three attributes of our id at a particular point in time, for example:

SELECT
valid_from,
valid_to,
id,
valueA,
valueB,
valueC

FROM results

WHERE valid_from < DATE("2022-05-04") AND valid_to > DATE("2022-05-04")
+------------+------------+----+--------+--------+--------+
| valid_from | valid_to   | id | valueA | valueB | valueC |
+------------+------------+----+--------+--------+--------+
| 2022-05-01 | 2022-06-01 | 1  | a1     | b1     | c2     |
+------------+------------+----+--------+--------+--------+

If you like to play around with the query, please see the entire SQL script below (including the raw data).

Conclusion

In this short practical exercise, we’ve looked at joining multiple temporal tables into a single one.

Thanks for reading and stay tuned for other interesting stories.

Found it useful? Subscribe to my Analytics newsletter at notjustsql.com.

Did you find this article valuable?

Support Constantin Lungu by becoming a sponsor. Any amount is appreciated!