In this practical BigQuery exercise, we’re going to look at what are recursive CTEs and how can they be useful in our tool set.
What is a CTE?
First of all, what is a CTE? Well, it stands for Common Table Expression and is essentially a named query that can be referenced multiple times in the same query.
An example of a Common Table Expression (CTE) can be seen below:
WITH people AS
( SELECT 'John' AS first_name, 'Doe' AS last_name
UNION ALL
SELECT 'Jane' AS first_name, 'Dow' AS last_name)
SELECT
first_name,
last_name
FROM people
CTEs are used widely for improving query readability as well as making debugging easier.
Enter the recursive variety
Adding the RECURSIVE keyword after WITH enables us to reference a CTE from within itself, essentially creating a recursion, hence the name.
Let’s consider an example:
WITH RECURSIVE cte AS
(
SELECT 1 AS id
UNION ALL
SELECT id + 1
FROM cte
WHERE id < 10
)
SELECT id FROM cte
ORDER BY id
What does it do? We’re essentially generating a list of integers 1 through 10.
In the above query, we start by selecting ID = 1 as our base case — ‘anchor’, and UNION to it all the cases building from it, by adding 1 to the previous case.
Notice the WHERE id < 10 . This is very important as it tells when to stop the recursion.
The same can be done to say generate a list of 100 consecutive days. In practice though, the same can be achieved using functions such as GENERATE_ARRAY or GENERATE_DATE_ARRAY.
Now where a recursive CTE would be really useful is where we have hierarchical data.
Recursive CTEs and hierarchical data
Consider the following data. We have a list of employees and the ID of the manager they report to. Our task is to retrieve each person’s manager's first and last name.
One way to do it would be to self-join the table multiple times. But how would you know how many times you’d need to self-join or what if there is a new hierarchical level down the line?
Here’s what using a recursive CTE to solve this would look like.
WITH RECURSIVE hierarchy as (
SELECT
employee_id,
first_name,
last_name,
manager_id,
0 AS level,
CAST(NULL AS STRING) AS manager_first_name,
CAST(NULL AS STRING) AS manager_last_name
FROM `learning.employee_data`
WHERE manager_id IS NULL
UNION ALL
SELECT
e.employee_id,
e.first_name,
e.last_name,
e.manager_id, level + 1 AS level,
h.first_name AS manager_first_name,
h.last_name AS manager_last_name
FROM `learning.employee_data` e
INNER JOIN hierarchy h ON e.manager_id = h.employee_id
)
SELECT
employee_id,
first_name,
last_name,
manager_id,
level,
manager_first_name,
manager_last_name
FROM hierarchy
We start with the anchor record (top-level employee, our CEO) and then traverse the employee hierarchy by joining the manager_id with the employee_id. We’ve also created a new ‘level’ column to show the hierarchical level — the distance between the employee and the CEO.
Performance considerations
It should be noted that according to BigQuery documentation, as opposed to non-recursive CTEs, the recursive CTEs are materialized (so executed only once).
GoogleSQL only materializes the results of recursive CTEs, but does not materialize the results of non-recursive CTEs inside the
WITH
clause. If a non-recursive CTE is referenced in multiple places in a query, then the CTE is executed once for each reference. TheWITH
clause with non-recursive CTEs is useful primarily for readability.
Conclusion
In today’s exercise, we’ve looked at recursive Common Table Expressions, or CTEs as yet another valuable tool on our belt when working with data in BigQuery.
Thanks for reading and stay tuned for more BigQuery topics.
Found it useful? Subscribe to my Analytics newsletter at notjustsql.com.