Picking between CTE, View and Temp Table in BigQuery
When it comes to choosing between different abstractions in your database system - be it CTE (common table expression), logical view, temporary table, materialized view, or a standalone staging table - what factors do you take into account?
Reflecting on my SQL Server work, here’s roughly my train of thought around these options:
- CTE: My go-to for cleaning up my queries and enhancing readability.
- Logical View: Ideal for scenarios where I intended to reuse my CTE, and the data volumes were manageable enough to not compromise performance.
- Table-Valued Variables: Perfect for smaller tasks (think a handful of variables), especially when I didn't have tempdb permissions. This sometimes overlapped with my use of CTEs. You can also add an index to it.
- Temp Table: This was the middle ground - not something I’d reuse across different workflows, yet substantial enough to demand more than what a CTE could offer without overloading tempdb. This also supported an index.
- Standalone Materialized Staging Table: My choice for heavy-duty tasks. Whether it was something reusable or an exceptionally large dataset, this option provided the robustness I needed, complete with partitioning and indexing capabilities.
Nowadays, here's how I think about the same situation in BigQuery:
- CTE: Even more of a staple in my toolkit, unless I hit a performance roadblock. Worth noting: it’s evaluated every time it’s called unless it’s recursive.
- Temp Table: Not using it that often, although I know it can be a cost-effective alternative since it doesn’t persist like a materialized staging table. So storage savings are possible with this option.
- Materialized Views: Yet to use one in production, but I’m excited about the flexibility and potential performance perks they bring to the table, also has some limitations. I think it would work great on pre-filtering or pre-aggregating a big result set.
- Staging Table: My go-to when a CTE is called upon multiple times (or similar CTEs exist in multiple pipelines), or when I need to have partitioning and clustering ahead of a big join.
- Logical View: When data volume isn’t an issue, and I’m looking to reuse and enhance readability, this is my pick. It also doubles up as a wrapper.
How does your decision tree look like? Which one do you use in which cases?