Window functions are powerful. But they can also help us fill in missing data in BigQuery.
Let's say you have a sensor that records temperature and humidity. Unfortunately, it is quite unreliable, so sometimes it might not send one or both readings. You'd like to retain the last known reading for a measurement.
Here's how we can solve it:
- Leverage the LAST_VALUE window function.
- Specify the IGNORE NULLS clause
- Partition by sensor_id
so only we consider data from the same sensor
- Order the window by the timestamp column
- Define a ROWS condition to consider rows from the beginning of time up to and including the current row - this way, if we do have a current reading for this timestamp, we keep it.
Here's how it would look in SQL:
SELECT
sensor_id,
temperature,
humidity_percentage,
at_timestamp,
LAST_VALUE(temperature IGNORE NULLS) OVER (PARTITION BY sensor_id ORDER BY at_timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS tr_temperature,
LAST_VALUE(humidity_percentage IGNORE NULLS) OVER (PARTITION BY sensor_id ORDER BY at_timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS tr_humidity,
FROM input_data
ORDER BY at_timestamp
Previously, I've written another blog post solving a similar problem by leveraging NTH_VALUE
.
P.S. This would not work if you're trying to fill in a STRUCT for example. IGNORE NULLS does not regard STRUCT(NULL AS a, NULL AS b) the same as NULL, so you might need to unpack the STRUCT.
Happy querying!
Found it useful? Subscribe to my Analytics newsletter at notjustsql.com.