De-duplicating with ROW_NUMBER vs ARRAY_AGG

What function do you use to explicitly de-duplicate in BigQuery?
I normally use ROW_NUMBER(), but I've recently encountered a really interesting blog post suggesting ARRAY_AGG might be more performant for the task.

The explanation given is that the ORDER BY is allowed to drop everything except the top record on each GROUP BY, making ARRAY_AGG more efficient.

Sure enough, I did give it a try on some sample data.

In the below example, we'd like to pick the latest date available per id. The ROW_NUMBER example is pretty straightforward - we partition by id and order by ds_date decreasingly, then use the QUALIFY clause to keep only the record we want.

The ARRAY_AGG example, while looking a bit more intimidating, does the same thing.

It turns out that the recommendation holds - slot time for the ARRAY_AGG version was only 40% of the ROW_NUMBER. Another day, another lesson learned.

Found it useful? Subscribe to my Analytics newsletter at