What function do you use to explicitly de-duplicate in BigQuery?  
I normally use ROW\_NUMBER(), but I've recently encountered a really interesting [blog post](https://cloud.google.com/blog/topics/developers-practitioners/bigquery-admin-reference-guide-query-optimization) suggesting ARRAY\_AGG might be more performant for the task.

The explanation given is that, when ARRAY\_AGG is combined with `ORDER BY` and `LIMIT 1`, the `ORDER BY` is allowed to drop everything except the top record in each `GROUP BY` group, making ARRAY\_AGG more efficient.

Naturally, I gave it a try on some sample data.

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1701523417203/d85c84a8-7229-4fc5-a522-3392e47ea44a.png align="center")

In the example below, we'd like to pick the latest date available per id. The `ROW_NUMBER` example is pretty straightforward - we partition by id and order by `ds_date` in descending order, then use the `QUALIFY` clause to keep only the record we want.

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1701523015713/950e8e52-6af0-4581-9515-eb0c05427db5.png align="center")
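
In text form, the `ROW_NUMBER` approach could look roughly like the sketch below. This is not a copy of the query in the screenshot: the `sample` CTE and its rows are made up for illustration, and only the `id` and `ds_date` column names come from the example above.

```sql
-- Hypothetical inline sample data; swap in your own table.
WITH sample AS (
  SELECT 1 AS id, DATE '2023-11-01' AS ds_date UNION ALL
  SELECT 1, DATE '2023-11-02' UNION ALL
  SELECT 2, DATE '2023-11-01'
)
SELECT
  id,
  ds_date
FROM sample
-- No-op filter, kept only so QUALIFY has a companion clause;
-- drop it if your setup does not need it.
WHERE TRUE
-- Number the rows of each id from newest to oldest and keep the newest.
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY ds_date DESC) = 1;
```

With these three rows, the query keeps `2023-11-02` for id 1 and `2023-11-01` for id 2. If two rows for the same id share the latest date, which of them survives is arbitrary unless you add a tie-breaker to the `ORDER BY`.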

The `ARRAY_AGG` example, while looking a bit more intimidating, does the same thing.

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1701523032907/895ddea1-3efa-48ab-8c1a-95b21c37f1b5.png align="center")
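
As a rough text sketch of the `ARRAY_AGG` approach (again, not a copy of the screenshot): the `sample` CTE, its rows, and the `latest` alias are assumptions made for illustration, while `id` and `ds_date` come from the example above.

```sql
-- Hypothetical inline sample data; swap in your own table.
WITH sample AS (
  SELECT 1 AS id, DATE '2023-11-01' AS ds_date UNION ALL
  SELECT 1, DATE '2023-11-02' UNION ALL
  SELECT 2, DATE '2023-11-01'
)
SELECT
  latest.*  -- unpack the surviving row back into regular columns
FROM (
  SELECT
    -- Aggregate whole rows per id, ordered newest first, and keep just one.
    -- [OFFSET(0)] pulls that single row out of the one-element array.
    ARRAY_AGG(s ORDER BY s.ds_date DESC LIMIT 1)[OFFSET(0)] AS latest
  FROM sample AS s
  GROUP BY s.id
);
```

The `LIMIT 1` inside the aggregate is the important part: it tells BigQuery that only the top row per group is needed, so the rest never has to be kept around.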

It turns out that the recommendation holds - on my sample data, slot time for the ARRAY\_AGG version was only 40% of that of the ROW\_NUMBER version. Another day, another lesson learned.

*Found it useful? Subscribe to my Analytics newsletter at* [*notjustsql.com*](https://www.notjustsql.com)*.*

De-duplicating with ROW_NUMBER vs ARRAY_AGG
