Using BigQuery hashing functions

ยท

3 min read

Today I want to highlight hash functions and showcase how I use them in BigQuery.

๐Ÿ” What's a Hashing Function?

It's an algorithm that takes an input and returns a fixed-size output. But there are a few very interesting particularities to that.

A hashing function is:

- fixed-size => you always get an output of the same size
- deterministic => for the same input, you'll always get the same output back
- fast => the hashing function should be quick and resource-efficient
- resistant => very difficult to find out what was the input based on the output - sensitive to change => even adding one letter to a page of text would produce a totally different output
- low chance of collision => it is very improbable to have two different inputs produce the same output.

Now, hashing functions have multiple use cases in security, cryptography and computer science in general, but I want to showcase how I'm using it in my day-to-day as a Data Engineer.

As one can deduct from the proprieties above, the hashing function can very efficiently compare (as in assessing if they are different or the same) vast amounts of information.
BigQuery offers a range of hash functions such as MD5, SHA1 and FARM_FINGERPRINT. So, Where is this useful?

1. Creating surrogate keys. Say you have two source systems, each with unique identifiers (like IDs), but that do have an overlap between themselves. One could do FARM_FINGERPRINT(CONCAT(source_system_name, source_id )) to obtain a unique identifier across all systems.

2. Use it in conjunction with WINDOW functions to determine whether something has changed between two snapshots. See below an example of how it's done.

WITH input_data AS (
  SELECT 1 AS ID, date('2020-01-01') AS Snapshot_Date, 'John Doe' AS Name, 'Berlin' AS City, '55000' AS Salary
  UNION ALL  
  SELECT 1 AS id, date('2022-04-01') AS Snapshot_Date,  'John Doe' AS Name, 'New York' AS City, '55000' AS Salary
  UNION ALL 
  SELECT 2 AS id, date('2022-07-01') AS Snapshot_Date, 'Jessica Taylor' AS Name, 'Paris' AS City, '70000' AS Salary
  UNION ALL
  SELECT 2 AS id, date('2022-08-01') AS Snapshot_Date,  'Jessica Taylor' AS Name, 'Paris' AS City, '70000' AS Salary
  UNION ALL
  SELECT 2 AS id, date('2022-09-01') AS Snapshot_Date, 'Jessica Taylor' AS Name, 'London' AS City , '70000' AS Salary
), 
compute_hash AS 
(
SELECT 
  ID, 
  Name, 
  City, 
  Salary,
  Snapshot_Date,
  FARM_FINGERPRINT(CONCAT(Name, City, Salary)) AS attribute_hash,
FROM input_data
)
SELECT 
  ID,
  Name, 
  City,
  Salary,
  Snapshot_Date,
LAG(attribute_hash) OVER (PARTITION BY ID ORDER BY Snapshot_Date) <> attribute_hash AS ChangedSinceLastSnapshot
FROM compute_hash

3. Ever had to write a MERGE statement where you need to compare 12 different attributes between source and target? Concatenate the attributes and pass them on to a function like FARM_FINGERPRINT and compare the outputs to see if there was any change. Even better if you have a hash like that already computed in each of the tables (staging source and target).

MERGE `learning.Customers` AS target

USING (

SELECT 
    1 AS CustomerId, 
    31 AS Age, 
    'John' AS FirstName, 
    'Doe' AS LastName, 
    'IE' AS Country, 
    150000 AS Salary, 
    '2021-01-01' AS FirstOrderDate

) AS source ON source.CustomerId = target.CustomerId

WHEN MATCHED AND 
-- OPTION 1
  --  (
  --   target.age <> source.age OR 
  --   target.firstname <> source.firstname OR
  --   target.lastname <> source.lastname OR
  --   target.country <> source.country OR
  --   target.salary <> source.salary OR  
  --   target.firstorderdate <> source.firstorderdate 
  --  )

-- OPTION 2
FARM_FINGERPRINT(CONCAT(
                        target.age, 
                        target.firstname, 
                        target.lastname, 
                        target.country, 
                        target.salary, 
                        target.firstorderdate )) <> 

FARM_FINGERPRINT(CONCAT(source.age, 
                        source.firstname, 
                        source.lastname, 
                        source.country, 
                        source.salary, 
                        source.firstorderdate ))

THEN UPDATE
SET 
  target.age = source.age,
  target.firstname = source.firstname,
  target.lastname = source.lastname,
  target.country = source.country, 
  target.salary =  source.salary,  
  target.firstorderdate = source.firstorderdate

P.S. Also note that you can hash an entire row in SQL using TO_JSON_STRING:

SELECT t.*, FARM_FINGERPRINT(TO_JSON_STRING(t)) AS row_hash
FROM learning.Customers AS t;

Thanks for reading!

Did you find this article valuable?

Support Constantin Lungu by becoming a sponsor. Any amount is appreciated!

ย