Data is collected and processed in a number of ways, and it should come as no surprise that it's not always perfect.
Perhaps the most important thing you need to do before analyzing data is have a look at how it's presented and check for irregularities.
Before any sound analysis a great deal of attention needs to be paid to cleaning the data.
Take string columns for instance. In hashtag#BigQuery, as with other engines, there is a wealth of functions helping you to process strings, including:
- TRIM/RTRIM/LTRIM for getting rid of the whitespace
- REPLACE to replace a substring with another one
- UPPER/LOWER/NORMALIZE etc to control casing
- SUBSTR/SUBSTRING to cut strings and so on.
The main goal here is to bring everything to a common denominator, being able to tell which observations belong together and which data can be considered "missing".
Found it useful? Subscribe to my Analytics newsletter at notjustsql.com.