首页 ›存档› 技术 › 查看内容

SQL中最Underutilized的功能，却暗藏玄机大有用处我们是一个链接全球人工智能领域顶 ...

2018-3-30 13:00 |来自: 互联网 373 0

摘要: 我们是一个链接全球人工智能领域顶尖工程师、学者以及科学家的社群。 SQL中md5()大家可能都很熟悉，但是你真正了解它，会充分使用它吗？不见得。如果您不大熟悉也没关系，我们做了一个基本的介绍，让你有个概念。实 ...

我们是一个链接全球人工智能领域顶尖工程师、学者以及科学家的社群。

SQL中md5()大家可能都很熟悉，但是你真正了解它，会充分使用它吗？不见得。如果您不大熟悉也没关系，我们做了一个基本的介绍，让你有个概念。实际这个功能暗藏玄机，一旦被好好利用，真的是大有用处。

Find out why md5() is an SQL function that's used surprisingly often, and find out how -- and why -- you can use it yourself.

Over the past nine months I’ve worked with over a dozen venture-funded startups to build out their internal analytics. In doing so, there’s a single SQL function that I have come to use surprisingly often. At first it wasn’t at all clear to me why I would want to use this function, but as time goes on I have found ever more uses for it.

What is it?md5(). If you’re not familiar, here’s an example snippet from the Redshift docs:

select md5('Amazon Redshift');
md5----------------------------------f7415e33f972c03abd4f3fed36748f7a
(1 row)

Give md5() a varchar and it returns itsMD5 hash. **... but seemingly pointless.Why exactly would I want to use that?!

Great question. In this post I’m going to show you two uses for md5() that make it one of the most powerful tools in my SQL kit.

#1: Building Yourself a UniqueID

I’m going to make a really strong statement here, but it’s one that I really believe in:every singledata modelin your warehouse should have a rock solid unique ID.

It’s extremely common for this not to be the case. One reason is that your source data doesn’t have a unique keyif you’re syncing advertising performance data from Facebook Ads viaStitchorFivetran, the source data in your ad_insights table doesn’t have a unique key you can rely on. Instead, you have a combination of fields that is reliably unique (in this case date and ad_id). Using that knowledge, you can build yourself a unique id using md5():

select 
    md5(date_start::varchar || ad_id::varchar) as insight_idfrom 
    stitch_fb_ads.facebook_ads_insights
limit 5;
insight_id----------------------------------6d475ea96f23b097b51ed500116d8c5e     822c9429eabb28ccbcd7286836d7cd60     8b7fcd2aff879772ccac4f0f8bcb6a45     8a2cfd7eb1a723c49db47232e73ca29c     10338719dfadb3d4c9d44c608063998a
(5 rows)

The resulting hash is a meaningless string of alphanumeric text that functions as a unique identifier for your record. Of course, you could just as easily just create a single concatenated varchar field that performed the same function, butit’s actually important to obfuscate the underlying logic behind the hash: you will innately treat the field differently if it looks like an id versus if it looks like a jumble of human-readable text.

There are a couple of reasons why creating a unique id is an important practice:

One of the most common causes of error is duplicate values in a key that an analyst was expecting to be unique. Joins on that field will “fan out” a result set in unexpected ways and can cause significant error that is difficult to troubleshoot. To avoid this, only join on fields where you’ve validated the cardinality and constructed a unique key where necessary.
Some BI tools require you to have a unique key in order to provide certain functionality. For instance, Lookersymmetric aggregatesrequire a unique key in order to function.

We create unique keys for every table and then test uniqueness on this key usingdbt schema tests. We run these tests multiple times per day onSinterand get notifications for any failures. This allows us to be completely confident of the analytics we implement on top of these data models.

#2: Simplifying ComplexJoins

This case is similar to #1 in its execution but it solves a very different puzzle. Imagine the following case. You have the same Facebook Ads dataset as referenced earlier but this time you have a new challenge: join that data to data in your web analytics sessions table so that you can calculate FacebookROAS.

In this case, your available join keys are the date and yourUTM parameters(utm_medium, source, campaign, etc). Seems easy, right? Just do a join on all 6 fields and call it a day.
Unfortunately that doesn’t work, for a really ** reason: it’s extremely common for some subset of those fields to be null, and a null doesn’t join to another null. So, that 6-field join is a dead end. You can hack together something incredibly complicated using a bunch of conditional logic, but that code is hideous and performs terribly (I’ve tried it).

Instead, use md5(). In both datasets, you can take the 6 fields we mentioned and concatenate them together into a single string, and then call md5() on the entire string. Here’s a code snippet from a client project where we did exactly this:

You can see that this code is actually building the id on top of even more fields: in this example we’re actually unioning together advertising spend data from 7 different ad channels, and the data from Bing and Adwords is identified by ad_group_id and keyword_id instead of by UTM parameters. The approach extends cleanly.

In the sessions table, you then create the exact same hashed id field.The resulting join is **, readable, and easy to use for downstream analysis: