我们是一个链接全球人工智能领域顶尖工程师、学者以及科学家的社群。
SQL中md5()大家可能都很熟悉,但是你真正了解它,会充分使用它吗?不见得。如果您不大熟悉也没关系,我们做了一个基本的介绍,让你有个概念。实际这个功能暗藏玄机,一旦被好好利用,真的是大有用处。 Find out why md5() is an SQL function that's used surprisingly often, and find out how -- and why -- you can use it yourself. Over the past nine months I’ve worked with over a dozen venture-funded startups to build out their internal analytics. In doing so, there’s a single SQL function that I have come to use surprisingly often. At first it wasn’t at all clear to me why I would want to use this function, but as time goes on I have found ever more uses for it. What is it?md5(). If you’re not familiar, here’s an example snippet from the Redshift docs: select md5('Amazon Redshift'); md5----------------------------------f7415e33f972c03abd4f3fed36748f7a (1 row) Give md5() a varchar and it returns itsMD5 hash. **... but seemingly pointless.Why exactly would I want to use that?! Great question. In this post I’m going to show you two uses for md5() that make it one of the most powerful tools in my SQL kit. #1: Building Yourself a UniqueID
It’s extremely common for this not to be the case. One reason is that your source data doesn’t have a unique keyif you’re syncing advertising performance data from Facebook Ads viaStitchorFivetran, the source data in your ad_insights table doesn’t have a unique key you can rely on. Instead, you have a combination of fields that is reliably unique (in this case date and ad_id). Using that knowledge, you can build yourself a unique id using md5(): select
md5(date_start::varchar || ad_id::varchar) as insight_idfrom
stitch_fb_ads.facebook_ads_insights
limit 5;
insight_id----------------------------------6d475ea96f23b097b51ed500116d8c5e 822c9429eabb28ccbcd7286836d7cd60 8b7fcd2aff879772ccac4f0f8bcb6a45 8a2cfd7eb1a723c49db47232e73ca29c 10338719dfadb3d4c9d44c608063998a
(5 rows)
The resulting hash is a meaningless string of alphanumeric text that functions as a unique identifier for your record. Of course, you could just as easily just create a single concatenated varchar field that performed the same function, butit’s actually important to obfuscate the underlying logic behind the hash: you will innately treat the field differently if it looks like an id versus if it looks like a jumble of human-readable text. There are a couple of reasons why creating a unique id is an important practice:
We create unique keys for every table and then test uniqueness on this key usingdbt schema tests. We run these tests multiple times per day onSinterand get notifications for any failures. This allows us to be completely confident of the analytics we implement on top of these data models. #2: Simplifying ComplexJoins
In this case, your available join keys are the date and yourUTM parameters(utm_medium, source, campaign, etc). Seems easy, right? Just do a join on all 6 fields and call it a day. Instead, use md5(). In both datasets, you can take the 6 fields we mentioned and concatenate them together into a single string, and then call md5() on the entire string. Here’s a code snippet from a client project where we did exactly this: You can see that this code is actually building the id on top of even more fields: in this example we’re actually unioning together advertising spend data from 7 different ad channels, and the data from Bing and Adwords is identified by ad_group_id and keyword_id instead of by UTM parameters. The approach extends cleanly. In the sessions table, you then create the exact same hashed id field.The resulting join is **, readable, and easy to use for downstream analysis: Resources
Thanks for reading! I’m definitely curious to hear if anyone has any additional clever uses for md5(). Bio:Tristan Handyis Founder and President ofFishtown Analytics. He builds open source tools for advanced analytics. 之前发布的Data Scientist面试问题其余答案正在整理中,会再下周公布,如果大家有感兴趣的话题可以给小编留言,您的意见是我们不断提高的动力,感谢支持! 关注本公众号,回复“人工智能 ”, 了解最新的人工智能行业动态(独家分享,不容错过) |
|
声明:文章版权归原作者所有 部分文章转自互联网 如有侵权请联系
[邮箱地址] 删除
|