我希望能够从蜂窝表中的域名中删除后缀。因为后缀有多种类型(.co.uk,.com,.ac.uk,.)我想知道什么是最好的方法?
发布于 2015-01-28 18:01:02
这听起来像是你想隔离域名的最后两部分,这可以用regexp_extract来完成。使用您发布的部分数据作为具体示例:
hive> DESCRIBE dns_data;
OK
fullname string
Time taken: 0.871 seconds, Fetched: 1 row(s)
hive> SELECT * FROM dns_data;
OK
a632.phobos.apple.com
a1554.phobos.apple.com
a1207.phobos.apple.com
a1877.phobos.apple.com
lnll.googlevideo.com
a8-19.clients.cdn13.com
r6---sn-aigllnzk.googlevideo.com
r1---sn-cg07lues.googlevideo.com
r4---sn-aiglln6k.googlevideo.com
r18---sn-aigllnel.googlevideo.com
r11---sn-aigllnly.googlevideo.com
r20---sn-aigllnsk.googlevideo.com
r2---sn-aigllnee.googlevideo.com
r14---sn-aigllnld.googlevideo.com
r19---sn-aigllnll.googlevideo.com
r14---sn-aigllnsd.googlevideo.com
Time taken: 0.589 seconds, Fetched: 16 row(s)可以以以下方式使用regexp_extract来隔离最后两个字符串:
hive> SELECT
> stripped,
> suffix,
> split(suffix, "\\.") AS suffix_split
> FROM (SELECT
> regexp_extract(fullname, "^(.*?)\\.((?:[^.]+\\.?){2})$", 1) AS stripped,
> regexp_extract(fullname, "^(.*?)\\.((?:[^.]+\\.?){2})$", 2) AS suffix
> FROM dns_data
> ) t;
...
... Lots of MapReduce spam
...
OK
a632.phobos apple.com ["apple","com"]
a1554.phobos apple.com ["apple","com"]
a1207.phobos apple.com ["apple","com"]
a1877.phobos apple.com ["apple","com"]
lnll googlevideo.com ["googlevideo","com"]
a8-19.clients cdn13.com ["cdn13","com"]
r6---sn-aigllnzk googlevideo.com ["googlevideo","com"]
r1---sn-cg07lues googlevideo.com ["googlevideo","com"]
r4---sn-aiglln6k googlevideo.com ["googlevideo","com"]
r18---sn-aigllnel googlevideo.com ["googlevideo","com"]
r11---sn-aigllnly googlevideo.com ["googlevideo","com"]
r20---sn-aigllnsk googlevideo.com ["googlevideo","com"]
r2---sn-aigllnee googlevideo.com ["googlevideo","com"]
r14---sn-aigllnld googlevideo.com ["googlevideo","com"]
r19---sn-aigllnll googlevideo.com ["googlevideo","com"]
r14---sn-aigllnsd googlevideo.com ["googlevideo","com"]
Time taken: 22.073 seconds, Fetched: 16 row(s)https://stackoverflow.com/questions/28193560
复制相似问题