I'm confused about how to filter some movie titles. My problem is that I have a lot of different movie titles, for example:
Movies:                          Visitors:
Breaking Dawn Part 1+2           100
Breaking Dawn 1+2                40
Breaking Dawn 1 + 2              30
Dark Knight trilogy 3D           100
Dark Knight trilogy 3D           40
Dark Knight Trilogy HDF          30
Dark Knight Trilogy -HDF         100
Dark Knight trilogy_ (blank)     44
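etc. +10000

So there are many different movie titles. The names are not unique, and some of them also have trailing whitespace. I can fix part of the problem, but there are still many titles that refer to the same movie yet are written differently. My current query is: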
SELECT regexp_replace(rtrim(allcinemadata.title), " - 3D | - 3D |3D |3D| 3D| - (3D) | - (3D) |(3D)", "")
    as clearTitle
FROM default.allcinemadata
group by
    regexp_replace(rtrim(allcinemadata.title), " - 3D | - 3D |3D |3D| 3D| - (3D) | - (3D) |(3D)", "")

I'm not sure whether this is the best solution to this problem. I hope you can help me! :)
Posted on 2017-03-04 05:21:02
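I couldn't test this on a larger data set, but it works for the sample data you provided in the question. It gets the total number of views per movie name based on the soundex function; the rest of the SQL is self-explanatory.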
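WITH movviews AS (
    SELECT moviename,
           totalviews,
           -- rank the titles within each Soundex group by their view total
           Rank() OVER (PARTITION BY Soundex(moviename)
                        ORDER BY totalviews DESC) rnk
    FROM (
        SELECT moviename,
               -- running sum of views over titles sharing a Soundex code;
               -- the last title in name order carries the full group total
               Sum(views) OVER (PARTITION BY Soundex(moviename)
                                ORDER BY moviename) AS totalviews,
               views
        FROM movieviews
        ORDER BY moviename
    ) vv
)
-- keep one representative title per Soundex group
SELECT movviews.moviename,
       movviews.totalviews
FROM movviews
WHERE rnk = 1

Output: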
movviews.moviename               movviews.totalviews
Breaking Dawn Part 1+2           170
Dark Knight trilogy_ (blank)     314
Time taken: 62.257 seconds, Fetched: 2 row(s)
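
Soundex keeps only the first letter of a string plus three digits derived from the following consonants, so on a data set of 10000+ titles, unrelated movies may well end up in the same group. A minimal sketch for inspecting the groups before trusting the totals, assuming the same test table movieviews(moviename, views) as in the answer and a Hive version that provides soundex (1.2.0 or later); the aliases code and titles are only illustrative:

```sql
-- List every distinct title that falls under each Soundex code,
-- together with the summed views for that code.
-- Assumption: movieviews(moviename, views) is the answer's test table.
SELECT Soundex(moviename)      AS code,
       collect_set(moviename)  AS titles,
       Sum(views)              AS totalviews
FROM movieviews
GROUP BY Soundex(moviename);
```

If a code lists titles that are clearly different movies, cleaning the titles first (for example with the regexp_replace approach from the question) and grouping on the cleaned title is probably the safer route.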
hive (default)>

https://stackoverflow.com/questions/31312902