我有一个包含1000行和3个变量( ID、国家和字符串变量"VAR1")的表。VAR1是由一个空格分隔的单词组成的句子。
我想按国家来统计所有的夫妻(或三胞胎)的单词。非常重要的是,夫妻(或三胞胎)是所有单词的交叉(不一定是一步一步地)。也许,我们可以用ngram hql-函数来实现它,但是当我使用它时,它是一步一步地计算单词,而不是所有的交叉。
让我们来看看这个例子,让你知道我想要的是什么:
> **"ID" "COUNTRY" "VAR1"**
> "1" "CANADA" "dad mum child"
> "2" "CANADA" "dad mum dog"
> "3" "USA" "bird lion car"VAR1不一定是三个字的长度。只是为了简化。
我要分四个步骤来做2- in的检查:
步骤1:最重要的一步:用2横过单词
> "1" "CANADA" "dad mum" 1
> "1" "CANADA" "dad child" 1
> "1" "CANADA" "mum dad" 1
> "1" "CANADA" "mum child" 1
> "1" "CANADA" "child dad" 1
> "1" "CANADA" "child mum" 1
> "2" "CANADA" "dad mum" 1
> "2" "CANADA" "dad dog" 1
> "2" "CANADA" "mum dad" 1
> "2" "CANADA" "mum dog" 1
> "2" "CANADA" "dog dad" 1
> "2" "CANADA" "dog mum" 1
> "3" "USA" "bird lion" 1
> "3" "USA" "bird car" 1
> "3" "USA" "lion bird" 1
> "3" "USA" "lion car" 1
> "3" "USA" "car bird" 1
> "3" "USA" "car lion" 1 步骤2:订购2克
> "1" "CANADA" "dad mum" 1
> "1" "CANADA" "child dad" 1
> "1" "CANADA" "dad mum" 1
> "1" "CANADA" "child mum" 1
> "1" "CANADA" "child dad" 1
> "1" "CANADA" "child mum" 1
> "2" "CANADA" "dad mum" 1
> "2" "CANADA" "dad dog" 1
> "2" "CANADA" "dad mum" 1
> "2" "CANADA" "dog mum" 1
> "2" "CANADA" "dad dog" 1
> "2" "CANADA" "dog mum" 1
> "3" "USA" "bird lion" 1
> "3" "USA" "bird car" 1
> "3" "USA" "bird lion" 1
> "3" "USA" "car lion" 1
> "3" "USA" "bird car" 1
> "3" "USA" "car lion" 1 步骤3:按ID、国家、2纳克区分
> "1" "CANADA" "dad mum"
> "1" "CANADA" "child dad"
> "1" "CANADA" "child mum"
> "2" "CANADA" "dad mum"
> "2" "CANADA" "dad dog"
> "2" "CANADA" "dog mum"
> "3" "USA" "bird lion"
> "3" "USA" "bird car"
> "3" "USA" "car lion" 步骤4:按国家计算,2纳克
> "CANADA" "dad mum" 2
> "CANADA" "child dad" 1
> "CANADA" "child mum" 1
> "CANADA" "dad dog" 1
> "CANADA" "dog mum" 1
> "USA" "bird lion" 1
> "USA" "bird car" 1
> "USA" "car lion" 1非常感谢
发布于 2017-09-14 09:51:22
with cte as
(
select t.ID
,t.COUNTRY
,pe.pos
,pe.val
from mytable t
lateral view posexplode (split(VAR1,'\\s+')) pe
)
select t1.COUNTRY
,concat_ws(' ',t1.val,t2.val) as combination
,count (*) as cnt
from cte t1
join cte t2
on t2.id =
t1.id
where t1.pos < t2.pos
group by t1.COUNTRY
,t1.val
,t2.val
;+----------+--------------+------+
| country | combination | cnt |
+----------+--------------+------+
| CANADA | dad child | 1 |
| CANADA | dad dog | 1 |
| CANADA | dad mum | 2 |
| CANADA | mum child | 1 |
| CANADA | mum dog | 1 |
| USA | bird car | 1 |
| USA | bird lion | 1 |
| USA | lion car | 1 |
+----------+--------------+------+https://stackoverflow.com/questions/46200410
复制相似问题