I have two DataFrames with the following schemas:
df1:
root
|-- story_id: string (nullable = true)
|-- uri: string (nullable = true)
df2:
root
|-- uri: string (nullable = true)
Below is sample data from both DataFrames:
df1:
|story_id |uri |
+----------------------------------+---------------------------------------------------------------------------------------------------------------+
|0x5883d731edac1ca0d115c1ef05fb45e3|https://ilovupdates.com/i-know-what-you-did-last-summer-amazon-originals/|
|0x5883d731edac1ca0d115c1ef05fb45e3|https://ilovupdates.com/annabelle-sethupathi-2021-full-movie-download/|
|0x5883d731edac1ca0d115c1ef05fb45e3|https://ilovupdates.com/vicky-kaushal-biography/|
+----------------------------------+---------------------------------------------------------------------------------------------------------------+
df2:
+----------------------------------------------------------------------------------+
|uri |
+----------------------------------------------------------------------------------+
|https://tvnewsalert.com/topics/entertainment/ |
|https://tvnewsalert.com/private-vaccine-verification-app-portpass-sparks-privacy-security-concerns/|
|https://tvnewsalert.com/kota-factory-season-2-web-series-download-leaked-on-moviesflix-480p/ |
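+----------------------------------------------------------------------------------+
Essentially, df1 contains multiple URLs for each story ID, and there are multiple story IDs in df1.
df2 contains a set of master URLs.
The task is to find, for each story ID in df1, how many of its URLs appear in df2's set of master URLs.
So, for an input like this: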
df1
s1, url1
s1, url2
s1, url3
s2, url2
s2, url4
df2
url1
url3
url4
the output should be:
s1, 2
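s2, 1
Posted on 2021-09-29 10:10:16
The task is to find how many URLs from each story ID in df1 appear in df2's set of master URLs.
You can inner-join df1 with df2 on uri, which keeps only the df1 rows whose uri also exists in df2, and then group by story_id and count the rows: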
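# The inner join keeps only df1 rows whose uri is in df2; count() then gives the matches per story_id
output_df = df1.join(df2, ['uri'], "inner").groupBy("story_id").count()
output_df.show()
Let me know if this works for you.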
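As a quick sanity check that does not need a Spark session, the same match-then-count logic can be sketched in plain Python on the toy input from the question. This is only an illustration: the names df1_rows, df2_uris and count_matches are made up for this sketch and are not part of the original answer or of any Spark API.

```python
from collections import Counter

# Toy rows from the question: (story_id, uri) pairs for df1, bare uris for df2.
# These variable names are hypothetical and exist only for this sketch.
df1_rows = [("s1", "url1"), ("s1", "url2"), ("s1", "url3"), ("s2", "url2"), ("s2", "url4")]
df2_uris = ["url1", "url3", "url4"]


def count_matches(rows, master_uris):
    """Count, per story_id, how many of its uris appear in master_uris."""
    master = set(master_uris)  # set membership: a repeated master uri counts once
    # Like the inner join, stories with no matching uri do not appear in the result.
    return dict(Counter(story_id for story_id, uri in rows if uri in master))


counts = count_matches(df1_rows, df2_uris)
print(counts)  # {'s1': 2, 's2': 1}
```

One difference worth noting: because the master list is turned into a set here, a uri repeated in df2 is counted once. With the inner join, each duplicate row in df2 would add to the count, so calling distinct() on df2 before joining may be worth considering if duplicates are possible.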
https://stackoverflow.com/questions/69374171