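I have a DataFrame of cities:
country  cities
UK       [London, London Luton, Luton]
UK       [London, London Gatwick, Gatwick]
and a reference DataFrame of airports:
city    airport         coords
London  London Luton    12.51
London  London Gatwick  100.32
I want to match the list of values in the cities column against the airport column of the reference df. Where there is a match, I want to pull the matching airport name and its coords from the reference df.
Example of the desired output for the cities df:
country  cities                             airport         coords
UK       [London, London Luton, Luton]      London Luton    12.51
UK       [London, London Gatwick, Gatwick]  London Gatwick  100.32
Explanation:
[London, **London Luton**, Luton] from cities matches **London Luton** in airport.
I have explored a few options but could not get any of them to work. Can anyone help? Thanks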
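Posted on 2022-03-02 14:21:42
You can use array_contains to flag the rows whose cities list contains the search string, and then filter for the rows where the flag is True.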
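Before the Spark version, the flag-then-filter logic can be sketched in plain Python. This is only an illustration of the idea, not the answer's code: `flag_and_filter` is a hypothetical helper, the rows are plain dicts mirroring the question's data, and the `in` test stands in for what `array_contains` checks (an exact match against one list element).

```python
# Plain-Python sketch of "join on country, flag with array_contains, filter".
cities_rows = [
    {"country": "UK", "cities": ["London", "London Luton", "Luton"]},
    {"country": "UK", "cities": ["London", "London Gatwick", "Gatwick"]},
]
airport_rows = [
    {"country": "UK", "city": "London", "airport": "London Luton", "coords": 12.51},
    {"country": "UK", "city": "London", "airport": "London Gatwick", "coords": 100.32},
]


def flag_and_filter(cities_rows, airport_rows):
    """Pair rows that share a country, keep pairs whose list contains the airport."""
    matched = []
    for c in cities_rows:
        for a in airport_rows:
            if c["country"] != a["country"]:
                continue  # plays the role of the inner join on country
            flag = a["airport"] in c["cities"]  # exact element match, like array_contains
            if flag:  # plays the role of the final filter on the flag column
                matched.append({**c, "airport": a["airport"], "coords": a["coords"]})
    return matched


result = flag_and_filter(cities_rows, airport_rows)
for row in result:
    print(row["cities"], row["airport"], row["coords"])
```

Note that every cities row is compared with every airport row of the same country before the filter runs, so the intermediate result grows with the number of airports per country.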
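Data preparation
import pandas as pd
from pyspark.sql import SparkSession, functions as F

# `sql` is not defined in the original snippet; a SparkSession is assumed here
sql = SparkSession.builder.getOrCreate()

d1 = {
    'cities': [
        ['London', 'London Luton', 'Luton'],
        ['London', 'London Gatwick', 'Gatwick'],
    ],
    'country': ['UK', 'UK'],
}
d2 = {
    'country': ['UK', 'UK'],
    'city': ['London', 'London'],
    'airport': ['London Luton', 'London Gatwick'],
    'coords': [12.51, 100.32],
}
sparkDF1 = sql.createDataFrame(pd.DataFrame(d1))
sparkDF2 = sql.createDataFrame(pd.DataFrame(d2))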
sparkDF1.show(truncate=False)
+---------------------------------+-------+
|cities |country|
+---------------------------------+-------+
|[London, London Luton, Luton] |UK |
|[London, London Gatwick, Gatwick]|UK |
+---------------------------------+-------+
sparkDF2.show()
+-------+------+--------------+------+
|country| city| airport|coords|
+-------+------+--------------+------+
| UK|London| London Luton| 12.51|
| UK|London|London Gatwick|100.32|
+-------+------+--------------+------+
Array contains
finalDF = sparkDF1.join(
    sparkDF2,
    sparkDF1['country'] == sparkDF2['country'],
    'inner',
).select(sparkDF1['*'], sparkDF2['airport'])
finalDF = finalDF.withColumn('flag', F.array_contains(F.col('cities'), F.col('airport')))
finalDF.filter(F.col('flag') == True).show(truncate=False)
+---------------------------------+-------+--------------+----+
|cities |country|airport |flag|
+---------------------------------+-------+--------------+----+
|[London, London Luton, Luton] |UK |London Luton |true|
|[London, London Gatwick, Gatwick]|UK |London Gatwick|true|
+---------------------------------+-------+--------------+----+
Posted on 2022-03-02 15:34:38
You can create a new column on the cities table that holds the airport name. Then you can simply join the two tables on the airport column.
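The code in this answer takes the second element of each cities list as the airport name. As a hedged sketch of the same derive-then-join idea in plain Python, the airport can instead be found by looking each list element up in the reference data, so its position in the list does not matter. `derive_airport` is a hypothetical helper, and the second row is deliberately reordered for illustration; it is not part of the question's data.

```python
# Reference data keyed by airport name, mirroring the question's reference df.
reference = {
    "London Luton": 12.51,
    "London Gatwick": 100.32,
}


def derive_airport(cities, known_airports):
    """Return the first list element that is a known airport name, else None."""
    for name in cities:
        if name in known_airports:
            return name
    return None


rows = [
    {"country": "UK", "cities": ["London", "London Luton", "Luton"]},
    # Reordered on purpose: the airport is not at index 1 here.
    {"country": "UK", "cities": ["Gatwick", "London", "London Gatwick"]},
]

joined = []
for row in rows:
    airport = derive_airport(row["cities"], reference)
    # The dict lookup plays the role of the join on the airport column.
    joined.append({**row, "airport": airport, "coords": reference.get(airport)})

for row in joined:
    print(row["cities"], row["airport"], row["coords"])
```

Because the lookup is keyed on the airport name, each list element is checked once against the reference data rather than against every airport row.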
Using the prep code from @Vaebhav:
# Takes the element at index 1, so this assumes the airport name is always second in the list
sparkDF1 = sparkDF1.withColumn("airport", F.col("cities")[1])
sparkDF1.show(truncate=False)
+---------------------------------+-------+--------------+
|cities |country|airport |
+---------------------------------+-------+--------------+
|[London, London Luton, Luton] |UK |London Luton |
|[London, London Gatwick, Gatwick]|UK |London Gatwick|
+---------------------------------+-------+--------------+
finalDF = sparkDF1.join(sparkDF2, on="airport", how="right")
finalDF.show(truncate=False)
+--------------+---------------------------------+-------+-------+------+------+
|airport |cities |country|country|city |coords|
+--------------+---------------------------------+-------+-------+------+------+
|London Luton |[London, London Luton, Luton] |UK |UK |London|12.51 |
|London Gatwick|[London, London Gatwick, Gatwick]|UK |UK |London|100.32|
+--------------+---------------------------------+-------+-------+------+------+
https://stackoverflow.com/questions/71323050