I'm looking for the most elegant and efficient way to convert a dictionary into a Spark DataFrame with PySpark, given the input and output described below.
Input:
data = {"key1": ["val1", "val2", "val3"], "key2": ["val3", "val4", "val5"]}
Output:
vals | keys
------------
"val1" | ["key1"]
"val2" | ["key1"]
"val3" | ["key1", "key2"]
"val4" | ["key2"]
"val5" | ["key2"]编辑:我更喜欢用火花来做大部分的操作。也许先把它转换成
vals | keys
------------
"val1" | "key1"
"val2" | "key1"
"val3" | "key1"
"Val3" | "key2"
"val4" | "key2"
"val5" | "key2"发布于 2022-08-16 16:02:01
First, build a Spark DataFrame from the dictionary's items. Then explode vals, group by vals, and collect all the keys that contain each value.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# In spark-shell/notebooks `spark` already exists; otherwise create it:
spark = SparkSession.builder.getOrCreate()

data = {"key1": ["val1", "val2", "val3"], "key2": ["val3", "val4", "val5"]}
df = spark.createDataFrame(data.items(), ("keys", "vals"))
(df.withColumn("vals", F.explode("vals"))
.groupBy("vals").agg(F.collect_list("keys").alias("keys"))
).show()
"""
+----+------------+
|vals| keys|
+----+------------+
|val1| [key1]|
|val3|[key1, key2]|
|val2| [key1]|
|val4| [key2]|
|val5| [key2]|
+----+------------+
"""https://stackoverflow.com/questions/73373066