I have a large dataset (23M rows) in the following format:
names, sentiment
["Lily","Kerry","Mona"], 10
["Kerry", "Mona"], 2
["Mona"], 0我想计算名称列中每个唯一名称的平均情绪,结果是:
name, sentiment
"Lily", 10
"Kerry", 6
"Mona", 4发布于 2020-06-18 02:52:29
Just explode the array, then group by name.

The Spark (PySpark) equivalent:
import pyspark.sql.functions as f

# explode turns each array element into its own row, then we average per name
df1 = df.select(f.explode('names').alias('name'), 'sentiment')
df1.groupBy('name').agg(f.avg('sentiment').alias('sentiment')).show()

Posted on 2020-06-17 20:00:44
import org.apache.spark.sql.functions.{avg, col, explode}
import spark.implicits._ // for toDF and the 'names symbol syntax

val avgDF = Seq((Seq("Lily","Kerry","Mona"), 10),
  (Seq("Kerry", "Mona"), 2),
  (Seq("Mona"), 0)
).toDF("names", "sentiment")
val avgDF1 = avgDF.withColumn("name", explode('names))
val avgResultDF = avgDF1.groupBy("name").agg(avg(col("sentiment")))
avgResultDF.show(false)
// +-----+--------------+
// |name |avg(sentiment)|
// +-----+--------------+
// |Lily |10.0 |
// |Kerry|6.0 |
// |Mona |4.0 |
// +-----+--------------+

https://stackoverflow.com/questions/62436130
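For reference, the same explode-then-average logic can be sketched in plain Python (no Spark required), using the sample rows from the question. This is only a minimal illustration of what explode + groupBy + avg computes, not how you would process 23M rows:

```python
from collections import defaultdict

# Sample rows from the question: (list of names, sentiment)
rows = [
    (["Lily", "Kerry", "Mona"], 10),
    (["Kerry", "Mona"], 2),
    (["Mona"], 0),
]

totals = defaultdict(lambda: [0, 0])  # name -> [sum of sentiments, count]
for names, sentiment in rows:
    for name in names:  # "explode": one record per name in the array
        totals[name][0] += sentiment
        totals[name][1] += 1

# "groupBy + avg": divide each name's sum by its count
averages = {name: s / c for name, (s, c) in totals.items()}
print(averages)  # {'Lily': 10.0, 'Kerry': 6.0, 'Mona': 4.0}
```

This matches the Spark output above: each name's sentiment is averaged over every row whose array contains that name.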