How can I compute a sum while reshaping the data?
import spark.implicits._ // for toDF; spark-shell imports this automatically

val someDF = Seq(
  ("user1", "math", "algebra-1", "90"),
  ("user1", "physics", "gravity", "70"),
  ("user3", "biology", "health", "50"),
  ("user2", "biology", "health", "100"),
  ("user1", "math", "algebra-1", "40"),
  ("user2", "physics", "gravity-2", "20")
).toDF("user_id", "course_id", "lesson_name", "score")

someDF.show()
+-------+---------+-----------+-----+
|user_id|course_id|lesson_name|score|
+-------+---------+-----------+-----+
| user1| math| algebra-1| 90|
| user1| physics| gravity| 70|
| user3| biology| health| 50|
| user2| biology| health| 100|
| user1| math| algebra-1| 40|
| user2| physics| gravity-2| 20|
+-------+---------+-----------+-----+
import org.apache.spark.sql.functions.first // imported automatically in spark-shell

val result = someDF.groupBy("user_id", "course_id").pivot("lesson_name").agg(first("score"))
result.show()
+-------+---------+---------+-------+---------+------+
|user_id|course_id|algebra-1|gravity|gravity-2|health|
+-------+---------+---------+-------+---------+------+
|user3 |biology |null |null |null |50 |
|user1 |math |90 |null |null |null |
|user2 |biology |null |null |null |100 |
|user2 |physics |null |null |20 |null |
|user1 |physics |null |70 |null |null |
+-------+---------+---------+-------+---------+------+

Expected output: I should get the sum across all the lesson_name columns:
+-------+---------+---------+-------+---------+------+----+
|user_id|course_id|algebra-1|gravity|gravity-2|health|sum |
+-------+---------+---------+-------+---------+------+----+
|user3 |biology |null |null |null |50 |50 |
|user1 |math |90 |null |null |null |90 |
|user2 |biology |null |null |null |100 |100 |
|user2 |physics |null |null |20 |null |20 |
|user1 |physics |null |70 |null |null |70 |
+-------+---------+---------+-------+---------+------+----+

But how do I get the sum of the score values across all the lesson columns for a particular user_id and course_id?
Any suggestions?
Posted on 2019-09-05 12:55:05
Somehow I achieved this using Window.partitionBy; it may be useful to someone:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{first, sum} // imported automatically in spark-shell
val someDF = Seq(
  ("user1", "math", "algebra-1", "90"),
  ("user1", "physics", "gravity", "70"),
  ("user3", "biology", "health", "50"),
  ("user2", "biology", "health", "100"),
  ("user1", "math", "algebra-1", "40"),
  ("user2", "physics", "gravity-2", "20")
).toDF("user_id", "course_id", "lesson_name", "score")
// A WindowSpec over each (user_id, course_id) group
val assessmentAggDf = Window.partitionBy("user_id", "course_id")

// Attach the per-group total before pivoting, so every pivoted row keeps it
val aggregatedDF = someDF.withColumn("total_sum_score", sum("score") over assessmentAggDf)

val result = aggregatedDF.groupBy("user_id", "course_id", "total_sum_score").pivot("lesson_name").agg(first("score"))
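For reference, here is a sketch of what result.show() should print for the sample data above (not part of the original answer; row order may vary, and first() without an ordering is not deterministic, so algebra-1 could also come out as 40). Note one difference from the expected output in the question: the window sum counts duplicate lessons, so user1/math totals 130.0 (90 + 40):

result.show()
+-------+---------+---------------+---------+-------+---------+------+
|user_id|course_id|total_sum_score|algebra-1|gravity|gravity-2|health|
+-------+---------+---------------+---------+-------+---------+------+
|  user3|  biology|           50.0|     null|   null|     null|    50|
|  user1|     math|          130.0|       90|   null|     null|  null|
|  user2|  biology|          100.0|     null|   null|     null|   100|
|  user2|  physics|           20.0|     null|   null|       20|  null|
|  user1|  physics|           70.0|     null|     70|     null|  null|
+-------+---------+---------------+---------+-------+---------+------+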
Posted on 2019-09-04 11:39:06

@manju, I wrote this for you, but only for this question (Spark 2.4.3).
scala> result.show
+-------+---------+---------+-------+---------+------+
|user_id|course_id|algebra-1|gravity|gravity-2|health|
+-------+---------+---------+-------+---------+------+
| user3| biology| null| null| null| 50|
| user1| math| 90| null| null| null|
| user2| biology| null| null| null| 100|
| user2| physics| null| null| 20| null|
| user1| physics| null| 70| null| null|
+-------+---------+---------+-------+---------+------+

Replace "-" with "_" in all the column names, because the hyphen causes an error when the columns are referenced inside selectExpr() (the SQL parser reads algebra-1 as algebra minus 1).
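(An aside not in the original answer: instead of renaming, hyphenated column names can also be backtick-quoted inside selectExpr, which the Spark SQL parser accepts:)

result.selectExpr("*", "coalesce(`algebra-1`, 0) + coalesce(gravity, 0) + coalesce(`gravity-2`, 0) + coalesce(health, 0) as sum").show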
scala> val new_cols = result.columns.map(x => x.replaceAll("-", "_"))

Adding null to a number yields null in SQL, so the columns cannot be summed directly, but we can use the coalesce function to substitute 0 for the missing scores and get the desired output:
scala> result.toDF(new_cols: _*).selectExpr("*", "coalesce(algebra_1, 0) + coalesce(gravity, 0) + coalesce(gravity_2, 0) + coalesce(health, 0) as sum").show
+-------+---------+---------+-------+---------+------+-----+
|user_id|course_id|algebra_1|gravity|gravity_2|health| sum|
+-------+---------+---------+-------+---------+------+-----+
| user3| biology| null| null| null| 50| 50.0|
| user1| math| 90| null| null| null| 90.0|
| user2| biology| null| null| null| 100|100.0|
| user2| physics| null| null| 20| null| 20.0|
| user1| physics| null| 70| null| null| 70.0|
+-------+---------+---------+-------+---------+------+-----+

Read up on coalesce to learn more. Let me know if you have any further questions about this, and accept the answer if it resolved your problem. Happy Hadoop!
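A possible generalization (not from the original answer): rather than hardcoding the four lesson columns, the coalesce sum can be built programmatically from whatever columns the pivot produced, so new lessons are picked up automatically:

import org.apache.spark.sql.functions.{coalesce, col, lit}

// Everything except the grouping keys is a pivoted lesson column
val renamed = result.toDF(new_cols: _*)
val lessonCols = renamed.columns.filterNot(Seq("user_id", "course_id").contains)

// coalesce(c, 0) turns nulls into 0 so the row-wise sum is well defined
val sumExpr = lessonCols.map(c => coalesce(col(c), lit(0))).reduce(_ + _)

renamed.withColumn("sum", sumExpr).show()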
https://stackoverflow.com/questions/57785903