
Spark pivot: reshaping data with an extra column

Stack Overflow user
Asked 2019-09-04 09:54:03
2 answers, 139 views, 0 followers, 1 vote

How can I compute a sum while reshaping (pivoting) the data?

val someDF = Seq(
  ("user1", "math","algebra-1","90"),
  ("user1", "physics","gravity","70"),
  ("user3", "biology","health","50"),
  ("user2", "biology","health","100"),
  ("user1", "math","algebra-1","40"),
  ("user2", "physics","gravity-2","20")
).toDF("user_id", "course_id","lesson_name","score")

someDF.show()
+-------+---------+-----------+-----+
|user_id|course_id|lesson_name|score|
+-------+---------+-----------+-----+
|  user1|     math|  algebra-1|   90|
|  user1|  physics|    gravity|   70|
|  user3|  biology|     health|   50|
|  user2|  biology|     health|  100|
|  user1|     math|  algebra-1|   40|
|  user2|  physics|  gravity-2|   20|
+-------+---------+-----------+-----+


val result = someDF.groupBy("user_id", "course_id").pivot("lesson_name").agg(first("score"))
result.show()
+-------+---------+---------+-------+---------+------+
|user_id|course_id|algebra-1|gravity|gravity-2|health|
+-------+---------+---------+-------+---------+------+
|user3  |biology  |null     |null   |null     |50    |
|user1  |math     |90       |null   |null     |null  |
|user2  |biology  |null     |null   |null     |100   |
|user2  |physics  |null     |null   |20       |null  |
|user1  |physics  |null     |70     |null     |null  |
+-------+---------+---------+-------+---------+------+

Expected output: an extra column holding the sum across all lesson_name columns

+-------+---------+---------+-------+---------+------+----+
|user_id|course_id|algebra-1|gravity|gravity-2|health|sum |
+-------+---------+---------+-------+---------+------+----+
|user3  |biology  |null     |null   |null     |50    |50  |
|user1  |math     |90       |null   |null     |null  |90  |
|user2  |biology  |null     |null   |null     |100   |100 | 
|user2  |physics  |null     |null   |20       |null  |20  | 
|user1  |physics  |null     |70     |null     |null  |70  | 
+-------+---------+---------+-------+---------+------+----+

But how can I get the sum of the score values across all of the pivoted fields for a particular course_id and batch_id?

Any suggestions?


2 Answers

Stack Overflow user

Accepted answer

Posted 2019-09-05 12:55:05

I ended up implementing it like this with Window.partitionBy; it may be useful to someone.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{first, sum}  // already in scope in spark-shell

val someDF = Seq(
  ("user1", "math","algebra-1","90"),
  ("user1", "physics","gravity","70"),
  ("user3", "biology","health","50"),
  ("user2", "biology","health","100"),
  ("user1", "math","algebra-1","40"),
  ("user2", "physics","gravity-2","20")
).toDF("user_id", "course_id","lesson_name","score")

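// Window covering all rows that share the same user_id and course_id
val assessmentAggDf = Window.partitionBy("user_id", "course_id")

// score is a string column here; Spark casts it to double when summing
val aggregatedDF = someDF.withColumn("total_sum_score", sum("score") over assessmentAggDf)

// Group by the precomputed total as well, so it survives the pivot as its own column
val result = aggregatedDF.groupBy("user_id", "course_id", "total_sum_score").pivot("lesson_name").agg(first("score"))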
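
One detail of the window approach is easy to miss: the window total counts every input row, while first("score") keeps only one value per pivoted cell. The following is a minimal sketch of that difference, assuming a local SparkSession and assuming the scores are whole numbers that can be cast to int (the cast and the name byUserCourse are not part of the original answer).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, first, sum}

// Assumption: a local session for experimentation (spark-shell already provides one)
val spark = SparkSession.builder().master("local[*]").appName("window-total").getOrCreate()
import spark.implicits._

val someDF = Seq(
  ("user1", "math", "algebra-1", "90"),
  ("user1", "physics", "gravity", "70"),
  ("user3", "biology", "health", "50"),
  ("user2", "biology", "health", "100"),
  ("user1", "math", "algebra-1", "40"),
  ("user2", "physics", "gravity-2", "20")
).toDF("user_id", "course_id", "lesson_name", "score")
  .withColumn("score", col("score").cast("int")) // assumption: integer scores

// Total over all rows of the same user and course, computed before the pivot
val byUserCourse = Window.partitionBy("user_id", "course_id")
val withTotal = someDF.withColumn("total_sum_score", sum("score").over(byUserCourse))

val result = withTotal
  .groupBy("user_id", "course_id", "total_sum_score")
  .pivot("lesson_name")
  .agg(first("score"))

// user1/math has two algebra-1 rows (90 and 40): total_sum_score is 130,
// while the algebra-1 cell holds only one of the two values.
result.show()
```

If the total should instead match the values visible in the pivoted row, summing the pivoted columns after the pivot is the alternative.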
Votes: 1

Stack Overflow user

Posted 2019-09-04 11:39:06

@manju I wrote this for you, but only for this specific problem, using Spark 2.4.3.

scala> result.show
+-------+---------+---------+-------+---------+------+
|user_id|course_id|algebra-1|gravity|gravity-2|health|
+-------+---------+---------+-------+---------+------+
|  user3|  biology|     null|   null|     null|    50|
|  user1|     math|       90|   null|     null|  null|
|  user2|  biology|     null|   null|     null|   100|
|  user2|  physics|     null|   null|       20|  null|
|  user1|  physics|     null|     70|     null|  null|
+-------+---------+---------+-------+---------+------+

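Replace "-" with "_" in all column names, because an unquoted hyphen causes errors when the DataFrame columns are referenced in selectExpr(): it is parsed as a minus sign.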

scala> val new_cols =  result.columns.map(x => x.replaceAll("-", "_"))
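
Renaming is one option. Another, sketched here, is to leave the names alone and wrap them in backticks when they appear in a SQL expression. quoteColumn is a hypothetical helper introduced only for illustration; it is not part of the answer or of the Spark API.

```scala
// Hypothetical helper, not part of the original answer: wraps a column name
// in backticks so Spark SQL reads it as a single identifier. A backtick
// inside the name is escaped by doubling it.
def quoteColumn(name: String): String =
  "`" + name.replace("`", "``") + "`"

val quoted = Seq("algebra-1", "gravity-2").map(quoteColumn)
println(quoted.mkString(", "))

// Possible use with the pivoted DataFrame from the question:
// result.selectExpr("user_id", quoteColumn("algebra-1"))
```

This keeps the original lesson names in the output, at the cost of having to quote them wherever they are used in an expression string.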

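Adding null to an integer yields null in Spark SQL, so the columns cannot simply be added together. The coalesce function can substitute 0 for each null and give the desired output.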

scala> result.toDF(new_cols : _*).selectExpr("*","coalesce(algebra_1, 0) +coalesce(gravity, 0)+coalesce(gravity_2, 0)+coalesce(health,0)  sum ").show
+-------+---------+---------+-------+---------+------+-----+
|user_id|course_id|algebra_1|gravity|gravity_2|health|  sum|
+-------+---------+---------+-------+---------+------+-----+
|  user3|  biology|     null|   null|     null|    50| 50.0|
|  user1|     math|       90|   null|     null|  null| 90.0|
|  user2|  biology|     null|   null|     null|   100|100.0|
|  user2|  physics|     null|   null|       20|  null| 20.0|
|  user1|  physics|     null|     70|     null|  null| 70.0|
+-------+---------+---------+-------+---------+------+-----+
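
The expression above names each lesson column by hand. When the set of lessons is not known in advance, the same coalesce idea can be applied to whatever columns the pivot produces. This is a minimal sketch using the Column API instead of selectExpr, assuming a local SparkSession and whole-number scores; keyCols, lessonCols and rowSum are illustrative names.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, first, lit}

// Assumption: a local session for experimentation (spark-shell already provides one)
val spark = SparkSession.builder().master("local[*]").appName("pivot-sum").getOrCreate()
import spark.implicits._

val someDF = Seq(
  ("user1", "math", "algebra-1", "90"),
  ("user1", "physics", "gravity", "70"),
  ("user3", "biology", "health", "50"),
  ("user2", "biology", "health", "100"),
  ("user1", "math", "algebra-1", "40"),
  ("user2", "physics", "gravity-2", "20")
).toDF("user_id", "course_id", "lesson_name", "score")

val keyCols = Set("user_id", "course_id")

// Cast score to int so the sum stays an integer (assumption: whole-number scores)
val pivoted = someDF
  .withColumn("score", col("score").cast("int"))
  .groupBy("user_id", "course_id")
  .pivot("lesson_name")
  .agg(first("score"))

// Every column that is not a grouping key was produced by the pivot
val lessonCols = pivoted.columns.filterNot(keyCols.contains)

// coalesce(c, 0) turns null into 0; backticks keep names such as algebra-1 intact
val rowSum = lessonCols.map(c => coalesce(col(s"`$c`"), lit(0))).reduce(_ + _)

val withSum = pivoted.withColumn("sum", rowSum)
withSum.show()
```

Because the column list is read from the pivoted DataFrame, no renaming step is needed and new lesson names are picked up automatically.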

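Read more about coalesce, and let me know if you have any other questions. If this solves your problem, please accept the answer. Happy Hadoop!

Votes: 0

Original page content provided by Stack Overflow; translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link: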

https://stackoverflow.com/questions/57785903
