
Spark & complex case statement derivation

Stack Overflow user
Asked on 2017-11-17 15:36:40
2 answers · 1.5K views · 0 followers · Score: 1

I need to implement the following scenarios on DataFrames using Scala:

Scenario 1: If the KEY exists once, take TYPE_VAL as is.
            Eg: KEY=66 exists once, so take TYPE_VAL=100
Scenario 2: If the KEY exists more than once and all TYPE_VALs are the same, take TYPE_VAL once.
            Eg: for KEY=68, TYPE_VAL=23
Scenario 3: If the KEY exists more than once, find the repeated TYPE_VAL and subtract the other TYPE_VALs from it.
            Eg: for KEY=67, TYPE_VAL=10 exists twice, so subtract 2 & 4 from 10; finally TYPE_VAL=4

I tried grouping by the same key, but could not derive all of the scenarios.

// Sample input values (note the comma after the List("67","4") element)
val values = List(List("66","100"),
  List("67","10"), List("67","10"), List("67","2"), List("67","4"),
  List("68","23"), List("68","23")).map(x => (x(0), x(1)))

import spark.implicits._
// create a DataFrame
val df1 = values.toDF("KEY","TYPE_VAL")

df1.show(false)
+---+--------+
|KEY|TYPE_VAL|
+---+--------+
|66 |100     |
|67 |10      |
|67 |10      |
|67 |2       |
|67 |4       |
|68 |23      |
|68 |23      |
+---+--------+

Expected output:

df2.show(false)
+---+--------+
|KEY|TYPE_VAL|
+---+--------+
|66 |100     | -------> [single row, so 100]
|67 |4       | -------> [four rows, two of which are the same & the rest different, so (10 - 2 - 4) = 4]
|68 |23      | -------> [two rows with the same value, so 23]
+---+--------+

2 Answers

Stack Overflow user

Accepted answer

Posted on 2017-11-17 16:29:09

If you can assume that the number of records per key isn't too large (say, up to a few thousand?), you can use collect_list after grouping to gather all matches into an array, and then apply a UDF that computes the result from that array:

import org.apache.spark.sql.functions._
import spark.implicits._

// create the sample data:
val df1 = List(
  (66, 100),
  (67, 10),
  (67, 10),
  (67, 2),
  (67, 4),
  (68, 23),
  (68, 23)
).toDF("KEY", "TYPE_VAL")

// define a UDF that computes the result per scenario for a given Seq[Int]. 
// This is just one possible implementation, simpler ones probably exist...
val computeTypeVal = udf { (vals: Seq[Int]) =>
  vals.groupBy(identity).values.toList.sortBy(-_.size).flatten match {
    case a :: Nil => a
    case a :: b :: tail if a == b => a - tail.filterNot(_ == a).sum
    case _ => 0 // or whatever else should be done for other cases
  }
}

// group by key, use functions.collect_list to collect all values per key, and apply the UDF
df1.groupBy($"KEY")
  .agg(collect_list($"TYPE_VAL") as "VALS")
  .select($"KEY", computeTypeVal($"VALS") as "TYPE_VAL")
  .sort($"KEY")
  .show() 
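The inner matching logic can also be sanity-checked on plain Scala collections, without Spark. A minimal sketch, mirroring the body of the UDF above; `computeTypeValPlain` is a hypothetical helper name, not part of the original answer:

```scala
// Plain-Scala version of the UDF body, for checking the three scenarios directly.
def computeTypeValPlain(vals: Seq[Int]): Int =
  vals.groupBy(identity).values.toList.sortBy(-_.size).flatten match {
    case a :: Nil                 => a                              // scenario 1: single value
    case a :: b :: tail if a == b => a - tail.filterNot(_ == a).sum // scenarios 2 & 3
    case _                        => 0                              // fallback for other cases
  }

computeTypeValPlain(Seq(100))          // scenario 1 → 100
computeTypeValPlain(Seq(23, 23))       // scenario 2 → 23
computeTypeValPlain(Seq(10, 10, 2, 4)) // scenario 3 → 10 - (2 + 4) = 4
```

Sorting the groups by descending size puts the repeated value first, so the `a :: b :: tail if a == b` guard covers both the "all equal" and the "subtract the rest" cases with one pattern.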
Score: 1

Stack Overflow user

Posted on 2017-11-18 20:08:06

Enhancing the solution shared by user Tzach Zohar to handle input columns with mixed content (e.g. Int, Double, null):

val df1 = List(
  (66, Some("100")),
  (67, Some("10.4")),
  (67, Some("10.4")),
  (67, Some("2")),
  (67, Some("4")),
  (68, Some("23")),
  (68, Some("23")),
  (99, None),
  (999,Some(""))
).toDF("KEY", "TYPE_VAL")

df1.show()
+---+--------+
|KEY|TYPE_VAL|
+---+--------+
| 66|     100|
| 67|    10.4|
| 67|    10.4|
| 67|       2|
| 67|       4|
| 68|      23|
| 68|      23|
| 99|    null|
|999|        |
+---+--------+

So the enhanced UDF is as follows:

val computeTypeVal = udf { (vals: Seq[String]) =>
  vals.groupBy(identity).values.toList.sortBy(-_.size).flatten match {
    case a :: Nil => if (a == "") None else Some(a.toDouble) 
    case a :: b :: tail if a == b => Some(a.toDouble - tail.map(_.toDouble).filterNot(_ == a.toDouble).sum)
    case _ => Some(0.00) // or whatever else should be done for other cases
  }
}

df1.groupBy($"KEY")
  .agg(collect_list($"TYPE_VAL") as "VALS")
  .select($"KEY", computeTypeVal($"VALS") as "TYPE_VAL")
  .show()

+---+--------+
|KEY|TYPE_VAL|
+---+--------+
| 68|    23.0|
|999|    null|
| 99|     0.0|
| 66|   100.0|
| 67|     4.4|
+---+--------+
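One caveat worth noting (an observation, not part of the original answer): `a.toDouble` throws `NumberFormatException` for any non-numeric string other than the `""` that the first case handles. A hedged sketch of a more defensive variant, assuming plain Scala and `scala.util.Try`; `parseDouble` and `computeTypeValSafe` are hypothetical names, and this variant deliberately returns `None` (null) rather than 0.0 when nothing parses:

```scala
import scala.util.Try

// Parse defensively: values that fail to parse (e.g. "" or garbage) become None
// and are dropped before pattern matching.
def parseDouble(s: String): Option[Double] = Try(s.toDouble).toOption

def computeTypeValSafe(vals: Seq[String]): Option[Double] =
  vals.flatMap(parseDouble).groupBy(identity).values.toList.sortBy(-_.size).flatten match {
    case a :: Nil                 => Some(a)
    case a :: b :: tail if a == b => Some(a - tail.filterNot(_ == a).sum)
    case Nil                      => None      // nothing parseable: surface as null
    case _                        => Some(0.0) // other cases, as in the original
  }

computeTypeValSafe(Seq("23", "23")) // Some(23.0)
computeTypeValSafe(Seq(""))         // None
```

Wrapping `computeTypeValSafe` in `udf { ... }` gives a drop-in replacement for the UDF above; the behavioral difference is that keys with no parseable values (here 99 and 999) would both yield null instead of one of them yielding 0.0.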
Score: 0
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/47354212
