I want to use my own equality comparators with DataFrame operations.
Suppose I want to perform an operation like this:

df.groupBy("Year", "Month").sum("Counter")

on this DataFrame:
Year | Month   | Counter
------------------------
2012 | Jan     | 100
12   | January | 200
12   | Janu    | 300
2012 | Feb     | 400
13   | Febr    | 500

I have to perform two comparisons:

(1) On column Year: e.g. "2012" == "12"
(2) On column Month: e.g. "Jan" == "January" == "Janu"
Suppose I have already implemented these two comparators. How can I invoke them? As in this example, I already know that I have to convert the DataFrame to an RDD in order to use the comparators.

I thought about using RDD's groupBy.

Note that I really need to use comparators for this task. I cannot use UDFs, change the data, or create new columns. The longer-term idea is to have ciphertext columns, together with functions that let me check whether two ciphertexts encrypt the same value; I want to use those functions in my comparators.
Edit:
At the moment I only want to do this with a single column, e.g.:

df.groupBy("Year").sum("Counter")

I have a wrapper class:
class ExampleWrapperYear(val year: Any) extends Serializable {
  // override hashCode and equals methods
}

Then I do the following:

val rdd = df.rdd.keyBy(a => new ExampleWrapperYear(a(0))).groupByKey()

My problem here is how to do the "sum", and how to use keyBy with multiple columns so I can use both ExampleWrapperYear and ExampleWrapperMonth.
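To illustrate how the "sum" and a multi-column key can work, here is a minimal plain-Scala sketch (no Spark, so it runs standalone). The wrapper classes and their normalisation rules are hypothetical stand-ins for ExampleWrapperYear/ExampleWrapperMonth; in Spark the same shape would be rdd.keyBy(...).groupByKey() followed by a mapValues that sums the counters.

```scala
// Hypothetical wrappers -- my own stand-ins, not the asker's actual classes.
class YearKey(val year: String) extends Serializable {
  private def norm: String = if (year.length == 2) "20" + year else year
  override def hashCode(): Int = norm.hashCode
  override def equals(that: Any): Boolean = that match {
    case other: YearKey => norm == other.norm
    case _              => false
  }
}

class MonthKey(val month: String) extends Serializable {
  private def norm: String = month.take(3) // "Jan", "January", "Janu" -> "Jan"
  override def hashCode(): Int = norm.hashCode
  override def equals(that: Any): Boolean = that match {
    case other: MonthKey => norm == other.norm
    case _               => false
  }
}

// A case class combines the two wrappers into one grouping key; its generated
// equals/hashCode delegate to the overridden ones above.
case class RowKey(y: YearKey, m: MonthKey)

val rows = List(("2012", "Jan", 100), ("12", "January", 200),
                ("12", "Janu", 300), ("2012", "Feb", 400), ("13", "Febr", 500))

// keyBy + groupByKey + "sum", expressed with stdlib collections:
val sums: Map[RowKey, Int] =
  rows.groupBy(r => RowKey(new YearKey(r._1), new MonthKey(r._2)))
      .map { case (k, grp) => k -> grp.map(_._3).sum }

// sums has three entries: (2012,Jan)->600, (2012,Feb)->400, (2013,Feb)->500
```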
Posted on 2019-03-15 19:28:47
This solution should work. Here are the case classes with hashCode and equals overridden (we can call them the comparators).

You can modify/update hashCode and equals according to your ciphertexts.
case class Year(var year: Int) {
  override def hashCode(): Int = {
    // normalises the field to the standard 4-digit value as a side effect
    this.year = this.year match {
      case 2012 => 2012
      case 12   => 2012
      case 13   => 2013
      case _    => this.year
    }
    this.year.hashCode()
  }

  override def equals(that: Any): Boolean = {
    val year1 = 2000 + that.asInstanceOf[Year].year % 100
    val year2 = 2000 + this.year % 100
    year1 == year2
  }
}
case class Month(var month: String) {
  override def hashCode(): Int = {
    // normalises the field to the standard short name as a side effect
    this.month = this.month match {
      case "January"  => "Jan"
      case "Janu"     => "Jan"
      case "February" => "Feb"
      case "Febr"     => "Feb"
      case _          => this.month
    }
    this.month.hashCode
  }

  override def equals(that: Any): Boolean = {
    val month1 = this.month match {
      case "January"  => "Jan"
      case "Janu"     => "Jan"
      case "February" => "Feb"
      case "Febr"     => "Feb"
      case _          => this.month
    }
    val month2 = that.asInstanceOf[Month].month match {
      case "January"  => "Jan"
      case "Janu"     => "Jan"
      case "February" => "Feb"
      case "Febr"     => "Feb"
      case _          => that.asInstanceOf[Month].month
    }
    month1.equals(month2)
  }
}

Below is the comparator for the grouping key; it simply combines the single-column comparators.
case class Key(var year: Year, var month: Month) {
  override def hashCode(): Int =
    this.year.hashCode() + this.month.hashCode()

  override def equals(that: Any): Boolean =
    this.year.equals(that.asInstanceOf[Key].year) &&
      this.month.equals(that.asInstanceOf[Key].month)
}
case class Record(year: Int, month: String, counter: Int)

import spark.implicits._ // needed for .as[Record] and .toDS()

val df = spark.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("data.csv").as[Record]

df.rdd.groupBy[Key](
    (record: Record) => Key(Year(record.year), Month(record.month)))
  .map(x => Record(x._1.year.year, x._1.month.month, x._2.toList.map(_.counter).sum))
  .toDS().show()

This gives
+----+-----+-------+
|year|month|counter|
+----+-----+-------+
|2012| Feb| 800|
|2013| Feb| 500|
|2012| Jan| 700|
+----+-----+-------+
for this input in data.csv
Year,Month,Counter
2012,February,400
2012,Jan,100
12,January,200
12,Janu,300
2012,Feb,400
13,Febr,500
2012,Jan,100

Note that the case classes (Year and Month) also update the value to the standard value (otherwise which raw value ends up in the output is unpredictable).
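As a side note on efficiency: groupBy materialises every group before summing, while a reduceByKey-style aggregation keeps only one running total per key. The sketch below is my own plain-Scala illustration of that idea, using assumed normalisation helpers that mirror the case classes above; in Spark it would correspond to something like rdd.map(r => (Key(...), r.counter)).reduceByKey(_ + _).

```scala
// Assumed normalisation helpers mirroring the Year/Month case classes above.
def normYear(y: Int): Int = 2000 + y % 100
def normMonth(m: String): String = m match {
  case "January" | "Janu"  => "Jan"
  case "February" | "Febr" => "Feb"
  case other               => other
}

// Same rows as data.csv in the answer above.
val records = List((2012, "February", 400), (2012, "Jan", 100), (12, "January", 200),
                   (12, "Janu", 300), (2012, "Feb", 400), (13, "Febr", 500), (2012, "Jan", 100))

// foldLeft plays the role of reduceByKey: each counter is folded into a
// single running total per normalised key, no per-group list is built.
val totals = records.foldLeft(Map.empty[(Int, String), Int]) { case (acc, (y, m, c)) =>
  val k = (normYear(y), normMonth(m))
  acc.updated(k, acc.getOrElse(k, 0) + c)
}
// totals: (2012,Jan)->700, (2012,Feb)->800, (2013,Feb)->500
```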
Posted on 2019-03-13 18:06:51
You can implement the logic with UDFs that map the values to a standard year/month format.
import org.apache.spark.sql.functions.{udf, col}

def toYear: (Integer) => Integer = (year: Integer) => {
  2000 + year % 100 // assuming all years are in the 2000-2999 range
}

def toMonth: (String) => String = (month: String) => {
  month match {
    case "January"  => "Jan"
    case "Janu"     => "Jan"
    case "February" => "Feb"
    case "Febr"     => "Feb"
    case _          => month
  }
}

val toYearUdf = udf(toYear)
val toMonthUdf = udf(toMonth)

df.groupBy(toYearUdf(col("Year")), toMonthUdf(col("Month"))).sum("Counter").show()

https://stackoverflow.com/questions/55147029
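The two normalisation functions can be checked without a Spark session; this standalone sketch drops the udf wrappers and just exercises the plain functions:

```scala
// Plain-Scala versions of the normalisers from the UDF answer.
val toYear: Integer => Integer = (year: Integer) => 2000 + year % 100
val toMonth: String => String = {
  case "January"  => "Jan"
  case "Janu"     => "Jan"
  case "February" => "Feb"
  case "Febr"     => "Feb"
  case m          => m
}

// "12" and "2012" collapse to the same year; the month variants collapse
// to the same short names, so groupBy on the normalised values merges them.
assert(toYear(12) == toYear(2012))
assert(toMonth("January") == "Jan" && toMonth("Febr") == "Feb")
```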