I want to use my own equality comparators with DataFrame operations.
Suppose I want to perform an operation like this:

df.groupBy("Year", "Month").sum("Counter")

on this DataFrame:
Year | Month   | Counter
------------------------
2012 | Jan     | 100
12   | January | 200
12   | Janu    | 300
2012 | Feb     | 400
13   | Febr    | 500

I have to perform two comparisons:

(1) On column Year: e.g. "2012" == "12"
(2) On column Month: e.g. "Jan" == "January" == "Janu"
Suppose I have already implemented these two comparators. How can I invoke them? As in this example, I already know that I have to convert the DataFrame to an RDD in order to use the comparators.

I thought about using RDD's groupBy.

Note that I really need to use comparators for this task. I cannot use UDFs, change the data, or create new columns. The longer-term idea is to have ciphertext columns, together with functions that let me check whether two ciphertexts encrypt the same value; I want to use those functions in my comparators.
Edit:
At the moment I only want to do this with a single column, e.g.:

df.groupBy("Year").sum("Counter")

I have a wrapper class:
class ExampleWrapperYear(val year: Any) extends Serializable {
  // override hashCode and equals methods
}

Then I do the following:

val rdd = df.rdd.keyBy(a => new ExampleWrapperYear(a(0))).groupByKey()

My problem here is how to do the "sum", and how to use keyBy with multiple columns so I can use both ExampleWrapperYear and ExampleWrapperMonth.
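To illustrate how the "sum" and a multi-column key can work, here is a minimal plain-Scala sketch (no Spark, so it runs standalone). The wrapper classes and their normalisation rules are hypothetical stand-ins for ExampleWrapperYear/ExampleWrapperMonth; in Spark the same shape would be rdd.keyBy(...).groupByKey() followed by a mapValues that sums the counters.

```scala
// Hypothetical wrappers -- my own stand-ins, not the asker's actual classes.
class YearKey(val year: String) extends Serializable {
  private def norm: String = if (year.length == 2) "20" + year else year
  override def hashCode(): Int = norm.hashCode
  override def equals(that: Any): Boolean = that match {
    case other: YearKey => norm == other.norm
    case _              => false
  }
}

class MonthKey(val month: String) extends Serializable {
  private def norm: String = month.take(3) // "Jan", "January", "Janu" -> "Jan"
  override def hashCode(): Int = norm.hashCode
  override def equals(that: Any): Boolean = that match {
    case other: MonthKey => norm == other.norm
    case _               => false
  }
}

// A case class combines the two wrappers into one grouping key; its generated
// equals/hashCode delegate to the overridden ones above.
case class RowKey(y: YearKey, m: MonthKey)

val rows = List(("2012", "Jan", 100), ("12", "January", 200),
                ("12", "Janu", 300), ("2012", "Feb", 400), ("13", "Febr", 500))

// keyBy + groupByKey + "sum", expressed with stdlib collections:
val sums: Map[RowKey, Int] =
  rows.groupBy(r => RowKey(new YearKey(r._1), new MonthKey(r._2)))
      .map { case (k, grp) => k -> grp.map(_._3).sum }

// sums has three entries: (2012,Jan)->600, (2012,Feb)->400, (2013,Feb)->500
```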
Posted on 2019-03-15 19:28:47
This solution should work. Here are the case classes with hashCode and equals overridden (we can call them the comparators).

You can modify/update hashCode and equals according to your ciphertexts.
case class Year(var year: Int) {
  override def hashCode(): Int = {
    // normalises the field to the standard 4-digit value as a side effect
    this.year = this.year match {
      case 2012 => 2012
      case 12   => 2012
      case 13   => 2013
      case _    => this.year
    }
    this.year.hashCode()
  }

  override def equals(that: Any): Boolean = {
    val year1 = 2000 + that.asInstanceOf[Year].year % 100
    val year2 = 2000 + this.year % 100
    year1 == year2
  }
}
case class Month(var month: String) {
  override def hashCode(): Int = {
    // normalises the field to the standard short name as a side effect
    this.month = this.month match {
      case "January"  => "Jan"
      case "Janu"     => "Jan"
      case "February" => "Feb"
      case "Febr"     => "Feb"
      case _          => this.month
    }
    this.month.hashCode
  }

  override def equals(that: Any): Boolean = {
    val month1 = this.month match {
      case "January"  => "Jan"
      case "Janu"     => "Jan"
      case "February" => "Feb"
      case "Febr"     => "Feb"
      case _          => this.month
    }
    val month2 = that.asInstanceOf[Month].month match {
      case "January"  => "Jan"
      case "Janu"     => "Jan"
      case "February" => "Feb"
      case "Febr"     => "Feb"
      case _          => that.asInstanceOf[Month].month
    }
    month1.equals(month2)
  }
}

Below is the comparator for the grouping key; it simply combines the single-column comparators.
case class Key(var year: Year, var month: Month) {
  override def hashCode(): Int =
    this.year.hashCode() + this.month.hashCode()

  override def equals(that: Any): Boolean =
    this.year.equals(that.asInstanceOf[Key].year) &&
      this.month.equals(that.asInstanceOf[Key].month)
}
case class Record(year: Int, month: String, counter: Int)

import spark.implicits._ // needed for .as[Record] and .toDS()

val df = spark.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("data.csv").as[Record]

df.rdd.groupBy[Key](
    (record: Record) => Key(Year(record.year), Month(record.month)))
  .map(x => Record(x._1.year.year, x._1.month.month, x._2.toList.map(_.counter).sum))
  .toDS().show()

This gives
+----+-----+-------+
|year|month|counter|
+----+-----+-------+
|2012| Feb| 800|
|2013| Feb| 500|
|2012| Jan| 700|
+----+-----+-------+
for this input in data.csv
Year,Month,Counter
2012,February,400
2012,Jan,100
12,January,200
12,Janu,300
2012,Feb,400
13,Febr,500
2012,Jan,100

Note that the case classes (Year and Month) also update the value to the standard value (otherwise which raw value ends up in the output is unpredictable).
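As a side note on efficiency: groupBy materialises every group before summing, while a reduceByKey-style aggregation keeps only one running total per key. The sketch below is my own plain-Scala illustration of that idea, using assumed normalisation helpers that mirror the case classes above; in Spark it would correspond to something like rdd.map(r => (Key(...), r.counter)).reduceByKey(_ + _).

```scala
// Assumed normalisation helpers mirroring the Year/Month case classes above.
def normYear(y: Int): Int = 2000 + y % 100
def normMonth(m: String): String = m match {
  case "January" | "Janu"  => "Jan"
  case "February" | "Febr" => "Feb"
  case other               => other
}

// Same rows as data.csv in the answer above.
val records = List((2012, "February", 400), (2012, "Jan", 100), (12, "January", 200),
                   (12, "Janu", 300), (2012, "Feb", 400), (13, "Febr", 500), (2012, "Jan", 100))

// foldLeft plays the role of reduceByKey: each counter is folded into a
// single running total per normalised key, no per-group list is built.
val totals = records.foldLeft(Map.empty[(Int, String), Int]) { case (acc, (y, m, c)) =>
  val k = (normYear(y), normMonth(m))
  acc.updated(k, acc.getOrElse(k, 0) + c)
}
// totals: (2012,Jan)->700, (2012,Feb)->800, (2013,Feb)->500
```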
Posted on 2019-03-13 18:06:51
You can implement the logic with UDFs that map the values to a standard year/month format.
import org.apache.spark.sql.functions.{udf, col}

def toYear: (Integer) => Integer = (year: Integer) => {
  2000 + year % 100 // assuming all years are in the 2000-2999 range
}

def toMonth: (String) => String = (month: String) => {
  month match {
    case "January"  => "Jan"
    case "Janu"     => "Jan"
    case "February" => "Feb"
    case "Febr"     => "Feb"
    case _          => month
  }
}

val toYearUdf = udf(toYear)
val toMonthUdf = udf(toMonth)

df.groupBy(toYearUdf(col("Year")), toMonthUdf(col("Month"))).sum("Counter").show()

https://stackoverflow.com/questions/55147029
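The two normalisation functions can be checked without a Spark session; this standalone sketch drops the udf wrappers and just exercises the plain functions:

```scala
// Plain-Scala versions of the normalisers from the UDF answer.
val toYear: Integer => Integer = (year: Integer) => 2000 + year % 100
val toMonth: String => String = {
  case "January"  => "Jan"
  case "Janu"     => "Jan"
  case "February" => "Feb"
  case "Febr"     => "Feb"
  case m          => m
}

// "12" and "2012" collapse to the same year; the month variants collapse
// to the same short names, so groupBy on the normalised values merges them.
assert(toYear(12) == toYear(2012))
assert(toMonth("January") == "Jan" && toMonth("Febr") == "Feb")
```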