DataFrame a:

SN  Hash_id  Name
111 11ww11   Airtel
222 null     Idea

DataFrame b:

SN  Hash_id  Name
333 null     BSNL
444 22ee11   Vodafone

I perform a UnionAll on these DataFrames by column name, as follows:
def unionByName(a: DataFrame, b: DataFrame): DataFrame = {
  val columns = a.columns.toSet.intersect(b.columns.toSet).map(col).toSeq
  a.select(columns: _*).unionAll(b.select(columns: _*))
}

The result is DataFrame c:
SN  Hash_id  Name
111 11ww11   Airtel
222 null     Idea
333 null     BSNL
444 22ee11   Vodafone

Then I filter DataFrame c:
val withHashDF = c.where(c("Hash_id").isNotNull)
val withoutHashDF = c.where(c("Hash_id").isNull)

The withHashDF result only contains the row from DataFrame a:

111 11ww11 Airtel

Where a hash id is present, the record from DataFrame b is missing:

444 22ee11 Vodafone

The withoutHashDF result is:

222  null Idea
BSNL 333  null
null 222  Idea

In this DataFrame the column values are no longer aligned by column name, and the count is 3 when it should only be 2; the row from DataFrame a is duplicated.
Posted on 2017-02-24 13:55:51
Look at unionByName: a small change is needed in how columns is obtained.

Change from:

val columns = a.columns.toSet.intersect(b.columns.toSet).map(col).toSeq

to:

val columns = a.columns.intersect(b.columns).map(row => new Column(row)).toSeq

Then it should work as expected.
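The root cause is ordering: union/unionAll matches columns by position, and converting the column arrays to a Set discards their original order, so a.select(...) and b.select(...) can end up with columns in different positions. Array.intersect, by contrast, preserves the order of the left-hand operand. A plain-Scala sketch of the difference (the extra column names are illustrative, not from the question):

```scala
object ColumnOrderDemo {
  // Column lists with more than four entries so the HashSet's
  // iteration order is unlikely to match insertion order.
  val aCols = Array("SN", "Hash_id", "Name", "Region", "Plan", "Owner")
  val bCols = Array("Owner", "Plan", "Region", "Name", "Hash_id", "SN")

  // The original approach: Set intersection. The result contains the
  // right names, but their order is an implementation detail of the Set.
  val viaSet: Seq[String] = aCols.toSet.intersect(bCols.toSet).toSeq

  // The fixed approach: Array.intersect keeps the order of aCols,
  // so both select(...) calls produce the same column layout.
  val viaArray: Seq[String] = aCols.intersect(bCols).toSeq

  def main(args: Array[String]): Unit = {
    println(s"viaSet   = $viaSet")
    println(s"viaArray = $viaArray")
  }
}
```

Note that since Spark 2.3, Dataset.unionByName is built in and resolves columns by name, so on newer versions this helper is unnecessary.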
See the complete code snippet and results below:
import sparkSession.sqlContext.implicits._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Column
val dataFrameA = Seq(("111", "11ww11", "Airtel"),("222", null, "Idea")).toDF("SN","Hash_id", "Name")
val dataFrameB = Seq(("333", null, "BSNL"),("444", "22ee11", "Vodafone")).toDF("SN","Hash_id", "Name")
def unionByName(a: DataFrame, b: DataFrame): DataFrame = {
val columns = a.columns.intersect(b.columns).map(row => new Column(row)).toSeq
a.select(columns: _*).union(b.select(columns: _*))
}
val dataFrameC = unionByName(dataFrameA, dataFrameB)
val withHashDF = dataFrameC.where(dataFrameC("Hash_id").isNotNull)
val withoutHashDF = dataFrameC.where(dataFrameC("Hash_id").isNull)
println("dataFrameC")
dataFrameC.show()
println("withHashDF")
withHashDF.show
println("withoutHashDF")
withoutHashDF.show

Output:
dataFrameC
+---+-------+--------+
| SN|Hash_id| Name|
+---+-------+--------+
|111| 11ww11| Airtel|
|222| null| Idea|
|333| null| BSNL|
|444| 22ee11|Vodafone|
+---+-------+--------+
withHashDF
+---+-------+--------+
| SN|Hash_id| Name|
+---+-------+--------+
|111| 11ww11| Airtel|
|444| 22ee11|Vodafone|
+---+-------+--------+
withoutHashDF
+---+-------+----+
| SN|Hash_id|Name|
+---+-------+----+
|222| null|Idea|
|333| null|BSNL|
+---+-------+----+

Posted on 2017-02-27 22:25:49
If the DataFrame produced by unionAll contains duplicates, a filter or where clause can give unexpected results. Once you eliminate the duplicates with the distinct method, the results match expectations.
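The effect of distinct can be illustrated on plain Scala collections; Spark's DataFrame.distinct deduplicates whole rows in the same way (the tuples below mirror the question's data, with a duplicate added for illustration):

```scala
object DistinctDemo {
  // Rows as (SN, Hash_id, Name) tuples; the repeated "222" row stands in
  // for the duplicated DataFrame-a row observed in the question.
  val rows = Seq(
    ("222", Option.empty[String], "Idea"),
    ("333", Option.empty[String], "BSNL"),
    ("222", Option.empty[String], "Idea")
  )

  // distinct removes exact duplicate rows, keeping the first occurrence.
  val deduped = rows.distinct

  def main(args: Array[String]): Unit = {
    deduped.foreach(println)
  }
}
```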
https://stackoverflow.com/questions/42431324