我有一个类似于DataFrame的below.The列值是格式- 1_3_del_8_3,它基本上是由"_ del_“分隔的两个值。这里我们将得到两个部分- 1_3和8_3。我们可以在这里略去第二部分。我正在寻找获得第一部分不是_的最后一列值的最佳方法。
另外,我在DF中有50+列。
样本DataFrame
+-----------+-----------+-----------+
| c1| c2| c3|
+-----------+-----------+-----------+
|1_3_del_8_3|2_3_del_6_3|0_0_del_8_3|
|2_9_del_4_3|0_0_del_4_3|2_5_del_4_3|
|2_8_del_4_3|0_0_del_4_3|0_0_del_4_3|
|5_3_del_4_3|2_3_del_4_3|7_3_del_4_3|
+-----------+-----------+-----------+预期结果
+-----------+-----------+-----------+-----------+
| c1| c2| c3|C4Out |
+-----------+-----------+-----------+-----------+
|1_3_del_8_3|2_3_del_6_3|0_0_del_8_3| 2_3 |
|2_9_del_4_3|0_0_del_4_3|2_5_del_4_3| 2_5 |
|2_8_del_4_3|0_0_del_4_3|0_0_del_4_3| 2_8 |
|5_3_del_4_3|2_3_del_4_3|7_3_del_4_3| 7_3 |
+-----------+-----------+-----------+-----------+我的选拔-
val df1=Seq(
("1_3_del_8_3","2_3_del_6_3","0_0_del_8_3"),
("2_9_del_4_3","0_0_del_4_3","2_5_del_4_3"),
("2_8_del_4_3","0_0_del_4_3","0_0_del_4_3"),
("5_3_del_4_3","2_3_del_4_3","7_3_del_4_3")
)toDF("c1","c2","c3")然后我试着根据分隔符分割列,然后我被困在如何处理的问题上,我尝试了很多搜索,但在这里得到了一个小小的问题。
发布于 2020-05-02 11:35:45
检查下面的代码,以获得相同的结果在动态。
Logic
Step 1:将所有列的第一个值与空格合并,但包含"0_0“的列除外。
例如,以第一行- |1_3_del_8_3|2_3_del_6_3|0__del_8_3|合并除0_0以外的所有列的第一个值和空格1_3 2_3。
使用空格- Step 2:拆分值- Array(1_3,2_3)
Step 3:从数组中获取最后一个值。- 2_3是第一行的值。
对于剩余的行,请遵循相同的方式。请检查下面的代码。
scala> val df = Seq(("1_3_del_8_3","2_3_del_6_3","0_0_del_8_3"),("2_9_del_4_3","0_0_del_4_3","2_5_del_4_3"),("2_8_del_4_3","0_0_del_4_3","0_0_del_4_3"),("5_3_del_4_3","2_3_del_4_3","7_3_del_4_3"))toDF("c1","c2","c3") // Creating DF.
df: org.apache.spark.sql.DataFrame = [c1: string, c2: string ... 1 more field]
scala> val exprs = columns.map(c => when(substring(col(s"${c}"),1,3) =!= "0_0",substring(col(s"${c}"),1,3))) // Creating expressions
exprs: Array[org.apache.spark.sql.Column] = Array(CASE WHEN (NOT (substring(c1, 1, 3) = 0_0)) THEN substring(c1, 1, 3) END, CASE WHEN (NOT (substring(c2, 1, 3) = 0_0)) THEN substring(c2, 1, 3) END, CASE WHEN (NOT (substring(c3, 1, 3) = 0_0)) THEN substring(c3, 1, 3) END)
scala> val array = (split(concat_ws(" ",exprs:_*)," ")) // Extracting Arrays
array: org.apache.spark.sql.Column = split(concat_ws( , CASE WHEN (NOT (substring(c1, 1, 3) = 0_0)) THEN substring(c1, 1, 3) END, CASE WHEN (NOT (substring(c2, 1, 3) = 0_0)) THEN substring(c2, 1, 3) END, CASE WHEN (NOT (substring(c3, 1, 3) = 0_0)) THEN substring(c3, 1, 3) END), )
scala> val element = array((size(array)-1).cast("int")) // Extracting matched element
element: org.apache.spark.sql.Column = split(concat_ws( , CASE WHEN (NOT (substring(c1, 1, 3) = 0_0)) THEN substring(c1, 1, 3) END, CASE WHEN (NOT (substring(c2, 1, 3) = 0_0)) THEN substring(c2, 1, 3) END, CASE WHEN (NOT (substring(c3, 1, 3) = 0_0)) THEN substring(c3, 1, 3) END), )[CAST((size(split(concat_ws( , CASE WHEN (NOT (substring(c1, 1, 3) = 0_0)) THEN substring(c1, 1, 3) END, CASE WHEN (NOT (substring(c2, 1, 3) = 0_0)) THEN substring(c2, 1, 3) END, CASE WHEN (NOT (substring(c3, 1, 3) = 0_0)) THEN substring(c3, 1, 3) END), )) - 1) AS INT)]
scala> spark.time{ df.withColumn("c4",element).show} // showing final result.
+-----------+-----------+-----------+---+
| c1| c2| c3| c4|
+-----------+-----------+-----------+---+
|1_3_del_8_3|2_3_del_6_3|0_0_del_8_3|2_3|
|2_9_del_4_3|0_0_del_4_3|2_5_del_4_3|2_5|
|2_8_del_4_3|0_0_del_4_3|0_0_del_4_3|2_8|
|5_3_del_4_3|2_3_del_4_3|7_3_del_4_3|7_3|
+-----------+-----------+-----------+---+
Time taken: 20 ms发布于 2020-05-02 09:21:04
您可以使用when()来有条件地检查用例:
when(Condition,value_if_condition_true).otherwise(value_if_condition_false)
根据您的用例,我们可以从右侧开始检查。如果c3满足条件,则将选择来自c3的值,否则它将签入c2,否则c1为null。
scala> df.show()
+-----------+-----------+-----------+
| c1| c2| c3|
+-----------+-----------+-----------+
|1_3_del_8_3|2_3_del_6_3|0_0_del_8_3|
|2_9_del_4_3|0_0_del_4_3|2_5_del_4_3|
|2_8_del_4_3|0_0_del_4_3|0_0_del_4_3|
|5_3_del_4_3|2_3_del_4_3|7_3_del_4_3|
+-----------+-----------+-----------+
scala> df.withColumn("C4Out",
when(substring($"c3",1,3)!=="0_0",substring($"c3",9,11))
.when(substring($"c2",1,3)!=="0_0",substring($"c2",9,11))
.when(substring($"c1",1,3)!=="0_0",substring($"c1",9,11))
.otherwise(null)).show()
+-----------+-----------+-----------+-----------+
| c1| c2| c3|C4Out |
+-----------+-----------+-----------+-----------+
|1_3_del_8_3|2_3_del_6_3|0_0_del_8_3| 2_3 |
|2_9_del_4_3|0_0_del_4_3|2_5_del_4_3| 2_5 |
|2_8_del_4_3|0_0_del_4_3|0_0_del_4_3| 2_8 |
|5_3_del_4_3|2_3_del_4_3|7_3_del_4_3| 7_3 |
+-----------+-----------+-----------+-----------+https://stackoverflow.com/questions/61556443
复制相似问题