I have a dataset that looks like this:
(34521658, 0001-01-01, 2500-01-01, 2, A, Y, 15, P, A, 4776, 4776, 4776, {(11, P, A, 4776, 4766, 4776), (12, P, A, 4776, 4766, 4776), (13, P, A, 4776, 4766, 4776)})

Now I want to unnest it so that it looks like this:

(34521658, 0001-01-01, 2500-01-01, 2, A, Y, 15, P, A, 4776, 4776, 4776, 11, P, A, 4776, 4766, 4776)
(34521658, 0001-01-01, 2500-01-01, 2, A, Y, 15, P, A, 4776, 4776, 4776, 12, P, A, 4776, 4766, 4776)
(34521658, 0001-01-01, 2500-01-01, 2, A, Y, 15, P, A, 4776, 4776, 4776, 13, P, A, 4776, 4766, 4776)

How can I do this in PySpark?
Posted on 2017-08-19 00:19:12

As suggested in the comments, you can use either flatMap or the explode SQL function (explode, as its name implies, expands an array or map column into multiple rows). To keep the approach simple, I will keep only the relevant columns. Assuming the first column is an id and the column to be exploded is named bag, the initial dataset looks like this:
+--------+--------------------+
| id| bag|
+--------+--------------------+
|34521658|[[11,P,A,4776,476...|
+--------+--------------------+

The schema of the dataset is:
scala> df.printSchema
root
|-- id: integer (nullable = true)
|-- bag: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: integer (nullable = true)
| | |-- _2: string (nullable = true)
| | |-- _3: string (nullable = true)
| | |-- _4: integer (nullable = true)
| | |-- _5: integer (nullable = true)
 |    |    |-- _6: integer (nullable = true)

Note that the bag column is an array of struct elements. We can apply the explode function to this column like so:
df.withColumn("bag", explode($"bag"))

The resulting Dataset/DataFrame is:
+--------+--------------------+
| id| bag|
+--------+--------------------+
|34521658|[11,P,A,4776,4766...|
|34521658|[12,P,A,4776,4766...|
|34521658|[13,P,A,4776,4766...|
+--------+--------------------+

Hope this helps.
https://stackoverflow.com/questions/45758959