I have a Parquet file with the following schema:
root
|-- listOfMetrics: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Action: string (nullable = true)
| | |-- id: long (nullable = true)
| | |-- date: date (nullable = true)
| | |-- female_executives: double (nullable = true)
| | |-- male_executives: double (nullable = true)
| | |-- female_directors: double (nullable = true)
| | |-- male_directors: double (nullable = true)
| | |-- female_executives_and_directors: double (nullable = true)
| | |-- male_executives_and_directors: double (nullable = true)
| | |-- flag: integer (nullable = true)
df.show() returns the following:
+----------------------------------+
| listOfMetrics |
+----------------------------------+
| [[ADD, 5394, 2...|
| [[ADD, 527, 20...|
| [[ADD, 714, 20...|
| [[ADD, 765, 20...|
| [[ADD, 996, 20...|
| [[ADD, 146, 20...|
| [[ADD, 947, 20...|
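+----------------------------------+
The Action column is my target. It can contain either DELETE or ADD, and I need to split the rows based on that value. My approach was to flatten the array with pyspark.sql, separate the rows, and then convert the result back to its original form, but the conversion step fails. I have the following problem:
I am new to this and find it hard to do.
Thanks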
Posted on 2021-05-15 15:02:47
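You can use filter to split each array into two parts: one containing only the elements with ADD, and the other containing only the elements with DELETE. After that, stack turns each part into a separate row. This keeps the original structure, so the data does not need to be flattened.
I am using a slightly simplified set of test data: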
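To make the semantics concrete, here is a minimal plain-Python sketch of what filter followed by stack does to each row. It does not use Spark; the function name split_by_action and the sample rows are hypothetical and only serve as illustration. Each input row (one array of structs) becomes one output row per action.

```python
from typing import Dict, List

# One "row" here stands for one listOfMetrics array of structs (illustrative model only).
Row = List[Dict[str, object]]


def split_by_action(rows: List[Row], actions=("ADD", "DELETE")) -> List[Row]:
    """Model of stack(n, filter(...), filter(...)): one output row per action per input row."""
    out: List[Row] = []
    for row in rows:
        for action in actions:
            # Corresponds to filter(listOfMetrics, e -> e.Action = action)
            out.append([e for e in row if e["Action"] == action])
    return out


# Hypothetical sample data, not taken from the question
rows = [
    [{"Action": "ADD", "id": 1}, {"Action": "DELETE", "id": 5}, {"Action": "ADD", "id": 9}],
    [{"Action": "ADD", "id": 11}],
]

result = split_by_action(rows)
for r in result:
    print(r)
```

In this model the second input row has no DELETE elements, so its DELETE part is an empty list. The same happens with stack: it emits a row for every part, so a row holding an empty array may need to be dropped afterwards if it is not wanted.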
root
|-- listOfMetrics: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Action: string (nullable = true)
| | |-- flag: long (nullable = true)
| | |-- id: long (nullable = true)
+----------------------------------------------------------------------------------+
|listOfMetrics |
+----------------------------------------------------------------------------------+
|[[ADD, 2, 1], [ADD, 4, 3], [DELETE, 6, 5], [DELETE, 8, 7], [ADD, 10, 9]] |
|[[ADD, 12, 11], [ADD, 14, 13], [DELETE, 16, 15], [DELETE, 18, 17], [ADD, 110, 19]]|
+----------------------------------------------------------------------------------+
Code:
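from pyspark.sql import functions as F

# Split each array into an ADD part and a DELETE part,
# then emit each part as its own row with stack
df.withColumn("listOfMetrics", F.expr("""stack(2,
    filter(listOfMetrics, e -> e.Action = 'ADD'),
    filter(listOfMetrics, e -> e.Action = 'DELETE'))""")) \
    .show(truncate=False)
Output: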
+----------------------------------------------+
|listOfMetrics |
+----------------------------------------------+
|[[ADD, 2, 1], [ADD, 4, 3], [ADD, 10, 9]] |
|[[DELETE, 6, 5], [DELETE, 8, 7]] |
|[[ADD, 12, 11], [ADD, 14, 13], [ADD, 110, 19]]|
|[[DELETE, 16, 15], [DELETE, 18, 17]] |
+----------------------------------------------+
https://stackoverflow.com/questions/67103853