首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >火花阵列保序

火花阵列保序
EN

Stack Overflow用户
提问于 2022-04-19 11:52:21
回答 1查看 99关注 0票数 0

我有这样一个结构,一个发票表和一个发票线表。我希望将这些行输出为强制模式中的JSON有序数组,按行号排序,但在模式中没有行号(假定数组中隐含行号)。据我所知,一旦创建好了这个数组,吡咯烷酮和json都将保持数组顺序。请看下面的粗略例子。如何确保发票行保持行号顺序。我可以使用列表理解,但这意味着退出火花,我认为这将是无效的。

代码语言:javascript
复制
from pyspark.sql.functions import collect_list, struct

invColumns = StructType([
    StructField("invoiceNo",StringType(),True),
    StructField("invoiceStuff",StringType(),True) 
  ])
invData = [("1", "stuff"), ("2", "other stuff"), ("3", "more stuff")]

invLines = StructType([
    StructField("lineNo",IntegerType(),True),
    StructField("invoiceNo",StringType(),True),
    StructField("detail",StringType(),True),
    StructField("quantity",IntegerType(),True)
  ])

lineData = [(1,"1","item stuff",3),(2,"1","new item stuff",2),(3,"1","old item stuff",5),(1,"2","item stuff",3),(1,"3","item stuff",3),(2,"3","more item stuff",7)]

invoice_df = spark.createDataFrame(data=invData,schema=invColumns)
#in reality read from a spark table


invLine_df = spark.createDataFrame(data=lineData,schema=invLines)
#in reality read from a spark table

invoicesTemp_df = (invoice_df.select('invoiceNo',
                                     'invoiceStuff')
                             .join(invLine_df.select('lineNo',
                                                     'InvoiceNo',
                                                     'detail', 
                                                     'quantity'
                                                     ),
                                                      on='invoiceNo'))

invoicesOut_df = (invoicesTemp_df.withColumn('invoiceLines',struct('detail','quantity'))
                                .groupBy('invoiceNo','invoiceStuff').agg(collect_list('invoiceLines').alias('invoiceLines'))
                                .select('invoiceNo',
                                        'invoiceStuff',
                                        'invoiceLines'
                                                    ))

display(invoicesOut_df)


3 -- more stuff  -- array -- 0: -- {"detail": "item stuff", "quantity": 3}
                          -- 1: -- {"detail": "more item stuff", "quantity": 7}
1 -- stuff       -- array -- 0: -- {"detail": "new item stuff", "quantity": 2}
                          -- 1: -- {"detail": "old item stuff", "quantity": 5}
                          -- 2: -- {"detail": "item stuff", "quantity": 3}
2 -- other stuff -- array -- 0: -- {"detail": "item stuff", "quantity": 3}

根据请求,以下是输入数据

代码语言:javascript
复制
Invoice Table
"InvoiceNo", "InvoiceStuff",
"1","stuff",
"2","other stuff",
"3","more stuff"

Invoice Lines Table
"LineNo","InvoiceNo","Detail","Quantity",
1,"1","item stuff",3,
2,"1","new item stuff",2,
3,"1","old item stuff",5,
1,"2","item stuff",3,
1,"3","item stuff",3,
2,"3","more item stuff",7

一个输出应该是这样的,但是数组应该是按照发票行表中的行号排序的,即使它不在输出中。

代码语言:javascript
复制
Output
"1","stuff","[{"detail": "item stuff", "quantity": 3},{"detail": "new item stuff", "quantity": 2},{"detail": "old item stuff", "quantity": 5}]",
"2","other stuff","[{"detail": "item stuff", "quantity": 3}]"
"3","more stuff","[{"detail": "item stuff", "quantity": 3},{"detail": "more item stuff", "quantity": 7}]"
EN

回答 1

Stack Overflow用户

回答已采纳

发布于 2022-04-19 15:21:08

collect_list 不尊重数据的顺序

注意,函数是不确定的,因为收集结果的顺序取决于行的顺序,而行的顺序在洗牌后可能是不确定的。

一种可能的方法是应用带有窗口函数的collect_list,在那里您可以控制订单。

代码语言:javascript
复制
from pyspark.sql import functions as F
from pyspark.sql import Window as W

(invoice_df
    .join(invLine_df, on='invoiceNo')
    .withColumn('invoiceLines', F.struct('lineNo', 'detail','quantity'))
    .withColumn('a', F.collect_list('invoiceLines').over(W.partitionBy('invoiceNo').orderBy('lineNo')))
    .groupBy('invoiceNo')
    .agg(F.max('a').alias('invoiceLines'))
    .show(10, False)
)

+---------+--------------------------------------------------------------------+
|invoiceNo|invoiceLines                                                        |
+---------+--------------------------------------------------------------------+
|1        |[{1, item stuff, 3}, {2, new item stuff, 2}, {3, old item stuff, 5}]|
|2        |[{1, item stuff, 3}]                                                |
|3        |[{1, item stuff, 3}, {2, more item stuff, 7}]                       |
+---------+--------------------------------------------------------------------+
票数 1
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/71924512

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档