我有一个包含200万个字段的表,它是在Spark中注册的。
这张桌子是这样的:
CUSTADDRESSID ADDRESSTYPE ADDRESSLINE1 ADDRESSLINE2 ADDRESSLINE3 CITY STATE COUNTRY ZIP1 ISACTIVE ISCOMMUNICATION CREATEDDATE CREATEDUSER UPDATEDDATE UPDATEDUSER REASONCODE ZIP2 C_ACCNO CUSTOMERID ACCOUNTGROUPID PREPAIDACCOUNTSTATUSID PREPAIDACCOUNTSTATUSDATE SOURCEOFENTRY REVENUECATEGORYID VEHICLENUMBER VEHICLECLASS SERIALNO HEXTAGID TAGSTATUS TAGSTARTEFFDATE TAGENDEFFDATE ISTAGBLACKLISTED ISBLACKLISTHOLD RCVERIFICATIONSTATUS EMAILADDRESS PHONENUMBER CCreatedDate CCreatedUser CUpdatedDate CUpdatedUser HISTID ACTION ISFEEWAIVER FEEWAIVERPASSTYPE VEHICLEIMGVERIFICATIONSTATUS TAGTID ISREVENUERECHARGE RowNumber
41 Mailing B309 PROGRESSIVE SIGNATURE SECTOR-6 GHANSOLI NAVI MUMBAI MH IND 400701 1 1 2013-06-07 12:55:54.827 bhagwadapos 2013-06-07 12:55:54.827 bhagwadapos NULL NULL 10003014 20000001 15 3079 2015-09-16 14:58:27.500 RegularRetailer 75 MH43AJ411 4 206158433290 91890704803000000C0A TAGINACTIVE 2014-08-08 14:24:12.227 2039-08-08 23:59:59.000 1 0 NULL shankarn75@rediffmail.com 9004419178 2013-06-07 12:56:16.650 bhagwadapos 2015-09-16 14:58:33.190 BatchProcess 15250 UPDATE NULL NULL NULL NULL NULL 1我想转换成JSON,JSON文件应该是这样的,很抱歉,我已经手动设计了:
ACCOUNTNO : 10003018
ADDRESS : Array
0 : Object
VEHICLE : Array
0 : Object我已经编写了Spark查询,但是我无法在VEHICLE & ADDRESS下创建ACCOUNTNO下的两个数组
以下是查询:
val query2 = "SELECT C_ACCNO AS ACCOUNTNO, collect_set(struct(VEHICLENUMBER, CUSTOMERID,ACCOUNTGROUPID,PREPAIDACCOUNTSTATUSID,PREPAIDACCOUNTSTATUSDATE,SOURCEOFENTRY,REVENUECATEGORYID,VEHICLECLASS,SERIALNO,HEXTAGID,TAGSTATUS,TAGSTARTEFFDATE,TAGENDEFFDATE,ISTAGBLACKLISTED,ISBLACKLISTHOLD,RCVERIFICATIONSTATUS,EMAILADDRESS,PHONENUMBER,CREATEDDATE,CREATEDUSER,UPDATEDDATE,UPDATEDUSER,ISFEEWAIVER,FEEWAIVERPASSTYPE,VEHICLEIMGVERIFICATIONSTATUS,TAGTID,ISREVENUERECHARGE)) as VEHICLE FROM joined_acc_add GROUP BY ACCOUNTNO ORDER BY ACCOUNTNO"在此之后:
val res01 = sqlContext.sql(query2.toString)
res01.coalesce(1).write.json("D:/result01")我需要帮助才能在查询中发现我的错误。此查询正在抛出错误。
发布于 2018-08-08 21:20:24
我已经做过了。在一个对象下创建多个数组的查询应该如下所示。
val query1 = "SELECT PKcolumn1, collect_set(struct(column2, column3)), collect_set(struct(column4, column5)), collect_set(struct(column6, column7)) GROUP BY PKcolumn1 ORDER BY PKcolumn1"https://stackoverflow.com/questions/51722329
复制相似问题