我有一个JSON,我把它转换成了一个字典,并试图用它制作一个数据格式。问题在于它是多个嵌套的,而且数据不一致。
例如:
d = """[
{
"id": 51,
"kits": [
{
"id": 57,
"kit": "KIT1182A",
"items": [
{
"id": 254,
"product": {
"name": "Plastic Pallet",
"short_code": "PP001",
"priceperunit": 2500,
"volumetric_weight": 21.34
},
"quantity": 5
},
{
"id": 258,
"product": {
"name": "Separator Sheet",
"short_code": "FSS001",
"priceperunit": 170,
"volumetric_weight": 0.9
},
"quantity": 18
}
],
"quantity": 5
}, #end of kit
{
"id": 58,
"kit": "KIT1182B",
"items": [
{
"id": 259,
"product": {
"name": "Plastic Pallet",
"short_code": "PP001",
"priceperunit": 2500,
"volumetric_weight": 21.34
},
"quantity": 5
},
{
"id": 260,
"product": {
"name": "Plastic Sidewall",
"short_code": "PS001",
"priceperunit": 1250,
"volumetric_weight": 16.1
},
"quantity": 5
},
{
"id": 261,
"product": {
"name": "Plastic Lid",
"short_code": "PL001",
"priceperunit": 1250,
"volumetric_weight": 9.7
},
"quantity": 5
}
],
"quantity": 7
} #end of kit
],
"warehouse": "Yantraksh Logistics Private limited_GGNPC1",
"receiver_client": "Lumax Cornaglia Auto Tech Private Limited",
"transport_by": "Kiran Roadways",
"transaction_type": "Return",
"transaction_date": "2020-08-13T04:34:11.678000Z",
"transaction_no": 1180,
"is_delivered": false,
"driver_name": "__________",
"driver_number": "__________",
"lr_number": 0,
"vehicle_number": "__________",
"freight_charges": 0,
"vehicle_type": "Part Load",
"remarks": "0",
"flow": 36,
"owner": 2
} ]"""我想把它转换成一个数据文件,如下所示:
transaction_no is_delivered flow transaction_date receiver_client warehouse kits quantity product1 quantity1 product2 quantity2 product3 quantity3
1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1 KIT1182A 5 PP001 5 FSS001 18 NaN NaN
1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1 KIT1182B 7 PP001 5 PS001 5 PL001 7.0或者以一种更好的方式表现出来:

我所做的:
data = json.loads(d)
result_dataframe = pd.DataFrame(data)
l = ['transaction_no', 'is_delivered','flow', 'transaction_date', 'receiver_client', 'warehouse','kits'] #fields that I need
result_dataframe = result_dataframe[l]
result_dataframe.to_csv("out.csv")我试过:
def flatten(input_dict, separator='_', prefix=''):
output_dict = {}
for key, value in input_dict.items():
if isinstance(value, dict) and value:
deeper = flatten(value, separator, prefix+key+separator)
output_dict.update({key2: val2 for key2, val2 in deeper.items()})
elif isinstance(value, list) and value:
for index, sublist in enumerate(value, start=1):
if isinstance(sublist, dict) and sublist:
deeper = flatten(sublist, separator, prefix+key+separator+str(index)+separator)
output_dict.update({key2: val2 for key2, val2 in deeper.items()})
else:
output_dict[prefix+key+separator+str(index)] = value
else:
output_dict[prefix+key] = value
return output_dict但是它给出了单个行中的所有值,我如何在工具包的基础上分离它们并得到结果?
发布于 2020-08-20 17:05:02
像上面这样的数据转换非常常见。熊猫提供了很多工具来帮助你完成这项任务。
data = json.loads(d)
df = pd.json_normalize(data, record_path=['kits'], meta= ['transaction_no', 'is_delivered','flow', 'transaction_date', 'receiver_client', 'warehouse']) # line 1
df = df.explode('items') # line 2
df[['product_code', 'product_quantity']] = df['items'].apply(lambda x: pd.Series([x['product']['short_code'], x['quantity']])) # line 3
df.drop(columns=['items']) # line 4会为你提供
id kit quantity transaction_no is_delivered flow transaction_date receiver_client warehouse product_code product_quantity
0 57 KIT1182A 5 1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1 PP001 5
0 57 KIT1182A 5 1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1 FSS001 18
1 58 KIT1182B 7 1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1 PP001 5
1 58 KIT1182B 7 1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1 PS001 5
1 58 KIT1182B 7 1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1 PL001 5 实际上,诀窍只存在于pd.json_normalize (第1行)中。它将创建一个与您所要求的数据非常相似的数据:
id kit items quantity transaction_no is_delivered flow transaction_date receiver_client warehouse
0 57 KIT1182A [{'id': 254, 'product': {'name': 'Plastic Pallet', 'short_code': 'PP001', 'priceperunit': 2500, 'volumetric_weight': 21.34}, 'quantity': 5}, {'id': 258, 'product': {'name': 'Separator Sheet', 'short_code': 'FSS001', 'priceperunit': 170, 'volumetric_weight': 0.9}, 'quantity': 18}] 5 1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1
1 58 KIT1182B [{'id': 259, 'product': {'name': 'Plastic Pallet', 'short_code': 'PP001', 'priceperunit': 2500, 'volumetric_weight': 21.34}, 'quantity': 5}, {'id': 260, 'product': {'name': 'Plastic Sidewall', 'short_code': 'PS001', 'priceperunit': 1250, 'volumetric_weight': 16.1}, 'quantity': 5}, {'id': 261, 'product': {'name': 'Plastic Lid', 'short_code': 'PL001', 'priceperunit': 1250, 'volumetric_weight': 9.7}, 'quantity': 5}] 7 1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1列items在字典中包含有关产品的所有数据。可以以类似于第3行的方式扩展它,但我建议强烈反对它。我稍后再解释原因。因此,第2行根据工具包中物品的数量对每一行进行爆破。第三行提取prod_name和prod_quantity,最后一行提取原始数据。
那么,为什么不应该有一个列数量可变的表呢?你永远不会知道你在每个工具包中有多少物品。您将篡改这些变量列的值。这比字典里有信息还要糟糕。
更新
要按您所要求的方式获得结果,只需运行以下命令:
data = json.loads(d)
df = pd.json_normalize(data, record_path=['kits'], meta= ['transaction_no', 'is_delivered','flow', 'transaction_date', 'receiver_client', 'warehouse'] )
tmp = df['items'].apply(lambda it: [{'product'+str(indx+1):x['product']['short_code'], 'quantity'+str(indx+1):x['quantity']} for indx,x in enumerate(it)])
tmp = tmp.apply(lambda x : {k:el[k] for el in x for k in el})
tmp = pd.DataFrame.from_records(tmp)
df = pd.concat([df, tmp], axis=1)
df = df.drop(columns=['items', 'id'])根据您在网上发布的数据,结果是:
kit quantity transaction_no is_delivered flow transaction_date receiver_client warehouse product1 quantity1 product2 quantity2 product3 quantity3 product4 quantity4 product5 quantity5
0 KIT1182A 5 1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1 PP001 5 PS002 5 PL001 5 FIN1182A 30 FSS001 18
1 KIT1182B 5 1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1 PP001 5 PS001 5 PL001 5 FIN1182B 20 FSS001 25
2 KIT1151 14 1179 False 1 2020-08-11T04:31:31.245000Z Mahindra & Mahindra_Kandivali Yantraksh Logistics Private limited_GGNPC1 PP001 14 PS001 14 PL001 14 FIN1151A 28 FSS001 42
3 KIT1151 15 1178 False 32 2020-08-10T04:30:12.022000Z Mahindra Vehicle Manufacturers Pune Yantraksh Logistics Private limited_GGNPC1 PP001 15 PS001 15 PL001 15 FIN1151A 29 FSS001 43 发布于 2020-08-20 23:19:23
使用json_normalize,您可以得到以下内容:
data = json.loads(d)
df = pd.json_normalize(data,
record_path=['kits', 'items'],
meta=[
['kits', 'kit'],
['id'],
['kits', 'quantity'],
['warehouse'],
['receiver_client']
],
meta_prefix='top')
print(df)
id quantity product.name ... topkits.quantity topwarehouse topreceiver_client
0 254 5 Plastic Pallet ... 5 Yantraksh Logistics Private limited_GGNPC1 Lumax Cornaglia Auto Tech Private Limited
1 258 18 Separator Sheet ... 5 Yantraksh Logistics Private limited_GGNPC1 Lumax Cornaglia Auto Tech Private Limited
2 259 5 Plastic Pallet ... 7 Yantraksh Logistics Private limited_GGNPC1 Lumax Cornaglia Auto Tech Private Limited
3 260 5 Plastic Sidewall ... 7 Yantraksh Logistics Private limited_GGNPC1 Lumax Cornaglia Auto Tech Private Limited
4 261 5 Plastic Lid ... 7 Yantraksh Logistics Private limited_GGNPC1 Lumax Cornaglia Auto Tech Private Limitedhttps://stackoverflow.com/questions/63389329
复制相似问题