我有一个从Firebase导出的json文件,如下所示。
{
"reports" : {
"Google-Pixel 2 XL" : {
"-MIoCtD9YUF2G9Esfrfz" : {
"message" : "04 Oct 2020 23:25:17:047 onCreate MainActivity",
"timestamp" : 1601825117067
},
"-MIoCtFVOxu8wdEHtm6q" : {
"message" : "04 Oct 2020 23:25:17:214 onCreate Service",
"timestamp" : 1601825117216
},
"-MIoCyBtKMQqQzUHEXsW" : {
"message" : "04 Oct 2020 23:25:37:682 onStartCommand Service",
"timestamp" : 1601825137685
},
"-MIoFWll9r3qwzWNoGMn" : {
"message" : "04 Oct 2020 23:36:47:687: (1.3212517, 103.860314)",
"timestamp" : 1601825807693
}
},
"Vivo 1820" : {
"-MIoF14JUm6JMZrOzDlL" : {
"message" : "04 Oct 2020 23:34:37:623 onCreate MainActivity",
"timestamp" : 1601825677653
},
"-MIoF1A9ZZNqTu5W-rQD" : {
"message" : "04 Oct 2020 23:34:38:016 onCreate Service",
"timestamp" : 1601825678026
},
"-MIoF2gNDua9FfLBTg6q" : {
"message" : "04 Oct 2020 23:34:44:235 onCreate MainActivity",
"timestamp" : 1601825684248
}
}
}
}我正在尝试将其转换为如下所示的4列数据格式
model | id | message | timestamp
Google-Pixel 2 XL | -MIoCtD9YUF2G9Esfrfz | 04 Oct 2020 23:25:17:047 onCreate... | 1601825117067
Vivo 1820 | -MIoF14JUm6JMZrOzDlL | 04 Oct 2020 23:34:37:623 onCreate... | 1601825677653我该怎么做?我尝试了各种方法,并使之正常化,但似乎无法理解。
data = pd.read_json("firebase-file.json")
df = pd.json_normalize(data, record_path="reports")谢谢。
发布于 2020-10-05 04:49:58
我认为你需要自己把它压平,好的是它并不复杂:
s = [[k, i, *j.values()] for k,v in data["reports"].items() for i, j in v.items()]
print (pd.DataFrame(s))
0 1 2 3
0 Google-Pixel 2 XL -MIoCtD9YUF2G9Esfrfz 04 Oct 2020 23:25:17:047 onCreate MainActivity 1601825117067
1 Google-Pixel 2 XL -MIoCtFVOxu8wdEHtm6q 04 Oct 2020 23:25:17:214 onCreate Service 1601825117216
2 Google-Pixel 2 XL -MIoCyBtKMQqQzUHEXsW 04 Oct 2020 23:25:37:682 onStartCommand Service 1601825137685
3 Google-Pixel 2 XL -MIoFWll9r3qwzWNoGMn 04 Oct 2020 23:36:47:687: (1.3212517, 103.860314) 1601825807693
4 Vivo 1820 -MIoF14JUm6JMZrOzDlL 04 Oct 2020 23:34:37:623 onCreate MainActivity 1601825677653
5 Vivo 1820 -MIoF1A9ZZNqTu5W-rQD 04 Oct 2020 23:34:38:016 onCreate Service 1601825678026
6 Vivo 1820 -MIoF2gNDua9FfLBTg6q 04 Oct 2020 23:34:44:235 onCreate MainActivity 1601825684248发布于 2020-10-05 03:29:11
试试这个(见上面我的评论)
import pandas as pd
data = []
for k, v in test['reports'].items():
model_name = k
for model in v.items():
_data = {}
_data['model'] = model_name
_data['id'] = model[0]
_data['message'] = model[1]['message']
_data['timestamp'] = model[1]['timestamp']
data.append(_data)
df = pd.DataFrame(data)其中test是您的数据,因此test['reports']访问要解析的嵌套信息。
发布于 2020-10-05 03:29:52
根据官方文件 of pd.json_normalize(),它假设一个数组(列表)输入。然而,最初的json与dicts列表相去甚远,最重要的是,键"id“并不存在。因此,我认为绝对需要一个手工的解析器。
码
import pandas as pd
import json
file_path = "/mnt/ramdisk/in.json"
with open(file_path) as f:
dic = json.load(f)
# discard the redundant "report" layer
dic = dic["reports"]
# produce a flattened list of dict
ls = []
for k1, v1 in dic.items():
# k1 = model
for k2, v2 in v1.items():
# k2 = the hash-like id
v2["model"] = k1
v2["id"] = k2
ls.append(v2)
df = pd.json_normalize(ls)输出
# Trim the message for printing purpose
df2 = df.copy()
df2["message"] = df["message"].apply(lambda s: s[:10])
df2
Out[28]:
message timestamp model id
0 04 Oct 202 1601825117067 Google-Pixel 2 XL -MIoCtD9YUF2G9Esfrfz
1 04 Oct 202 1601825117216 Google-Pixel 2 XL -MIoCtFVOxu8wdEHtm6q
2 04 Oct 202 1601825137685 Google-Pixel 2 XL -MIoCyBtKMQqQzUHEXsW
3 04 Oct 202 1601825807693 Google-Pixel 2 XL -MIoFWll9r3qwzWNoGMn
4 04 Oct 202 1601825677653 Vivo 1820 -MIoF14JUm6JMZrOzDlL
5 04 Oct 202 1601825678026 Vivo 1820 -MIoF1A9ZZNqTu5W-rQD
6 04 Oct 202 1601825684248 Vivo 1820 -MIoF2gNDua9FfLBTg6q注意:深入到类似散列的id所定位的层似乎是必要的。这是因为id最初是keys,但似乎必须将它们重新格式化为values,才能被pd.json_normalize正确解释为值。我在互联网上的简单调查也没有发现使用一种简单的内置方法来解析这样一个递归结构的例子。
https://stackoverflow.com/questions/64201919
复制相似问题