I am trying to get nested values out of the JSON data below.
{
"region_id": 60763,
"phone": "",
"address": {
"region": "NY",
"street-address": "147 West 43rd Street",
"postal-code": "10036",
"locality": "New York City"
},
"id": 113317,
"name": "Casablanca Hotel Times Square"
}
{
"region_id": 32655,
"phone": "",
"address": {
"region": "CA",
"street-address": "300 S Doheny Dr",
"postal-code": "90048",
"locality": "Los Angeles"
},
"id": 76049,
"name": "Four Seasons Hotel Los Angeles at Beverly Hills"
}
I just loaded the data above into a pandas DataFrame like this:
import json
import pandas as pd
with open("file path") as f:
    df = pd.DataFrame(json.loads(line) for line in f)
Now my DataFrame looks like this:
address phone
0 {u'region': u'NY', u'street-address': u'147 We...
1 {u'region': u'CA', u'street-address': u'300 S ...
id name region_id
0 113317 Casablanca Hotel Times Square 60763
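1 76049 Four Seasons Hotel Los Angeles at Beverly Hills 32655
I can get a subset of the columns with data = df[['id', 'name']].
But I am not sure how to get the values of region and street-address along with id and name. My output DataFrame should have id, name, region, street-address.
Note: I tried popping the nested address column and concatenating it back onto my DataFrame. But because my data is huge (348 MM), the column-wise concatenation (axis=1) takes a lot of memory.
Also, I am looking for an efficient way to handle this: would NumPy help, since it uses C extensions directly, or should I write the data to a database such as MongoDB? I am considering this because, after reducing this data, I need to join it with another dataset on the id column to get a few other fields.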
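To make that last step concrete, here is a minimal sketch of the join on id, assuming the flattened frame already has the four target columns. The second frame `other` and its `stars` column are hypothetical stand-ins for the other dataset, which is not shown in the question.

```python
import pandas as pd

# Flattened result with the four target columns (values from the sample above).
flat = pd.DataFrame({
    "id": [113317, 76049],
    "name": ["Casablanca Hotel Times Square",
             "Four Seasons Hotel Los Angeles at Beverly Hills"],
    "region": ["NY", "CA"],
    "street-address": ["147 West 43rd Street", "300 S Doheny Dr"],
})

# Hypothetical second dataset keyed on the same id column.
other = pd.DataFrame({"id": [76049, 113317], "stars": [5, 4]})

# A left join keeps every row of `flat` and pulls in the extra field.
joined = flat.merge(other, on="id", how="left")
print(joined)
```

Keeping only the needed columns before the merge keeps the joined frame small, which matters at the data size described above.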
Posted on 2016-02-20 12:54:42
Native pandas solution: json_normalize()
Corrected and working version:
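import ujson
import pandas as pd
# In pandas >= 1.0 this function is also exposed as pd.json_normalize
from pandas.io.json import json_normalize
# Print wide frames in one block instead of wrapping the columns
pd.set_option('display.expand_frame_repr', False)
with open('aaa') as f:
    data = ujson.load(f)
# Flatten nested keys into dotted column names, keep four columns,
# then strip the 'address.' prefix
df = json_normalize(data)[
    ['id', 'name', 'address.region', 'address.street-address']
].rename(columns={'address.region': 'region',
                  'address.street-address': 'street-address'})
print(df)
Output: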
id name region street-address
0 113317 Casablanca Hotel Times Square NY 147 West 43rd Street
1 76049 Four Seasons Hotel Los Angeles at Beverly Hills CA 300 S Doheny Dr
Version that does not work (as Cleb pointed out):
import ujson
import pandas as pd
from pandas.io.json import json_normalize
with open('data.json') as f:
    data = ujson.load(f)
df = json_normalize(data, 'address', ['region', 'street-address'])
pd.set_option('display.expand_frame_repr', False)
print(df)
Alternatively, you can use ujson (UltraJSON) to build a list of dicts and then create a DataFrame from it:
import ujson
import pandas as pd
data_list = []
with open('data.json') as f:
    # one JSON object per line
    for line in f:
        d = ujson.loads(line)
        # keep only the four fields that are needed
        data_list.append(
            {"id": d["id"],
             "name": d["name"],
             "region": d["address"]["region"],
             "street-address": d["address"]["street-address"]
             }
        )
df = pd.DataFrame(data_list)
pd.set_option('display.expand_frame_repr', False)
print(df)
I don't know which solution is more efficient/faster.
If possible, call pd.DataFrame/pd.read_json only once; otherwise it will be much slower.
Output:
id name region street-address
0 113317 Casablanca Hotel Times Square NY 147 West 43rd Street
1 76049 Four Seasons Hotel Los Angeles at Beverly Hills CA 300 S Doheny Dr
Posted on 2016-02-20 12:42:02
The following works (although I added a more efficient solution below; just scroll down to the EDIT):
import pandas as pd
# read the updated json file
df = pd.read_json('data.json')
# convert column with the nested json structure
tempdf = pd.concat([pd.DataFrame.from_dict(item, orient='index').T for item in df.address])
# get rid of the converted column (axis passed by keyword; the positional form fails on newer pandas)
df.drop('address', axis=1, inplace=True)
# prepare concat
tempdf.index = df.index
# merge the two dataframes back together
df = pd.concat([df, tempdf], axis=1)
Output:
id name phone region_id \
0 113317 Casablanca Hotel Times Square 60763
1 76049 Four Seasons Hotel Los Angeles at Beverly Hills 32655
region street-address postal-code locality
0 NY 147 West 43rd Street 10036 New York City
1 CA 300 S Doheny Dr 90048 Los Angeles
Now you can use the drop command to get rid of the columns you do not need.
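As a sketch of that last step, assuming `df` has the columns shown in the output above, the unwanted columns can be passed to `drop` by name. The two-row frame is rebuilt by hand here only to keep the example self-contained.

```python
import pandas as pd

# Stand-in for the merged frame from the code above (sample values only).
df = pd.DataFrame({
    "id": [113317, 76049],
    "name": ["Casablanca Hotel Times Square",
             "Four Seasons Hotel Los Angeles at Beverly Hills"],
    "phone": ["", ""],
    "region_id": [60763, 32655],
    "region": ["NY", "CA"],
    "street-address": ["147 West 43rd Street", "300 S Doheny Dr"],
    "postal-code": ["10036", "90048"],
    "locality": ["New York City", "Los Angeles"],
})

# Drop the columns that are not needed; `columns=` avoids the axis argument.
df = df.drop(columns=["phone", "region_id", "postal-code", "locality"])
print(df)
```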
I modified your JSON file, because it was actually invalid; you can check it, for example, on JSONLint:
[{
"region_id": 60763,
"phone": "",
"address": {
"region": "NY",
"street-address": "147 West 43rd Street",
"postal-code": "10036",
"locality": "New York City"
},
"id": 113317,
"name": "Casablanca Hotel Times Square"
}, {
"region_id": 32655,
"phone": "",
"address": {
"region": "CA",
"street-address": "300 S Doheny Dr",
"postal-code": "90048",
"locality": "Los Angeles"
},
"id": 76049,
"name": "Four Seasons Hotel Los Angeles at Beverly Hills"
}]
EDIT
Based on @MaxU's answer (which did not work for me), you can also do the following:
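import pandas as pd
import ujson
# In pandas >= 1.0 this function is also exposed as pd.json_normalize
from pandas.io.json import json_normalize
# this is the json file from above
with open('data.json') as f:
    data = ujson.load(f)
Now, as @MaxU suggested, you can use json_normalize to get rid of the nested structure:
df3 = json_normalize(data)
which gives you: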
address.locality address.postal-code address.region address.street-address id name phone region_id
0 New York City 10036 NY 147 West 43rd Street 113317 Casablanca Hotel Times Square 60763
1 Los Angeles 90048 CA 300 S Doheny Dr 76049 Four Seasons Hotel Los Angeles at Beverly Hills 32655
You can rename the columns you want to keep as follows:
df3.rename(columns={'address.region': 'region', 'address.street-address': 'street-address'}, inplace=True)
and then select the columns you want to keep:
df3 = df3[['id', 'name', 'region', 'street-address']]
which gives the desired output:
id name region street-address
0 113317 Casablanca Hotel Times Square NY 147 West 43rd Street
1 76049 Four Seasons Hotel Los Angeles at Beverly Hills CA 300 S Doheny Dr
Posted on 2016-02-20 12:33:30
A small helper function can do this:
import json
import pandas as pd
def get_entries(line):
    # parse one line of the file and keep only the wanted fields
    data = json.loads(line)
    res = {k: data[k] for k in ['id', 'name']}
    res.update({k: data['address'][k] for k in ['region', 'street-address']})
    return res
with open("file path") as f:
    df = pd.DataFrame(get_entries(line) for line in f)
Output:
id name region \
0 113317 Casablanca Hotel Times Square NY
1 76049 Four Seasons Hotel Los Angeles at Beverly Hills CA
street-address
0 147 West 43rd Street
1 300 S Doheny Dr
Or, nicer looking:
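One way to get an unwrapped display is the `display.expand_frame_repr` option used in the answer above; whether that is what was meant here is an assumption. A minimal sketch with the two sample rows, using `pd.option_context` so the setting only applies inside the block:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [113317, 76049],
    "name": ["Casablanca Hotel Times Square",
             "Four Seasons Hotel Los Angeles at Beverly Hills"],
    "region": ["NY", "CA"],
    "street-address": ["147 West 43rd Street", "300 S Doheny Dr"],
})

# Temporarily stop pandas from wrapping wide frames into several blocks
# and from hiding columns to fit the console width.
with pd.option_context("display.expand_frame_repr", False,
                       "display.max_columns", None):
    text = repr(df)
print(text)
```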

https://stackoverflow.com/questions/35522653