
Building a pandas DataFrame from JSON

Stack Overflow user
Asked 2019-06-07 17:49:18
Answers: 2, Views: 115, Followers: 0, Votes: 0

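I am trying to create a DataFrame from a MongoDB collection dump.

I followed this question to normalize my data, but the result does not include the file name and the id.

I would like my DataFrame to contain the file name and the id.

Here is a sample of my JSON.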

[
    {'FileName': '32252652D.article.0018038745057751440210.tmp',
     '_id': {'$oid': '5ced0669acd01707cbf2ew33'},    
     'section_details': [{'content': 'Efficient Algorithms for Non-convex Isotonic '
                                     'Regression through Submodular Optimization  ',                                 
                          'heading': 'title'},
                         {'content': 'We consider the minimization of submodular  '
                                     'functions subject to ordering constraints. We show that '
                                     'this potentially non-convex optimization problem can  '
                                     'be cast as a convex optimization problem on a space of  '
                                     'uni-dimensional measures',
                          'heading': 'abstract'},
                         {'content': '', 'heading': 'subject'},
                         {'content': ' Introduction to convex optimization'
                                     'with mean ',
                          'heading': 'Content'}]},
    {'FileName': '32252652D.article.0018038745057751440210.tmp',
     '_id': {'$oid': '5ced0669acd01707cbf2ew11'},    
     'section_details': [{'content': 'Text-Adaptive Generative Adversarial Networks:  '
                                     'Manipulating Images with Natural Language ',
                          'heading': 'title'},
                         {'content': 'This paper addresses the problem of manipulating '
                                     'images using natural language description. Our  '
                                     'task aims to semantically modify visual  '
                                     'attributes of an object in an image according  '
                                     'to the text describing the new visual',
                          'heading': 'abstract'},
                         {'content': '', 'heading': 'subject'},
                         {'content': ' Introduction to Text-Adaptive Generative Adversarial Networks',
                          'heading': 'Content'}]}
]

Expected output

2 Answers

Stack Overflow user

Answered 2019-06-07 17:59:04

Please let me know if output like the following is what you want:

>>> import pandas as pd
>>> import json
>>> j = [
...     {'FileName': '32252652D.article.0018038745057751440210.tmp',
...      '_id': {'$oid': '5ced0669acd01707cbf2ew33'},
...      'section_details': [{'content': 'Efficient Algorithms for Non-convex Isotonic '
...                                      'Regression through Submodular Optimization  ',
...                           'heading': 'title'},
...                          {'content': 'We consider the minimization of submodular  '
...                                      'functions subject to ordering constraints. We show that '
...                                      'this potentially non-convex optimization problem can  '
...                                      'be cast as a convex optimization problem on a space of  '
...                                      'uni-dimensional measures',
...                           'heading': 'abstract'},
...                          {'content': '', 'heading': 'subject'},
...                          {'content': ' Introduction to convex optimization'
...                                      'with mean ',
...                           'heading': 'Content'}]},
...     {'FileName': '32252652D.article.0018038745057751440210.tmp',
...      '_id': {'$oid': '5ced0669acd01707cbf2ew11'},
...      'section_details': [{'content': 'Text-Adaptive Generative Adversarial Networks:  '
...                                      'Manipulating Images with Natural Language ',
...                           'heading': 'title'},
...                          {'content': 'This paper addresses the problem of manipulating '
...                                      'images using natural language description. Our  '
...                                      'task aims to semantically modify visual  '
...                                      'attributes of an object in an image according  '
...                                      'to the text describing the new visual',
...                           'heading': 'abstract'},
...                          {'content': '', 'heading': 'subject'},
...                          {'content': ' Introduction to Text-Adaptive Generative Adversarial Networks',
...                           'heading': 'Content'}]}
... ]
>>> pd.DataFrame(j)
                                       FileName                                   _id                                    section_details
0  32252652D.article.0018038745057751440210.tmp  {'$oid': '5ced0669acd01707cbf2ew33'}  [{'content': 'Efficient Algorithms for Non-con...
1  32252652D.article.0018038745057751440210.tmp  {'$oid': '5ced0669acd01707cbf2ew11'}  [{'content': 'Text-Adaptive Generative Adversa... 
Votes: 0

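Stack Overflow user

Answered 2019-06-07 20:33:12

You can pass the json_normalize function a list of metadata fields to add to each record.

Here, assuming js contains the data from the original JSON, you can use: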

import pandas as pd  # json_normalize is exposed as pd.json_normalize in pandas 1.0+
df = pd.json_normalize(js, 'section_details', ['FileName', '_id'])

You will get:

                                       FileName                                   _id                                            content   heading
0  32252652D.article.0018038745057751440210.tmp  {'$oid': '5ced0669acd01707cbf2ew33'}  Efficient Algorithms for Non-convex Isotonic R...     title
1  32252652D.article.0018038745057751440210.tmp  {'$oid': '5ced0669acd01707cbf2ew33'}  We consider the minimization of submodular  fu...  abstract
2  32252652D.article.0018038745057751440210.tmp  {'$oid': '5ced0669acd01707cbf2ew33'}                                                      subject
3  32252652D.article.0018038745057751440210.tmp  {'$oid': '5ced0669acd01707cbf2ew33'}      Introduction to convex optimizationwith mean    Content
4  32252652D.article.0018038745057751440210.tmp  {'$oid': '5ced0669acd01707cbf2ew11'}  Text-Adaptive Generative Adversarial Networks:...     title
5  32252652D.article.0018038745057751440210.tmp  {'$oid': '5ced0669acd01707cbf2ew11'}  This paper addresses the problem of manipulati...  abstract
6  32252652D.article.0018038745057751440210.tmp  {'$oid': '5ced0669acd01707cbf2ew11'}                                                      subject
7  32252652D.article.0018038745057751440210.tmp  {'$oid': '5ced0669acd01707cbf2ew11'}   Introduction to Text-Adaptive Generative Adve...   Content
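
To make the roles of the two extra arguments concrete, here is a minimal, self-contained sketch on a trimmed-down sample. The `sample` records and their short values are made up for illustration; only the key names come from the question. The second argument is the record path, which yields one row per element of `section_details`, and the third is the list of metadata fields copied onto every such row. The sketch assumes pandas 1.0 or later, where the function is available as `pd.json_normalize`.

```python
import pandas as pd

# Hypothetical, shortened records that mirror the structure in the question
sample = [
    {'FileName': 'a.tmp', '_id': {'$oid': 'id1'},
     'section_details': [{'content': 'T1', 'heading': 'title'},
                         {'content': 'A1', 'heading': 'abstract'}]},
    {'FileName': 'b.tmp', '_id': {'$oid': 'id2'},
     'section_details': [{'content': 'T2', 'heading': 'title'}]},
]

# record_path -> one row per section; meta -> fields repeated on each row
long_df = pd.json_normalize(sample, record_path='section_details',
                            meta=['FileName', '_id'])

print(long_df.shape)            # three sections in total, four columns
print(sorted(long_df.columns))
print(long_df['_id'].iloc[0])   # still a dict at this point
```

Note that the `_id` column still holds the whole `{'$oid': ...}` dict, which is why a further clean-up step is needed.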

After that, you still need to fix the _id column and pivot the DataFrame. In the end, you could finish like this:

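import pandas as pd

# extract one row per section, carrying FileName and _id along as metadata
df = pd.json_normalize(js, 'section_details', ['FileName', '_id'])

# fix the _id column: keep only the '$oid' string
df['_id'] = df['_id'].apply(lambda x: x['$oid'])

# pivot to get back the expected columns (pivot arguments are keyword-only in pandas 2.0+)
resul = df.groupby('FileName').apply(lambda x: x.pivot(
    index='_id', columns='heading', values='content')).reset_index().rename_axis('', axis=1)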
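
As a possible variation on the same idea (not the answerer's code), the nested `$oid` can be pulled out directly by giving `json_normalize` a nested meta path, and the reshaping can be done with `set_index` plus `unstack` instead of a per-group pivot. This is only a sketch: the `sample` data is hypothetical, and it assumes each (FileName, _id, heading) combination occurs at most once, since `unstack` raises on duplicates.

```python
import pandas as pd

# Hypothetical, shortened records that mirror the structure in the question
sample = [
    {'FileName': 'a.tmp', '_id': {'$oid': 'id1'},
     'section_details': [{'content': 'T1', 'heading': 'title'},
                         {'content': 'A1', 'heading': 'abstract'}]},
    {'FileName': 'b.tmp', '_id': {'$oid': 'id2'},
     'section_details': [{'content': 'T2', 'heading': 'title'}]},
]

# A nested meta path extracts '$oid' directly; pandas names the column '_id.$oid'
long_df = pd.json_normalize(sample, record_path='section_details',
                            meta=['FileName', ['_id', '$oid']])
long_df = long_df.rename(columns={'_id.$oid': '_id'})

# Move the 'heading' values into columns, one row per (FileName, _id)
wide_df = (long_df.set_index(['FileName', '_id', 'heading'])['content']
           .unstack('heading')
           .reset_index()
           .rename_axis(None, axis=1))

print(wide_df)
```

Records that lack a given heading simply get a missing value in that column.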

Alternatively, you can build the DataFrame rows by hand, directly from each record of the original JSON:

# build one row per record: file name, flattened id, then one column per heading
rows = [
    {'FileName': j['FileName'], '_id': j['_id']['$oid'],
     **{sd['heading']: sd['content'] for sd in j['section_details']}}
    for j in js
]
resul = pd.DataFrame(rows).reindex(
    columns=['FileName', '_id', 'title', 'abstract', 'subject', 'Content'])
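
For readers who find the one-expression form hard to follow, the same manual construction can be unrolled into an explicit loop. The sketch below runs on a small hypothetical `sample` (not the question's data) and shows that `reindex` both fixes the column order and fills headings missing from a record with NaN. If a heading were repeated within one record, the later value would overwrite the earlier one.

```python
import pandas as pd

# Hypothetical, shortened records that mirror the structure in the question
sample = [
    {'FileName': 'a.tmp', '_id': {'$oid': 'id1'},
     'section_details': [{'content': 'T1', 'heading': 'title'},
                         {'content': 'A1', 'heading': 'abstract'}]},
    {'FileName': 'b.tmp', '_id': {'$oid': 'id2'},
     'section_details': [{'content': 'T2', 'heading': 'title'}]},
]

rows = []
for rec in sample:
    # start each row with the record-level fields, flattening the id
    row = {'FileName': rec['FileName'], '_id': rec['_id']['$oid']}
    # add one column per section heading
    for sd in rec['section_details']:
        row[sd['heading']] = sd['content']
    rows.append(row)

wanted = ['FileName', '_id', 'title', 'abstract']
manual_df = pd.DataFrame(rows).reindex(columns=wanted)

print(manual_df)
```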
Votes: 0
The original content of this page is provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/56491974
