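Q: Converting two CSV tables with a one-to-many relationship into JSON with an embedded list of subdocuments

Stack Overflow user

Asked on 2020-12-13 11:21:55

2 answers, 106 views, 0 followers, 0 votes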

I have two CSV files with a one-to-many relationship between them.

main.csv

"main_id","name"
"1","foobar"

attributes.csv

"id","main_id","name","value","updated_at"
"100","1","color","red","2020-10-10"
"101","1","shape","square","2020-10-10"
"102","1","size","small","2020-10-10"

I want to convert them into JSON with this structure:

[
  {
    "main_id": "1",
    "name": "foobar",
    "attributes": [
      {
        "id": "100",
        "name": "color",
        "value": "red",
        "updated_at": "2020-10-10"
      },
      {
        "id": "101",
        "name": "shape",
        "value": "square",
        "updated_at": "2020-10-10"
      },
      {
        "id": "102",
        "name": "size",
        "value": "small",
        "updated_at": "2020-10-10"
      }
    ]
  }
]

I tried using Python and Pandas, like this:

import pandas

def transform_group(group):
    group.reset_index(inplace=True)
    group.drop('main_id', axis='columns', inplace=True)
    return group.to_dict(orient='records')

main = pandas.read_csv('main.csv')
attributes = pandas.read_csv('attributes.csv', index_col=0)

attributes = attributes.groupby('main_id').apply(transform_group)
attributes.name = "attributes"

main = main.merge(
    right=attributes,
    on='main_id',
    how='left',
    validate='m:1',
    copy=False,
)

main.to_json('out.json', orient='records', indent=2)

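It works. The problem is that it does not seem to scale. When I run it on my whole dataset, I can load the individual CSV files without problems, but memory usage explodes when I try to reshape the data before calling to_json.

So, is there a more efficient way to do this transformation? Maybe I am missing some Pandas feature? Or is there another library I could use? Also, the use of apply here seems to be quite slow.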

python

json

pandas

csv

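2 Answers

Stack Overflow user

Answered on 2020-12-13 12:11:11

This is a tricky problem, and we have all felt your pain.

There are three ways I would attack this problem. First, groupby is slower if you let pandas do the splitting into groups.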

import pandas as pd
import numpy as np
from collections import defaultdict

df = pd.DataFrame({'id': np.random.randint(0, 100, 5000), 
                   'name': np.random.randint(0, 100, 5000)})

Now, if you use the standard groupby

groups = []
for k, rows in df.groupby('id'):
    groups.append(rows)

you will find that

groups = defaultdict(lambda: [])
for id, name in df.values:
    groups[id].append((id, name))

is about 3 times faster.
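Applied to the data in the question, the same single-pass grouping might look like the sketch below. It assumes the column layout of `attributes.csv` shown in the question and that the grouped result still fits in memory, so it addresses the speed of grouping rather than the memory problem. The helper names `group_attributes` and `embed` are hypothetical, and the standard-library `csv` module is swapped in for pandas.

```python
import csv
import io
from collections import defaultdict


def group_attributes(attr_file):
    """Group attribute rows by main_id in a single pass (hypothetical helper)."""
    groups = defaultdict(list)
    for row in csv.DictReader(attr_file):
        # Remove the foreign key from the child record, as in the target JSON.
        main_id = row.pop("main_id")
        groups[main_id].append(row)
    return groups


def embed(main_file, groups):
    """Attach the grouped attributes to each main row (hypothetical helper)."""
    result = []
    for row in csv.DictReader(main_file):
        # .get() avoids creating empty entries in the defaultdict.
        row["attributes"] = groups.get(row["main_id"], [])
        result.append(row)
    return result


# Sample data from the question; real code would pass open file handles.
main_csv = io.StringIO('"main_id","name"\n"1","foobar"\n')
attr_csv = io.StringIO(
    '"id","main_id","name","value","updated_at"\n'
    '"100","1","color","red","2020-10-10"\n'
    '"101","1","shape","square","2020-10-10"\n'
    '"102","1","size","small","2020-10-10"\n'
)

documents = embed(main_csv, group_attributes(attr_csv))
```

All values stay strings here, which matches the quoted values in the target JSON of the question.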

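The second approach is that I would change it to use Dask and Dask's parallelization. For a discussion of Dask, see "what is dask and how is it different from pandas".

The third is algorithmic. Load the main file, then go ID by ID, loading only the data for that ID, making multiple passes over what is in memory and what is on disk, and save each partial result as soon as it becomes available.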
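One way to read the third suggestion is as a streaming merge. The sketch below is an illustration under stated assumptions, not the answerer's code: it assumes both CSV files are already sorted by `main_id` in the same order, that every attribute row refers to a `main_id` present in the main file, and that one JSON document per output line is acceptable. The name `stream_documents` is hypothetical, and the `"2","baz"` row is made up to show a main row without attributes.

```python
import csv
import io
import json
from itertools import groupby


def stream_documents(main_file, attr_file, out_file):
    """Write one JSON document per main row, holding one group in memory at a time.

    Assumes both inputs are sorted by main_id in the same order.
    """
    attr_groups = groupby(csv.DictReader(attr_file), key=lambda r: r["main_id"])
    current = next(attr_groups, None)  # (main_id, iterator of rows) or None

    for row in csv.DictReader(main_file):
        row["attributes"] = []
        if current is not None and current[0] == row["main_id"]:
            for attr in current[1]:
                attr.pop("main_id")  # drop the foreign key from the child
                row["attributes"].append(attr)
            # Only advance after the current group is fully consumed.
            current = next(attr_groups, None)
        out_file.write(json.dumps(row) + "\n")


# Sample data from the question, plus one made-up main row without attributes.
main_csv = io.StringIO('"main_id","name"\n"1","foobar"\n"2","baz"\n')
attr_csv = io.StringIO(
    '"id","main_id","name","value","updated_at"\n'
    '"100","1","color","red","2020-10-10"\n'
    '"101","1","shape","square","2020-10-10"\n'
    '"102","1","size","small","2020-10-10"\n'
)
out = io.StringIO()
stream_documents(main_csv, attr_csv, out)
lines = out.getvalue().splitlines()
```

Because each document is written as soon as its group is complete, peak memory depends on the largest single group rather than on the whole dataset.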

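Votes: 1

Stack Overflow user

Answered on 2020-12-18 02:17:28

So, in my case, I could load the original tables into memory, but the embedding made the size explode so that it no longer fit into memory. I ended up still using Pandas to load the CSV files, but then I iterate row by row, generate each document, and save each row as a separate JSON. This means I never hold one large data structure for one big JSON in memory.

Another important realization was that the related column should be the index, and that the index must be sorted so that lookups on it are fast (because the related column usually contains duplicate entries).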
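A small illustration of the lookup behaviour on such an index, as a sketch only: the first three rows mirror the question, the `main_id` 2 row is made up, and any speed-up from sorting will depend on the data.

```python
import pandas

# Toy rows: the first three mirror the question; the main_id 2 row is made up.
attributes = (
    pandas.DataFrame(
        {
            "id": [100, 101, 102, 200],
            "main_id": [1, 1, 1, 2],
            "name": ["color", "shape", "size", "color"],
            "value": ["red", "square", "small", "blue"],
        }
    )
    .set_index("main_id")
    .sort_index(kind="mergesort")  # stable sort keeps row order within a label
)

single = attributes.loc[2]         # one match: collapses to a Series
as_frame = attributes.loc[[2], :]  # list-wrapped label: stays a DataFrame
many = attributes.loc[[1], :]      # three matches

records = as_frame.to_dict(orient="records")
```

The list-wrapped form gives a uniform result type whether a label matches one row or many, which is convenient when the rows are turned into a list of records.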

I created the following two helper functions:

import pandas


def get_related_dict(related_table, label):
    assert related_table.index.is_unique

    if pandas.isna(label):
        return None

    row = related_table.loc[label]

    assert isinstance(row, pandas.Series), label

    result = row.to_dict()
    result[related_table.index.name] = label

    return result


def get_related_list(related_table, label):
    # A sorted index makes selecting non-unique labels faster.
    assert related_table.index.is_monotonic_increasing

    try:
        # Wrapping the label in a list always returns a DataFrame, never a Series, even when only one row matches.
        return related_table.loc[[label], :].to_dict(orient='records')
    except KeyError:
        return []

Then I did:

import json
import sys

main = pandas.read_csv('main.csv', index_col=0)
attributes = pandas.read_csv('attributes.csv', index_col=1)

# Sort the index so that selecting non-unique labels is faster. A stable sort keeps the original row order within each label.
attributes.sort_index(inplace=True, kind='mergesort')

columns = [main.index.name] + list(main.columns)
for row in main.itertuples(index=True, name=None):
    assert len(columns) == len(row)
    data = dict(zip(columns, row))

    data['attributes'] = get_related_list(attributes, data['main_id'])

    json.dump(data, sys.stdout, indent=2)
    sys.stdout.write("\n")

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/65272050
