首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >我如何使用熊猫爆炸,交叉表和数量的数量发生在大型数据集?(7毫米行)

我如何使用熊猫爆炸,交叉表和数量的数量发生在大型数据集?(7毫米行)
EN

Stack Overflow用户
提问于 2022-03-19 10:13:16
回答 1查看 233关注 0票数 1

我有一个由7百万行和2列组成的大型数据集。但是,第二列包含不同案例的列表。用例的数量会很大(在运行了我的算法后,我找到了10000个第一行的27k+案例)。

这是我所拥有的和我想要的结果的一个样本代表:

我的初始数据格式:

代码语言:javascript
复制
df = pd.DataFrame(columns=["id", "listElements"])
df = df.append([{"id": 1, "listElements": ["apple","peer", "[apple, peer]", "banana", "chocolate", "[chocolate, apple]"]},
{"id": 2, "listElements": ["ginger","peer", "[ginger, sugar]", "tofu", "[tofu, veggie]", "chocolate", ]},
{"id": 3, "listElements": ["steak","beef", "[beef, potatoes]", "banana", ]}]

)

print(df)
#  id                                       listElements
#0  1  [apple, peer, [apple, peer], banana, chocolate...
#1  2  [ginger, peer, [ginger, sugar], tofu, [tofu, v...
#2  3            [steak, beef, [beef, potatoes], banana]

我的最终目标:对每个元素(或一组元素)进行,获取发生的次数和发生的id。

我现在所做的:爆炸了第二列,然后使用交叉表如下所示:

代码语言:javascript
复制
df2 = df['listElements'].explode()
df = df[['id',]].join(pd.crosstab(df2.index, df2, colnames=['listElements']))
print(df)

#which gives me:
#  id  [apple, peer]  [beef, potatoes]  [chocolate, apple]  [ginger, sugar]  ...  chocolate  ginger  peer  steak  tofu     
#0  1              1                 0                   1                0  ...          1       0     1      0     0     
#1  2              0                 0                   0                1  ...          1       1     1      0     1     
#2  3              0                 1                   0                0  ...          0       0     0      1     0

然后,我正在考虑聚合结果,以获得每种类型元素的计数,并为以后的研究保留id。

我的问题:数据大约有七百万行,我怀疑大约有10万种元素!

我很肯定我的计算机会用这样的数据集耗尽内存,或者需要很长时间来处理!

2问题:

  1. 是否有更直接、更快的方法来产生我的结果(也许我正在做一些不必要的步骤?)
  2. 如何避免内存或速度问题?我可以通过把元素转换成数字来加速算法吗?还是在加入结果之前分批运行?

任何见解都是非常欢迎的!如果有什么不清楚,请不要犹豫,请向我要更多的信息!

EN

回答 1

Stack Overflow用户

回答已采纳

发布于 2022-03-23 16:59:22

考虑使用长的,而不是广泛的代表性。下面是一个基于凸工具的示例:

代码语言:javascript
复制
from convtools import conversion as c


input_data = [
    { "id": 1, "listElements": [ "apple", "peer", "[apple, peer]", "banana", "chocolate", "[chocolate, apple]", ], },
    { "id": 2, "listElements": [ "ginger", "peer", "[ginger, sugar]", "tofu", "[tofu, veggie]", "chocolate", ], },
    { "id": 3, "listElements": [ "steak", "beef", "[beef, potatoes]", "banana", ], },
]

# generated ad hoc converter function; run on startup and reuse further
converter = (
    c.iter(
        c.zip(
            c.repeat(c.item("id")),
            c.item("listElements"),
        )
    )
    .flatten()
    .pipe(
        c.group_by(c.item(1)).aggregate(
            {
                "ingredient": c.item(1),
                "ids": c.ReduceFuncs.Array(c.item(0)),
                "count": c.ReduceFuncs.Count(),
            }
        )
    )
    .gen_converter()
)

result = converter(input_data)

assert result == [
    {"ingredient": "apple", "ids": [1], "count": 1},
    {"ingredient": "peer", "ids": [1, 2], "count": 2},
    {"ingredient": "[apple, peer]", "ids": [1], "count": 1},
    {"ingredient": "banana", "ids": [1, 3], "count": 2},
    {"ingredient": "chocolate", "ids": [1, 2], "count": 2},
    {"ingredient": "[chocolate, apple]", "ids": [1], "count": 1},
    {"ingredient": "ginger", "ids": [2], "count": 1},
    {"ingredient": "[ginger, sugar]", "ids": [2], "count": 1},
    {"ingredient": "tofu", "ids": [2], "count": 1},
    {"ingredient": "[tofu, veggie]", "ids": [2], "count": 1},
    {"ingredient": "steak", "ids": [3], "count": 1},
    {"ingredient": "beef", "ids": [3], "count": 1},
    {"ingredient": "[beef, potatoes]", "ids": [3], "count": 1},
]

此外,获得一个数据集以方便地按成分查询它也是有意义的:

代码语言:javascript
复制
converter = (
    c.iter(
        c.zip(
            c.repeat(c.item("id")),
            c.item("listElements"),
        )
    )
    .flatten()
    .pipe(
        c.group_by(c.item(1)).aggregate(
            (
                c.item(1),
                c.ReduceFuncs.Array(c.item(0)),
            )
        )
    )
    .as_type(dict)
    .gen_converter()
)

result = converter(input_data)
assert result == {
    "apple": [1],
    "peer": [1, 2],
    "[apple, peer]": [1],
    "banana": [1, 3],
    "chocolate": [1, 2],
    "[chocolate, apple]": [1],
    "ginger": [2],
    "[ginger, sugar]": [2],
    "tofu": [2],
    "[tofu, veggie]": [2],
    "steak": [3],
    "beef": [3],
    "[beef, potatoes]": [3],
}
票数 1
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/71537255

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档