我有一个由7百万行和2列组成的大型数据集。但是,第二列包含不同案例的列表。用例的数量会很大(在运行了我的算法后,我找到了10000个第一行的27k+案例)。
这是我所拥有的和我想要的结果的一个样本代表:
我的初始数据格式:
df = pd.DataFrame(columns=["id", "listElements"])
df = df.append([{"id": 1, "listElements": ["apple","peer", "[apple, peer]", "banana", "chocolate", "[chocolate, apple]"]},
{"id": 2, "listElements": ["ginger","peer", "[ginger, sugar]", "tofu", "[tofu, veggie]", "chocolate", ]},
{"id": 3, "listElements": ["steak","beef", "[beef, potatoes]", "banana", ]}]
)
print(df)
# id listElements
#0 1 [apple, peer, [apple, peer], banana, chocolate...
#1 2 [ginger, peer, [ginger, sugar], tofu, [tofu, v...
#2 3 [steak, beef, [beef, potatoes], banana]我的最终目标:对每个元素(或一组元素)进行,获取发生的次数和发生的id。
我现在所做的:爆炸了第二列,然后使用交叉表如下所示:
df2 = df['listElements'].explode()
df = df[['id',]].join(pd.crosstab(df2.index, df2, colnames=['listElements']))
print(df)
#which gives me:
# id [apple, peer] [beef, potatoes] [chocolate, apple] [ginger, sugar] ... chocolate ginger peer steak tofu
#0 1 1 0 1 0 ... 1 0 1 0 0
#1 2 0 0 0 1 ... 1 1 1 0 1
#2 3 0 1 0 0 ... 0 0 0 1 0然后,我正在考虑聚合结果,以获得每种类型元素的计数,并为以后的研究保留id。
我的问题:数据大约有七百万行,我怀疑大约有10万种元素!
我很肯定我的计算机会用这样的数据集耗尽内存,或者需要很长时间来处理!
2问题:
任何见解都是非常欢迎的!如果有什么不清楚,请不要犹豫,请向我要更多的信息!
发布于 2022-03-23 16:59:22
考虑使用长的,而不是广泛的代表性。下面是一个基于凸工具的示例:
from convtools import conversion as c
input_data = [
{ "id": 1, "listElements": [ "apple", "peer", "[apple, peer]", "banana", "chocolate", "[chocolate, apple]", ], },
{ "id": 2, "listElements": [ "ginger", "peer", "[ginger, sugar]", "tofu", "[tofu, veggie]", "chocolate", ], },
{ "id": 3, "listElements": [ "steak", "beef", "[beef, potatoes]", "banana", ], },
]
# generated ad hoc converter function; run on startup and reuse further
converter = (
c.iter(
c.zip(
c.repeat(c.item("id")),
c.item("listElements"),
)
)
.flatten()
.pipe(
c.group_by(c.item(1)).aggregate(
{
"ingredient": c.item(1),
"ids": c.ReduceFuncs.Array(c.item(0)),
"count": c.ReduceFuncs.Count(),
}
)
)
.gen_converter()
)
result = converter(input_data)
assert result == [
{"ingredient": "apple", "ids": [1], "count": 1},
{"ingredient": "peer", "ids": [1, 2], "count": 2},
{"ingredient": "[apple, peer]", "ids": [1], "count": 1},
{"ingredient": "banana", "ids": [1, 3], "count": 2},
{"ingredient": "chocolate", "ids": [1, 2], "count": 2},
{"ingredient": "[chocolate, apple]", "ids": [1], "count": 1},
{"ingredient": "ginger", "ids": [2], "count": 1},
{"ingredient": "[ginger, sugar]", "ids": [2], "count": 1},
{"ingredient": "tofu", "ids": [2], "count": 1},
{"ingredient": "[tofu, veggie]", "ids": [2], "count": 1},
{"ingredient": "steak", "ids": [3], "count": 1},
{"ingredient": "beef", "ids": [3], "count": 1},
{"ingredient": "[beef, potatoes]", "ids": [3], "count": 1},
]此外,获得一个数据集以方便地按成分查询它也是有意义的:
converter = (
c.iter(
c.zip(
c.repeat(c.item("id")),
c.item("listElements"),
)
)
.flatten()
.pipe(
c.group_by(c.item(1)).aggregate(
(
c.item(1),
c.ReduceFuncs.Array(c.item(0)),
)
)
)
.as_type(dict)
.gen_converter()
)
result = converter(input_data)
assert result == {
"apple": [1],
"peer": [1, 2],
"[apple, peer]": [1],
"banana": [1, 3],
"chocolate": [1, 2],
"[chocolate, apple]": [1],
"ginger": [2],
"[ginger, sugar]": [2],
"tofu": [2],
"[tofu, veggie]": [2],
"steak": [3],
"beef": [3],
"[beef, potatoes]": [3],
}https://stackoverflow.com/questions/71537255
复制相似问题