Q: Pandas DataFrame performance vs. list performance

Stack Overflow user

Asked 2016-09-28 08:19:52

1 answer, 8.5K views, 0 followers, 9 votes

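I am comparing two DataFrames to determine whether each row in df1 is a prefix of any row in df2. df1 has on the order of a thousand entries; df2 has millions.

This does the job, but it is rather slow.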

df1['name'].map(lambda x: any(df2['name'].str.startswith(x)))

When run on a subset of df1 (10 items), this is the result:

35243     True
39980    False
40641    False
45974    False
53788    False
59895     True
61856    False
81083     True
83054     True
87717    False
Name: name, dtype: bool
Time: 57.8873581886 secs

When I convert df2 to a list, it runs much faster:

df2_list = df2['name'].tolist()

df1['name'].map(lambda x: any(item.startswith(x + ' ') for item in df2_list))

35243     True
39980    False
40641    False
45974    False
53788    False
59895     True
61856    False
81083     True
83054     True
87717    False
Name: name, dtype: bool
Time: 33.0746209621 secs

Why is iterating over a list faster than iterating over a Series?

python

pandas

1 Answer

Stack Overflow user

Accepted answer

Answered 2016-09-28 18:25:28

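any() returns as soon as it sees a True value, so the list version makes fewer startswith() calls than the DataFrame version, where df2['name'].str.startswith(x) tests every element of df2 before any() looks at the result.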
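The early-return behaviour can be seen by counting calls. This is a minimal sketch in plain Python; the `counting_startswith` helper and the sample names are made up for illustration and are not part of the original question.

```python
calls = 0

def counting_startswith(item, prefix):
    """Hypothetical wrapper around str.startswith that counts its calls."""
    global calls
    calls += 1
    return item.startswith(prefix)

names = ["alpha one", "beta two", "gamma three", "delta four"]  # made-up data
prefix = "alpha"

# Lazy: any() consumes a generator and stops at the first True.
calls = 0
lazy_result = any(counting_startswith(item, prefix) for item in names)
lazy_calls = calls

# Eager: the whole list of results is built before any() runs,
# similar to computing a full boolean Series first.
calls = 0
eager_result = any([counting_startswith(item, prefix) for item in names])
eager_calls = calls

print(lazy_result, lazy_calls)    # True 1
print(eager_result, eager_calls)  # True 4
```

The saving only applies to prefixes that have a match. For a prefix with no match, both versions test every element.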

Here is an approach that uses searchsorted():

import random, string
import pandas as pd
import numpy as np

def randomword(length):
    return ''.join(random.choice(string.ascii_lowercase) for i in range(length))


xs = pd.Series([randomword(3) for _ in range(1000)])
ys = pd.Series([randomword(10) for _ in range(10000)])

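def is_any_prefix1(xs, ys):
    # After sorting, any y that starts with x sits at x's insertion point.
    yo = ys.sort_values().reset_index(drop=True)
    pos = yo.searchsorted(xs)
    # Clip so that an x sorting after every y does not index past the end.
    y2 = yo.iloc[np.minimum(pos, len(yo) - 1)]
    return np.fromiter(map(str.startswith, y2, xs), dtype=bool)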

def is_any_prefix2(xs, ys):
    x = xs.tolist()
    y = ys.tolist()
    return np.fromiter((any(yi.startswith(xi) for yi in y) for xi in x), dtype=bool)

res1 = is_any_prefix1(xs, ys)
res2 = is_any_prefix2(xs, ys)
print(np.all(res1 == res2))

%timeit is_any_prefix1(xs, ys)
%timeit is_any_prefix2(xs, ys)

Output:

True
100 loops, best of 3: 17.8 ms per loop
1 loop, best of 3: 2.35 s per loop

That is more than 100 times faster (17.8 ms versus 2.35 s).
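The searchsorted() approach relies on a property of sorted strings: every string that starts with a given prefix sorts at or after the prefix itself, and all such strings are contiguous. So only the element at the prefix's insertion point needs to be checked. Below is a minimal sketch of the same idea with the standard-library `bisect` module; the `has_prefix_match` helper and the sample data are illustrative assumptions, not code from the answer.

```python
from bisect import bisect_left

def has_prefix_match(prefix, sorted_items):
    """Return True if any string in sorted_items starts with prefix.

    sorted_items must already be sorted in ascending order.
    """
    pos = bisect_left(sorted_items, prefix)
    # pos == len(sorted_items) means the prefix sorts after every item,
    # so nothing can start with it.
    return pos < len(sorted_items) and sorted_items[pos].startswith(prefix)

items = sorted(["banana", "apple pie", "cherry", "apricot"])  # made-up data
prefixes = ["ap", "b", "blue", "z"]

results = {p: has_prefix_match(p, items) for p in prefixes}
brute_force = {p: any(item.startswith(p) for item in items) for p in prefixes}

print(results)  # {'ap': True, 'b': True, 'blue': False, 'z': False}
```

Each lookup costs one binary search plus one startswith() call, instead of a scan over the whole list.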

Votes: 4

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/39736195
