Q: Improving the performance of a pandas apply function

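Stack Overflow user

Asked 2021-11-02 13:19:25

Answers: 3 | Views: 110 | Followers: 0 | Score: 1

I have a pandas DataFrame with a column that contains dictionaries. I also have a query dictionary, and I want to compute the minimum of the two sums of values taken over the keys the dictionaries have in common.

For example: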

dicta = {'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67}
dictb = {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67}
common_keys = {'a', 'b'}
s1 = dicta['a'] + dicta['b']  # 26
s2 = dictb['a'] + dictb['b']  # 2
result = min(s1, s2)          # 2

I use the code below to compute it.

def compute_common(dict1, dict2):
    # Keys present in both dictionaries
    common_keys = dict1.keys() & dict2.keys()
    # Sum each dictionary's values over the shared keys
    im_count1 = sum(dict1[k] for k in common_keys)
    im_count2 = sum(dict2[k] for k in common_keys)
    return int(min(im_count1, im_count2))

Here are the timings on my 8-core i7 machine with 8 GB of RAM.

%timeit df['a'].apply(lambda x:compute_common(dictb, x))
55.2 ms ± 702 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

I also found that swifter can be used to speed up pandas apply (it uses multiprocessing internally).

%timeit df['a'].swifter.progress_bar(False).apply(lambda x:compute_common(dictb, x))
66.4 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

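With swifter it is even slower (probably because of the multiprocessing overhead). I would like to know whether there is any way to get more performance out of this operation.

You can use the following to reproduce it.

import pandas as pd

dicta = {'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67}
dictb = {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67}
df = pd.DataFrame({'a': [dicta] * 30000})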

%timeit df['a'].apply(lambda x:compute_common(dictb, x))
%timeit df['a'].swifter.progress_bar(False).apply(lambda x:compute_common(dictb, x))

Thanks in advance.

python

pandas

swifter

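3 Answers

Stack Overflow user

Answered 2021-11-02 13:49:49

Use a list comprehension to find the keys common to both dictionaries, sum each dictionary's values over those keys, and take the minimum of the two sums. Here common_keys is the list ['a', 'b']. The sums over 'a' and 'b' are 26 and 2, and the minimum of 26 and 2 is 2.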

def find_common_keys(dicta, dictb):
    '''
    >>> find_common_keys({'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67},
    ...                  {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67})
    2
    '''
    # Keys present in both dictionaries
    common_keys = [key for key in dicta if key in dictb]

    # Sum each dictionary's values over the shared keys
    s1 = sum(dicta[key] for key in common_keys)
    s2 = sum(dictb[key] for key in common_keys)
    return min(s1, s2)

dicta = {'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67}
dictb = {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67}

print(find_common_keys(dicta, dictb))

Output:

2
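The answer shows the function on a single pair of dictionaries. A minimal sketch of how it might be applied to the DataFrame column from the question is given below. The list comprehension over the column is an assumption for illustration, not part of the original answer, and no timing claim is made for it.

```python
import pandas as pd

def find_common_keys(dicta, dictb):
    """Return the smaller of the two sums taken over the keys shared by both dicts."""
    common_keys = [key for key in dicta if key in dictb]
    s1 = sum(dicta[key] for key in common_keys)
    s2 = sum(dictb[key] for key in common_keys)
    return min(s1, s2)

# Reproduction data from the question
dicta = {'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67}
dictb = {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67}
df = pd.DataFrame({'a': [dicta] * 30000})

# Illustrative alternative to Series.apply: iterate over the column values
# directly and rebuild a Series on the original index.
result = pd.Series(
    [find_common_keys(dictb, row) for row in df['a']],
    index=df.index,
)
print(result.head())
```

Whether this is faster than `apply` on a given machine would need to be measured with `%timeit`, as in the question.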

Score: 1

Stack Overflow user

Answered 2021-11-02 13:33:48

You can expand the dictionaries into a DataFrame and sum them.

# One column per key, one row per dictionary
dict_data = pd.DataFrame(df['a'].tolist())

common_keys = dict_data.columns.intersection(dictb.keys())

dictb_sum = sum(dictb[k] for k in common_keys)

dicta_sum = dict_data[common_keys].sum(1)

# Element-wise minimum of each row sum and dictb_sum
output = dicta_sum.clip(upper=dictb_sum)

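This is about twice as fast as apply on my system. Note that it works well when union(x.keys() for x in df['a']) is not too large, since those keys all become columns of dict_data, yet large enough that you benefit from the vectorized .sum(1).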
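As a rough check, the vectorized steps can be run end to end against the reproduction data from the question and compared with the row-wise `apply` result. This is only a sketch: `list(dictb)` is passed to `intersection` instead of `dictb.keys()` as a conservative choice, and `expected` is a name introduced here for the comparison. One caveat worth keeping in mind: `common_keys` is computed once over all columns, so the result matches the per-row computation only when every row actually contains all of those keys, which holds for this data.

```python
import pandas as pd

def compute_common(dict1, dict2):
    # Row-wise reference implementation from the question
    common_keys = dict1.keys() & dict2.keys()
    im_count1 = sum(dict1[k] for k in common_keys)
    im_count2 = sum(dict2[k] for k in common_keys)
    return int(min(im_count1, im_count2))

dicta = {'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67}
dictb = {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67}
df = pd.DataFrame({'a': [dicta] * 30000})

# Expand the dictionaries: one column per key, one row per dictionary
dict_data = pd.DataFrame(df['a'].tolist())

# Keys shared by the expanded columns and the query dictionary
common_keys = dict_data.columns.intersection(list(dictb))

dictb_sum = sum(dictb[k] for k in common_keys)
dicta_sum = dict_data[common_keys].sum(axis=1)

# Element-wise minimum of each row sum and the query-side sum
output = dicta_sum.clip(upper=dictb_sum)

# Compare with the original apply-based computation
expected = df['a'].apply(lambda x: compute_common(dictb, x))
print((output == expected).all())
```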

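Score: 0

Stack Overflow user

Answered 2021-11-10 16:36:20

Here are some of my findings, shared in case they help someone else. These are the optimizations I was able to achieve. I tried to extend @Golden's idea.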

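Simply compiling the function with Cython gives a 10% performance boost.
Because Python is loosely typed, writing the Cython function with type declarations improves performance further.
Because function calls in Python are expensive, converting min(x1, x2) to x1 if x1 < x2 else x2 gives a performance benefit.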
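The last point can be tried in plain Python before involving Cython. The sketch below only shows that the two forms return the same value and how one might time them with `timeit`. The helper names are hypothetical, the measurement is pure Python rather than Cython, and the numbers will vary by machine and interpreter version, so none are quoted here.

```python
import timeit

def smaller_with_min(x1, x2):
    # Uses the built-in min(), which involves an extra function call
    return min(x1, x2)

def smaller_with_conditional(x1, x2):
    # Same result for comparable numbers, written as a conditional expression
    return x1 if x1 < x2 else x2

# Both forms agree on the sums from the question's example (26 and 2)
assert smaller_with_min(26, 2) == smaller_with_conditional(26, 2)

# Time each form; compare the two values on your own machine
t_min = timeit.timeit(lambda: smaller_with_min(26, 2), number=200_000)
t_cond = timeit.timeit(lambda: smaller_with_conditional(26, 2), number=200_000)
print(f"min(): {t_min:.4f} s, conditional: {t_cond:.4f} s")
```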

The final function I used gave me a 3x performance improvement.

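cpdef int cython_common(dict_1, dict_2):
    # Each argument is indexed with [0], so it is expected to be a
    # sequence whose first element is the dictionary.
    cdef dict dict1 = dict_1[0]
    cdef dict dict2 = dict_2[0]
    # Keys present in both dictionaries
    cdef list common_keys = [key for key in dict1 if key in dict2]
    cdef int sum1 = 0
    cdef int sum2 = 0
    for i in common_keys:
        sum1 += dict1[i]
        sum2 += dict2[i]
    # Conditional expression instead of a call to min()
    return sum1 if sum1 < sum2 else sum2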

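In addition, from some experiments I found that libraries such as pandarallel and swifter give a speedup when the dataset has a large number of rows (for fewer rows, I think the overhead of spawning processes and merging the results is much larger than the computation itself).

Also, this is a great read.

Score: 0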

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/69811182
