I need some help improving the performance of the following code.

for object in dict_of_objects.values():
    test = pd.Series(object.properties)  # properties is a dict
    series_list.append(test)

# List comprehension is not really faster than the loop since pd.Series() takes most time
#series_list = [pd.Series(object.properties) for object in dict_of_objects.values()]

# Also very slow
df = pd.DataFrame(series_list)

After timing the code, I found that pd.Series(object.properties) and pd.DataFrame(series_list) are very slow: each takes about 9 s to complete, whereas the appending takes only 0.4 s. So the list comprehension is not really an improvement, because it also calls pd.Series(object.properties).
Do you have any suggestions on how to improve the performance here?

Best, 朱尔兹

Posted on 2019-10-17 14:47:23

Let's look at some code snippets:

import numpy as np
import pandas as pd
from copy import deepcopy as cp

N_objects = 10
N_samples = 10000

class SimpleClass:
    def __init__(self, prop):
        self.properties = prop

# Ten objects, each holding a dict with two arrays of random samples
dict_of_objects = {
    'obj{}'.format(i): SimpleClass({
        'alice': np.random.rand(N_samples),
        'bob': np.random.rand(N_samples)
    }) for i in range(N_objects)
}

def slow_update(dict_of_objects):
    # One pd.Series per object, appended in a loop
    series_list = []
    for obj in dict_of_objects.values():
        test = pd.Series(obj.properties)
        series_list.append(test)
    return pd.DataFrame(series_list)

def med_update(dict_of_objects):
    # Same as slow_update, written as a list comprehension
    return pd.DataFrame([pd.Series(obj.properties) for obj in dict_of_objects.values()])

def fast_update(dict_of_objects):
    # Take the keys from the first object, then build one plain list per key
    keys = iter(dict_of_objects.values()).__next__().properties.keys()
    return pd.DataFrame({k: [obj.properties[k] for obj in dict_of_objects.values()] for k in keys})

With timings:

>>> %timeit slow_update(dict_of_objects)
2.88 ms ± 19.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit med_update(dict_of_objects)
2.86 ms ± 23.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit fast_update(dict_of_objects)
344 µs ± 17.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

fast_update does the following:
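It takes the keys from the first object, using __next__.
It builds the DataFrame from one plain list per key, so no pd.Series is created per object.
It is about 8 times faster than the other two approaches.
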
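As a rough check that building the frame by column gives the same table as building it by row, here is a minimal sketch. It assumes scalar property values rather than the arrays used in the benchmark; the ToyObject class and the sample numbers are made up for illustration.

```python
import pandas as pd

class ToyObject:
    """Hypothetical stand-in for the objects in the question."""
    def __init__(self, prop):
        self.properties = prop

objs = {
    "obj0": ToyObject({"alice": 1.0, "bob": 2.0}),
    "obj1": ToyObject({"alice": 3.0, "bob": 4.0}),
}

# Row-wise construction: one pd.Series per object (the slow path).
row_wise = pd.DataFrame([pd.Series(o.properties) for o in objs.values()])

# Column-wise construction: keys from the first object, one plain list per key.
keys = next(iter(objs.values())).properties.keys()
col_wise = pd.DataFrame({k: [o.properties[k] for o in objs.values()] for k in keys})

# Compare the two frames cell by cell.
print(row_wise.values.tolist() == col_wise.values.tolist())
```

Here next(iter(...)) is used in place of calling __next__ directly; the two are equivalent.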
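Edit: as @koPytok correctly pointed out, fast_update will not work if the properties attributes of the objects have different keys. This is worth keeping in mind if you are pulling the data from, say, a NoSQL database: in MongoDB, documents do not need to share the same fields (here, read "objects" for documents and "keys" for fields).

Enjoy!
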
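To illustrate the caveat, here is a small sketch with two objects whose properties dicts do not share the same keys. The ToyObject class and the values are hypothetical. Passing the list of dicts straight to pd.DataFrame is one possible workaround, since pandas then takes the union of the keys and fills the gaps with NaN; it was not part of the timings above, so its speed on real data is an open question.

```python
import pandas as pd

class ToyObject:
    """Hypothetical stand-in for the objects in the question."""
    def __init__(self, prop):
        self.properties = prop

objs = {
    "obj0": ToyObject({"alice": 1.0, "bob": 2.0}),
    "obj1": ToyObject({"alice": 3.0, "carol": 5.0}),  # no 'bob' key here
}

# fast_update-style construction: keys come from the first object only,
# so the lookup of 'bob' on obj1 raises KeyError.
keys = next(iter(objs.values())).properties.keys()
try:
    pd.DataFrame({k: [o.properties[k] for o in objs.values()] for k in keys})
    failed = False
except KeyError:
    failed = True

# List-of-dicts construction: missing entries become NaN.
df = pd.DataFrame([o.properties for o in objs.values()])
print(failed)
print(df)
```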
Posted on 2019-10-17 14:35:06

The same result can also be achieved, for example, as follows:

properties_list = [o.properties for o in dict_of_objects.values()]
df = pd.DataFrame(properties_list).T

Or use a dict of the properties, which requires fewer operations:

properties_dict = {k: o.properties for k, o in dict_of_objects.items()}
df = pd.DataFrame.from_dict(properties_dict)

https://stackoverflow.com/questions/58434860