Q: Python pandas tensor access is very slow

Stack Overflow user

Asked on 2016-05-19 09:02:29

Answers: 1, Views: 106, Followers: 0, Votes: 1

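I am building a huge tensor of millions of word triples and their counts. For example, a word triple is (word0, link, word1). These word triples are collected in a dictionary whose values are their respective counts, e.g. (word0, link, word1): 15. Imagine I have several million such triples. After counting the occurrences, I try to do further calculations, and this is where my Python script gets stuck. Here is the part of the code that takes forever: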

big_tuple = covert_to_tuple(big_dict)
pdf = pd.DataFrame.from_records(big_tuple)
pdf.columns = ['word0', 'link', 'word1', 'counts']
total_cnts = pdf.counts.sum()

for _, row in pdf.iterrows():
    w0, link, w1 = row['word0'], row['link'], row['word1']
    w0w1_link = row.counts

    # very slow
    w0_link = pdf[(pdf.word0 == w0) & (pdf.link == link)]['counts'].sum()
    w1_link = pdf[(pdf.word1 == w1) & (pdf.link == link)]['counts'].sum()

    p_w0w1_link = w0w1_link / total_cnts
    p_w0_link = w0_link / total_cnts
    p_w1_link = w1_link / total_cnts
    new_score = log(p_w0w1_link / (p_w0_link * p_w1_link))
    big_dict[(w0, link, w1)] = new_score

I profiled my script, and the following two lines

w0_link = pdf[(pdf.word0 == w0) & (pdf.link == link)]['counts'].sum()  
w1_link = pdf[(pdf.word1 == w1) & (pdf.link == link)]['counts'].sum()

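each take 49% of the computation time. These lines look up the counts for (word0, link) and (word1, link). So it seems that accessing pdf like this takes a long time? Is there anything I can do to optimize it?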

python

pandas

1 Answer

Stack Overflow user

Accepted answer

Posted on 2016-05-19 12:30:00

Please check my solution. I optimized a few things in the calculation (hopefully without mistakes :)

import numpy as np
import pandas as pd

# sample data
df = pd.DataFrame({'word0': list('aabb'), 'link': list('llll'), 'word1': list('cdcd'),'counts': [10, 20, 30, 40]})

# caching total count
total_cnt = df['counts'].sum()

# two Series with the raw count sums for all combinations of ('word0', 'link') and ('word1', 'link')
# (kept as counts, not divided by total_cnt, because the score expression below already accounts for total_cnt)
grouped_w0_l = df.groupby(['word0', 'link'])['counts'].sum()
grouped_w1_l = df.groupby(['word1', 'link'])['counts'].sum()

# join sums for grouped ('word0', 'link') to original df
merged_w0 = df.set_index(['word0', 'link']).join(grouped_w0_l, how='left', rsuffix='_w0').reset_index()

# join sums for grouped ('word1', 'link') to merged df
merged_w0_w1 = merged_w0.set_index(['word1', 'link']).join(grouped_w1_l, how='left', rsuffix='_w1').reset_index()

# merged_w0_w1 now has all the data needed to calculate new_score
# note: the expression is algebraically transformed to work on counts directly
merged_w0_w1['new_score'] = np.log(merged_w0_w1['counts'] * total_cnt / (merged_w0_w1['counts_w0'] * merged_w0_w1['counts_w1']))

# export the results to a dict (not sure whether this is really needed; you can keep working with DataFrames instead)
big_dict = merged_w0_w1.set_index(['word0', 'link', 'word1'])['new_score'].to_dict()
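
As a possible alternative to the two joins above, the per-group sums can be broadcast back onto the rows with `groupby(...).transform('sum')`. This is only a sketch on the same sample data, not part of the original answer; the names `w0_link`, `w1_link` and `scores` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Same sample data as in the answer above
df = pd.DataFrame({
    'word0': list('aabb'),
    'link': list('llll'),
    'word1': list('cdcd'),
    'counts': [10, 20, 30, 40],
})

total_cnt = df['counts'].sum()

# transform('sum') returns a Series aligned with df: every row receives
# the total count of the (word, link) group it belongs to
w0_link = df.groupby(['word0', 'link'])['counts'].transform('sum')
w1_link = df.groupby(['word1', 'link'])['counts'].transform('sum')

# Score computed from raw counts: log(c * N / (c_w0 * c_w1))
df['new_score'] = np.log(df['counts'] * total_cnt / (w0_link * w1_link))

# Illustrative export keyed by (word0, link, word1)
scores = df.set_index(['word0', 'link', 'word1'])['new_score'].to_dict()
```

Because the transformed Series share the index of `df`, no `set_index`/`reset_index` round trip is needed, and the row order of `df` is left unchanged.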

The expression for new_score is

new_score = log(p_w0w1_link / (p_w0_link * p_w1_link))
          = log((w0w1_link / total_cnts) / ((w0_link / total_cnts) * (w1_link / total_cnts)))
          = log((w0w1_link / total_cnts) * (total_cnts * total_cnts / (w0_link * w1_link)))
          = log(w0w1_link * total_cnts / (w0_link * w1_link))
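
The simplification can be spot-checked numerically. The sketch below uses made-up counts (10, 30, 40 and a total of 100 are assumptions for illustration only) and compares the probability form from the question with the count form.

```python
import math

# Hypothetical counts, chosen only to illustrate the identity
w0w1_link, w0_link, w1_link, total_cnts = 10, 30, 40, 100

# Probability form, as written in the question
p_w0w1_link = w0w1_link / total_cnts
p_w0_link = w0_link / total_cnts
p_w1_link = w1_link / total_cnts
from_probabilities = math.log(p_w0w1_link / (p_w0_link * p_w1_link))

# Count form, after cancelling one factor of total_cnts
from_counts = math.log(w0w1_link * total_cnts / (w0_link * w1_link))

# Both forms should agree up to floating-point rounding
assert math.isclose(from_probabilities, from_counts)
```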

Votes: 2

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/37318554
