Question: Averaging lists of values with duplicate keys

Code Review user

Asked 2015-02-19 23:45:19

1 answer, 2.2K views, 0 followers, 4 votes

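I have gene expression data represented as a list of genes and a list of values. I want to average the expression data for all genes that share the same name.

For example: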

genes = ['A', 'C', 'C', 'B', 'A']
vals  = [[2.0, 2.0, 9.0, 9.0], # A: will be averaged with row=4
         [3.0, 3.0, 3.0, 3.0], # C: will be averaged with row=2
         [8.0, 8.0, 2.0, 2.0], # C: will be averaged with row=1
         [4.0, 4.0, 4.0, 3.0], # B: is fine
         [1.0, 1.0, 1.0, 1.0]] # A: will be averaged with row=0

is converted to

genes = ['A', 'B', 'C']
vals  = [[1.5, 1.5, 5.0, 5.0],
         [4.0, 4.0, 4.0, 3.0],
         [5.5, 5.5, 2.5, 2.5]]

This is my function:

import numpy as np

def avg_dups(genes, values):
    """Find duplicate genes and average their expression data.

    Both arguments are expected to be NumPy arrays.
    """
    unq_genes = np.unique(genes)
    out_values = np.zeros((unq_genes.shape[0], values.shape[1]))
    for i, gene in enumerate(unq_genes):
        # Boolean mask selects every row belonging to this gene.
        dups = values[genes == gene]
        out_values[i] = np.mean(dups, axis=0)
    return (unq_genes, out_values)

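This function is slower than any other part of my data pipeline: it takes 5-10 seconds, while the other steps, which also run over the whole dataset, take less than a second. Is there anything I can do to improve it?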

Tags: python, performance, algorithm, numpy

1 Answer

Code Review user

Accepted answer

Answered 2015-02-20 00:56:17

This seems to be the fastest so far:

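import numpy
from numpy import newaxis

def avg_dups(genes, values):
    # Unique genes, the old-row -> new-row mapping, and the size of each group.
    folded, indices, counts = numpy.unique(genes, return_inverse=True, return_counts=True)

    output = numpy.zeros((folded.shape[0], values.shape[1]))
    # Unbuffered add, so rows that share an index accumulate.
    numpy.add.at(output, indices, values)
    # Divide each row's sum by its group size to get the mean.
    output /= counts[:, newaxis]

    return folded, output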

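This finds the unique genes to fold the values into, along with the current index → new index mapping and the number of repeated values that map to each new index: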

    folded, indices, counts = numpy.unique(genes, return_inverse=True, return_counts=True)
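As a small illustration of what those three return values contain, here is a sketch that runs the same call on the gene list from the question. It assumes the genes are held in a NumPy array and imports NumPy as `np` for brevity.

```python
import numpy as np

genes = np.array(['A', 'C', 'C', 'B', 'A'])

folded, indices, counts = np.unique(genes, return_inverse=True, return_counts=True)

print(folded)   # sorted unique genes: A, B, C
print(indices)  # for each input row, its position in folded: 0, 2, 2, 1, 0
print(counts)   # how many input rows map to each unique gene: 2, 1, 2
```

Note that `folded[indices]` reconstructs the original `genes` array, which is what makes `indices` usable as a row mapping.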

It adds each row at its current index to the row at its new index in the new output:

    output = numpy.zeros((folded.shape[0], values.shape[1]))
    numpy.add.at(output, indices, values)

numpy.add.at(output, indices, values) is used instead of output[indices] += values because the buffering used by += breaks the code for repeated indices.
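The difference is easy to see on a tiny input. The sketch below uses the index mapping from the question's example and a column of ones as illustrative values, so each output row should end up holding the number of input rows that map to it.

```python
import numpy as np

indices = np.array([0, 2, 2, 1, 0])  # rows 0 and 4 share index 0; rows 1 and 2 share index 2
values = np.ones((5, 1))

# Buffered in-place add: each target row is written once, so repeats are lost.
buffered = np.zeros((3, 1))
buffered[indices] += values

# Unbuffered add: every occurrence of a repeated index is accumulated.
unbuffered = np.zeros((3, 1))
np.add.at(unbuffered, indices, values)

print(buffered.ravel())    # 1, 1, 1
print(unbuffered.ravel())  # 2, 1, 2
```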

The mean is then taken with a simple division by the number of repeated values that map to the same index:

    output /= counts[:, newaxis]
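To check the three steps together, the following sketch applies them to the example data from the question and compares the result with the expected output given there. The variable names `sums` and `means` are chosen here for illustration only.

```python
import numpy as np

genes = np.array(['A', 'C', 'C', 'B', 'A'])
vals = np.array([[2.0, 2.0, 9.0, 9.0],
                 [3.0, 3.0, 3.0, 3.0],
                 [8.0, 8.0, 2.0, 2.0],
                 [4.0, 4.0, 4.0, 3.0],
                 [1.0, 1.0, 1.0, 1.0]])

folded, indices, counts = np.unique(genes, return_inverse=True, return_counts=True)

# Per-gene column sums, accumulated without buffering.
sums = np.zeros((folded.shape[0], vals.shape[1]))
np.add.at(sums, indices, vals)

# counts has shape (3,); counts[:, np.newaxis] has shape (3, 1), so it
# broadcasts across the 4 columns and divides each row by its own count.
means = sums / counts[:, np.newaxis]

expected = np.array([[1.5, 1.5, 5.0, 5.0],
                     [4.0, 4.0, 4.0, 3.0],
                     [5.5, 5.5, 2.5, 2.5]])
print(np.allclose(means, expected))  # True
```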

Using Ashwini Chaudhary's generate_test_data(2000) (which gives a 10000x4 array), my rough timings are:

name             time/ms  Author
avg_dups           230    gwg
avg_dups_fast       33    Ashwini Chaudhary
avg_dups_python     45    Ashwini Chaudhary
avg_dups           430    Veedrac
avg_dups             5    Veedrac with Jaime's improvement
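For anyone who wants to reproduce a comparison of this kind, here is a rough timing sketch. `generate_test_data` is not shown in this thread, so `make_test_data` below is a hypothetical stand-in that only matches the stated 10000x4 shape; the function names `avg_dups_loop` and `avg_dups_add_at` are likewise labels chosen here. Absolute numbers will differ by machine and NumPy version.

```python
import timeit
import numpy as np

def make_test_data(n_genes, n_cols=4, repeats=5, seed=0):
    """Hypothetical stand-in for generate_test_data: n_genes * repeats shuffled rows."""
    rng = np.random.default_rng(seed)
    genes = rng.permutation(np.repeat(np.arange(n_genes), repeats))
    values = rng.random((genes.shape[0], n_cols))
    return genes, values

def avg_dups_loop(genes, values):
    # The question's approach: one boolean mask and one mean per unique gene.
    unq_genes = np.unique(genes)
    out = np.zeros((unq_genes.shape[0], values.shape[1]))
    for i, gene in enumerate(unq_genes):
        out[i] = values[genes == gene].mean(axis=0)
    return unq_genes, out

def avg_dups_add_at(genes, values):
    # The answer's approach: accumulate sums with add.at, then divide by counts.
    folded, indices, counts = np.unique(genes, return_inverse=True, return_counts=True)
    out = np.zeros((folded.shape[0], values.shape[1]))
    np.add.at(out, indices, values)
    out /= counts[:, np.newaxis]
    return folded, out

genes, values = make_test_data(2000)  # 10000 rows, 4 columns

# Both versions should agree before their speed is compared.
loop_genes, loop_out = avg_dups_loop(genes, values)
fast_genes, fast_out = avg_dups_add_at(genes, values)
same = np.array_equal(loop_genes, fast_genes) and np.allclose(loop_out, fast_out)
print("results agree:", same)

for fn in (avg_dups_loop, avg_dups_add_at):
    seconds = timeit.timeit(lambda: fn(genes, values), number=3) / 3
    print(f"{fn.__name__}: {seconds * 1000:.1f} ms")
```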

3 votes

Original page content provided by Code Review.

Original link:

https://codereview.stackexchange.com/questions/82010
