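I have gene expression data, represented as a list of genes and a list of values. I average the expression data for all genes that share the same name.
For example: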
genes = ['A', 'C', 'C', 'B', 'A']
vals = [[2.0, 2.0, 9.0, 9.0],  # A: will be averaged with row=4
        [3.0, 3.0, 3.0, 3.0],  # C: will be averaged with row=2
        [8.0, 8.0, 2.0, 2.0],  # C: will be averaged with row=1
        [4.0, 4.0, 4.0, 3.0],  # B: is fine
        [1.0, 1.0, 1.0, 1.0]]  # A: will be averaged with row=0
is converted to:
genes = ['A', 'B', 'C']
vals = [[1.5, 1.5, 5.0, 5.0],
        [4.0, 4.0, 4.0, 3.0],
        [5.5, 5.5, 2.5, 2.5]]
Here is my function:
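import numpy as np

def avg_dups(genes, values):
    """Finds duplicate genes and averages their expression data.

    Both arguments must be NumPy arrays: genes == gene builds a boolean
    row mask only when genes is an array, not a plain list.
    """
    unq_genes = np.unique(genes)
    out_values = np.zeros((unq_genes.shape[0], values.shape[1]))
    for i, gene in enumerate(unq_genes):
        # Select every row belonging to this gene and average column-wise.
        dups = values[genes == gene]
        out_values[i] = np.mean(dups, axis=0)
    return (unq_genes, out_values)
This function is slower than any other part of my data pipeline: it takes 5-10 seconds, while the other steps, which also run over the entire dataset, finish in under a second. Is there any way I can improve it?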
Posted on 2015-02-20 00:56:17
This seems to be the fastest so far:
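import numpy
from numpy import newaxis

def avg_dups(genes, values):
    # Unique gene names, the old-row -> new-row mapping, and group sizes.
    folded, indices, counts = numpy.unique(genes, return_inverse=True, return_counts=True)
    output = numpy.zeros((folded.shape[0], values.shape[1]))
    # Accumulate each input row into the output row of its gene.
    numpy.add.at(output, indices, values)
    # Turn the per-gene sums into means.
    output /= counts[:, newaxis]
    return folded, output
This finds the unique genes to fold the values into, along with the current index → new index mapping and the number of repeated values that map to each new index: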
folded, indices, counts = numpy.unique(genes, return_inverse=True, return_counts=True)
It then adds the row at each current index to the row at its new index in the new output array:
output = numpy.zeros((folded.shape[0], values.shape[1]))
numpy.add.at(output, indices, values)
numpy.add.at(output, indices, values) is used instead of output[indices] += values because the buffering used by += breaks the code when indices are repeated.
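The difference between the two forms can be seen on a small case. The following is only a minimal sketch with made-up toy data (the arrays `indices` and `values` here are hypothetical, not the gene data above), in which two input rows map to the same output row:

```python
import numpy as np

# Hypothetical toy data: input rows 0 and 2 both map to output row 0.
indices = np.array([0, 1, 0])
values = np.array([[1.0, 1.0],
                   [2.0, 2.0],
                   [3.0, 3.0]])

# Buffered form: output[indices] is read once, added to, then written back,
# so only one of the two contributions to row 0 survives.
buffered = np.zeros((2, 2))
buffered[indices] += values

# Unbuffered form: every occurrence of a repeated index is accumulated.
unbuffered = np.zeros((2, 2))
np.add.at(unbuffered, indices, values)

print(buffered)    # row 0 is not the full sum of rows 0 and 2
print(unbuffered)  # row 0 is the sum of input rows 0 and 2
```

Only the unbuffered result holds per-group sums, which is what the subsequent division by the group sizes relies on.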
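The mean is then obtained with a simple division by the number of repeated values that map to the same index:
output /= counts[:, newaxis]
Using Ashwini Chaudhary's generate_test_data(2000) (which gives a 10000x4 array), my rough timings are: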
name             time/ms  Author
avg_dups             230  gwg
avg_dups_fast         33  Ashwini Chaudhary
avg_dups_python       45  Ashwini Chaudhary
avg_dups             430  Veedrac
avg_dups               5  Veedrac with Jaime's improvement
https://codereview.stackexchange.com/questions/82010