对于一个生物信息学问题,我发现自己想要在参考序列的汉明距离"k“内创建一个字符串列表。我想这么快就这么做。我在纯python和cython中实现了它,有和没有类型声明。时间表现是一样的。(我还将编译的python版本与ipython中定义的解释版本进行了比较,后者的性能也是相同的。)
cfi被设置为chain.from_iterable的简写,以减少在以下模块级导入和定义中使用的点操作符的数量:
from itertools import chain
cfi = chain.from_iterable
@cython.returns(list)
def PermuteMotifOnce(cython.str motif, set alphabet={"A", "C", "G", "T"}):
"""
Gets all strings within hamming distance 1 of motif and returns it as a
list.
"""
return list(set(cfi([[
motif[:pos] + alpha + motif[pos + 1:] for
alpha in alphabet] for
pos in range(len(motif))])))
def PyPermuteMotifOnce(motif, alphabet={"A", "C", "G", "T"}):
"""
Gets all strings within hamming distance 1 of motif and returns it as a
list.
"""
return list(set(cfi([[
motif[:pos] + alpha + motif[pos + 1:] for
alpha in alphabet] for
pos in range(len(motif))])))
@cython.returns(list)
def PermuteMotifN(cython.str motif, cython.long n=-1):
assert n > 0
cdef set workingSet
cdef cython.long i
workingSet = {motif}
for i in range(n):
workingSet = set(cfi(map(PermuteMotifOnce, workingSet)))
return list(workingSet)
def PyPermuteMotifN(motif, n=-1):
assert n > 0
workingSet = {motif}
for i in range(n):
workingSet = set(cfi(map(PermuteMotifOnce, workingSet)))
return list(workingSet)结果:
motif = "ACCTACTGAACT“%timeit -n 5 PermuteMotifN(motif,6) 5循环,最佳循环为3: 6.93s /循环%timeit -n 5 PyPermuteMotifN(motif,6) 5循环,最佳循环为3: 6.81 s/循环%timeit -n 5000 PyPermuteMotifN(motif,2) 5000循环,最佳循环为3: 589微秒/循环%timeit -n 5000 PermuteMotifN(motif,2) 5000循环,最佳每循环3: 645微秒
是只有我一个人,还是纯Python看起来比Cython还要快?在额外的检查中浪费了很多时间吗?
发布于 2015-05-05 20:01:37
这里的策略是保持一个工作集,并通过在集合中元素的Hamming距离1中添加字符串来反复扩展它。这涉及到大量重复的工作,因为同一个候选字符串将被多次生成(然后丢弃,因为它们已经是工作集的成员)。
我发现,只要准确地生成每个字符串一次,我就能得到一个很大的加速。
from itertools import chain, combinations, product
def hamming_circle(s, n, alphabet):
"""Generate strings over alphabet whose Hamming distance from s is
exactly n.
>>> sorted(hamming_circle('abc', 0, 'abc'))
['abc']
>>> sorted(hamming_circle('abc', 1, 'abc'))
['aac', 'aba', 'abb', 'acc', 'bbc', 'cbc']
>>> sorted(hamming_circle('aaa', 2, 'ab'))
['abb', 'bab', 'bba']
"""
for positions in combinations(range(len(s)), n):
for replacements in product(range(len(alphabet) - 1), repeat=n):
cousin = list(s)
for p, r in zip(positions, replacements):
if cousin[p] == alphabet[r]:
cousin[p] = alphabet[-1]
else:
cousin[p] = alphabet[r]
yield ''.join(cousin)
def hamming_ball(s, n, alphabet):
"""Generate strings over alphabet whose Hamming distance from s is
less than or equal to n.
>>> sorted(hamming_ball('abc', 0, 'abc'))
['abc']
>>> sorted(hamming_ball('abc', 1, 'abc'))
['aac', 'aba', 'abb', 'abc', 'acc', 'bbc', 'cbc']
>>> sorted(hamming_ball('aaa', 2, 'ab'))
['aaa', 'aab', 'aba', 'abb', 'baa', 'bab', 'bba']
"""
return chain.from_iterable(hamming_circle(s, i, alphabet)
for i in range(n + 1))在纯Python中,运行速度大约是原始文章中代码的4倍。
https://codereview.stackexchange.com/questions/88912
复制相似问题