文章/答案/技术大牛

发布

社区首页 >问答首页 >在python中查找大字符串中所有子字符串的最快方法是什么？

问在python中查找大字符串中所有子字符串的最快方法是什么？
EN

Stack Overflow用户

提问于 2016-09-03 20:15:08

回答 3查看 2.3K关注 0票数 4

编辑以澄清输入/输出。我认为无论如何，这都会有点慢，但到目前为止，我还没有真正考虑到我的python脚本中的速度，我正在试图找到加速这样的操作的方法。

我的输入是被腌制的基因组序列字典。我目前正在研究两个基因组，一个是出芽酵母基因组(磁盘上有11.5 MB )，另一个是人类基因组(2.8GB)。这些字典的形式如下：

seq_d = { 'chr1' : 'ATCGCTCGCTGCTCGCT', 'chr2' : 'CGATCAGTCATGCATGCAT', 
        'chr3' : 'ACTCATCATCATCATACTGGC' }

我想在基因组的两条链中找到所有核苷酸的单碱基实例。其中，+链是指上述字典中的序列，而-链是序列的反向补充。我的输出是一个嵌套字典，其中顶级的keys是+或-，嵌套的keys是染色体名，值是0索引位置的列表：

nts = 'T'
test_d = {'+': {'chr3': [2, 5, 8, 11, 14, 17], 'chr2': [3, 7, 10, 14, 18], 
           'chr1': [1, 5, 9, 12, 16]}, '-': {'chr3': [0, 4, 7, 10, 13, 15], 
           'chr2': [2, 5, 9, 13, 17], 'chr1': [0]}}

test_d在脚本后面的大型Illumina排序数据集中定义了一组要检查的位置。

我的第一次尝试使用了enumerate和迭代。

import time
import numpy as np

rev_comps = { 'A' : 'T', 'T' : 'A', 'G' : 'C', 'C' : 'G', 'N' : 'N'}
test_d = { '+' : {}, '-' : {}}
nts = 'T'

s = time.time()
for chrom in seq_d:
    plus_pos, minus_pos = [], []
    chrom_seq = seq_d[chrom]
    for pos, nt in enumerate(chrom_seq):
        if nt in nts:
            plus_pos.append(pos)
        if rev_comps[nt] in nts:
            minus_pos.append(pos)

    test_d['+'][chrom] = plus_pos
    test_d['-'][chrom] = minus_pos
e = time.time()
print 'The serial version took {} minutes...'.format((e-s)/60)

酵母产量：

The serial version took 0.0455190300941 minutes...

人类的产出：

The serial version took 10.1694028815 minutes...

我尝试使用numpy数组而不是enumerate()和迭代：

s = time.time()
for chrom in seq_d:
    chrom_seq = np.array(list(seq_d[chrom]))
    nts = list(nts)
    rev_nts = [rev_comps[nt] for nt in nts]
    plus_pos = list(np.where(np.in1d(chrom_seq, nts) == True)[0])
    minus_pos = list(np.where(np.in1d(chrom_seq, rev_nts) == True)[0])

    test_d['+'][chrom] = plus_pos
    test_d['-'][chrom] = minus_pos
e = time.time()
print 'The numpy version took {} minutes...'.format((e-s)/60)

酵母产量：

The numpy version took 0.0309354345004 minutes...

人类的产出：

The numpy version took 9.86174853643 minutes...

为什么numpy版本失去了对更大的人类基因组的性能优势？有更快的方法吗？我尝试使用multiprocessing.Pool实现一个版本，但这比这两个版本都慢：

def getTestHelper(chrom_seq, nts, rev_comp):

    rev_comps = { 'A' : 'T', 'T' : 'A', 'G' : 'C', 'C' : 'G', 'N' : 'N'}
    if rev_comp:
        nts = [rev_comps[nt] for nt in nts]
    else:
        nts = list(nts)
    chrom_seq = np.array(list(chrom_seq))
    mask = np.in1d(chrom_seq, nts)
    pos_l = list(np.where(mask == True)[0])
    return pos_l

s = time.time()
pool = Pool(4)
plus_pos = pool.map(functools.partial(getTestHelper, nts=nts, rev_comp=False), seq_d.values())
minus_pos = pool.map(functools.partial(getTestHelper, nts=nts, rev_comp=True), seq_d.values())
e = time.time()
print 'The parallel version took {} minutes...'.format((e-s)/60)

我还没有在人类基因组上运行过这个，但是酵母版本要慢一些：

The parallel version took 0.652778700987 minutes...

python

回答 3

Stack Overflow用户

回答已采纳

发布于 2016-09-04 00:04:33

老实说，我认为你做的事是对的。

不过，您还可以对代码进行一些调整。通常，当性能是关键时，只在最里面的循环中执行最小的操作。纵观您的代码，在这方面仍然存在一些快速优化：

使用if...elif而不是if...if。
不要在不需要使用列表的地方使用列表--例如，对于nts和相反，只使用一个字符串就足够了。
不要对相同的结果进行多次评估--例如反向查找。

我猜您的多处理问题归结于这些非常大的字符串的序列化，抵消了并行运行可能带来的任何性能增益。但是，可能还有另一种方法--参见Numpy向量运算的并行化。我无法验证，因为我在安装numexpr时遇到了困难。

将它们放在一起并在此过程中尝试其他一些建议，我得到以下结果：

$ python test.py
Original serial took 0.08330821593602498 minutes...
Using sets took 0.09072601397832235 minutes...
Using built-ins took 0.061421032746632895 minutes...
Using regex took 0.11649663050969442 minutes...
Optimized serial took 0.05909080108006795 minutes...
Original numpy took 0.04050511916478475 minutes...
Optimized numpy took 0.03438538312911987 minutes...

我的代码如下。

import time
import numpy as np
from random import choice
import re

# Create single large chromosome for the test...
seq = ""
for i in range(10000000):
    seq += choice("ATGCN")
seq_d = {"Chromosome1": seq}


rev_comps = { 'A' : 'T', 'T' : 'A', 'G' : 'C', 'C' : 'G', 'N' : 'N'}
nts = 'T'

# Original serial implementation
def serial():
    test_d = { '+' : {}, '-' : {}}
    for chrom in seq_d:
        plus_pos, minus_pos = [], []
        chrom_seq = seq_d[chrom]
        for pos, nt in enumerate(chrom_seq):
            if nt in nts:
                plus_pos.append(pos)
            if rev_comps[nt] in nts:
                minus_pos.append(pos)

        test_d['+'][chrom] = plus_pos
        test_d['-'][chrom] = minus_pos

# Optimized for single character tests
def serial2():
    test_d = {'+': {}, '-': {}}
    rev_nts = rev_comps[nts]
    for chrom in seq_d:
        plus_pos, minus_pos = [], []
        chrom_seq = seq_d[chrom]
        for pos, nt in enumerate(chrom_seq):
            if nt == nts:
                plus_pos.append(pos)
            elif nt == rev_nts:
                minus_pos.append(pos)

        test_d['+'][chrom] = plus_pos
        test_d['-'][chrom] = minus_pos

# Use sets instead of lists
def set_style():
    test_d = { '+' : {}, '-' : {}}
    for chrom in seq_d:
        plus_pos, minus_pos = set(), set()
        chrom_seq = seq_d[chrom]
        for pos, nt in enumerate(chrom_seq):
            if nt in nts:
                plus_pos.add(pos)
            if rev_comps[nt] in nts:
                minus_pos.add(pos)

        test_d['+'][chrom] = plus_pos
        test_d['-'][chrom] = minus_pos

# Use regex to find either nucleotide...
def regex_it():
    test_d = { '+' : {}, '-' : {}}
    search = re.compile("(T|A)")
    for chrom in seq_d:
        pos = 0
        plus_pos, minus_pos = [], []
        chrom_seq = seq_d[chrom]
        for sub_seq in search.split(chrom_seq):
            if len(sub_seq) == 0:
                continue
            if sub_seq[0] == 'T':
                plus_pos.append(pos)
            elif sub_seq[0] == 'A':
                minus_pos.append(pos)
            pos += len(sub_seq)

        test_d['+'][chrom] = plus_pos
        test_d['-'][chrom] = minus_pos

# Use str.find instead of iteration
def use_builtins():
    test_d = { '+' : {}, '-' : {}}
    for chrom in seq_d:
        plus_pos, minus_pos = [], []
        chrom_seq = seq_d[chrom]
        start = 0
        while True:
            pos = chrom_seq.find("T", start)
            if pos == -1:
                break
            plus_pos.append(pos)
            start = pos + 1

        start = 0
        while True:
            pos = chrom_seq.find("A", start)
            if pos == -1:
                break
            minus_pos.append(pos)
            start = pos + 1

        test_d['+'][chrom] = plus_pos
        test_d['-'][chrom] = minus_pos

# Original numpy implementation
def numpy1():
    test_d = { '+' : {}, '-' : {}}
    for chrom in seq_d:
        chrom_seq = np.array(list(seq_d[chrom]))
        for_nts = list(nts)
        rev_nts = [rev_comps[nt] for nt in nts]
        plus_pos = list(np.where(np.in1d(chrom_seq, for_nts) == True)[0])
        minus_pos = list(np.where(np.in1d(chrom_seq, rev_nts) == True)[0])

        test_d['+'][chrom] = plus_pos
        test_d['-'][chrom] = minus_pos

# Optimized for single character look-ups
def numpy2():
    test_d = {'+': {}, '-': {}}
    rev_nts = rev_comps[nts]
    for chrom in seq_d:
        chrom_seq = np.array(list(seq_d[chrom]))
        plus_pos = np.where(chrom_seq == nts)
        minus_pos = np.where(chrom_seq == rev_nts)

        test_d['+'][chrom] = plus_pos
        test_d['-'][chrom] = minus_pos

for fn, name in [
        (serial, "Original serial"),
        (set_style, "Using sets"),
        (use_builtins, "Using built-ins"),
        (regex_it, "Using regex"),
        (serial2, "Optimized serial"),
        (numpy1, "Original numpy"),
        (numpy2, "Optimized numpy")]:
    s = time.time()
    fn()
    e = time.time()
    print('{} took {} minutes...'.format(name, (e-s)/60))

票数 3

Stack Overflow用户

发布于 2016-09-03 20:44:17

使用内置

不要手动迭代您的长字符串，而是尝试str.find或str.index。不要自己分割字符串，使用这些方法的内置切片。

这也抛弃了enumerate-ing，尽管这不应该是昂贵的。

另外，您可以使用set来存储索引，而不是列表--添加可能更快。

不过，你得做两次才能找到你的核苷酸和它的补体。当然，在循环之外查找补码。

尝试正则表达式

您也可以尝试正则表达式来做同样的事情(如果您要尝试这一点，请同时尝试2 regex (用于"T“和"A")，另一种用于”T_x A“)。

而且，与其做

for chrom in seq_d:

你可以

for chromosome_number, chomosome in seq_d.items():

这与性能无关，但使代码更具可读性。

票数 3

Stack Overflow用户

发布于 2016-09-03 21:13:19

找到最长的染色体串，并创建一个空数组，每条染色体一行，并在字典中列到最长的。然后将每个染色体放在自己的行中，在那里可以在整个数组上调用np.where。

import numpy as np

longest_chrom = max([len(x) for x in seq_d.values()])
genome_data = np.zeros((len(seq_d), dtype=str, longest_chrom)
for index, entry in enumerate(seq_d):
    gen_string = list(seq_d[entry]
    genome_data[index, :len(gen_string) + 1] =  gen_string
nuc_search = "T"
nuc_locs = np.where(genome_data == nuc_search)

使用这种方法，对于>1核苷酸的字符串，它略有不同，但我确信，它只需要几个步骤就可以实现。

票数 0

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/39310956

复制

相似问题

问在python中查找大字符串中所有子字符串的最快方法是什么？
EN

回答 3

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问在python中查找大字符串中所有子字符串的最快方法是什么？EN

回答 3

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问在python中查找大字符串中所有子字符串的最快方法是什么？
EN