I am trying to solve a bioinformatics problem from a Stepic course.
The problem posed is: find clumps of identical patterns within a longer genome.
Motivation: identifying 3 identical patterns within a window of length 500 in the genome allows biologists to locate the "origin of replication" of a bacterial genome. This origin of replication is required for the bacterium to clone itself, so it is very important.
Using Python, I wrote a solution that handles small test datasets efficiently, but it is not efficient on real-life genomes. Below I describe the approach used.
I used two core functions.
This function finds all starting positions of a short pattern within a longer string.
```python
def patternOccurances2(text="GATATATGCATATACTT", pattern="ATAT"):
    occurances = []
    position = -1
    position = text.find(pattern, position + 1)
    occurances.append(position)
    while position != -1:
        position = text.find(pattern, position + 1)
        occurances.append(position)
    return occurances[:-1]
```

The second function finds all patterns of length k in a string that occur at least t times, or, if t is omitted, only the most frequent patterns of length k.
```python
def frequentPatterns(text="", k=2, numberOfOccurances=None):
    # Frequent Words Problem: Find the frequent k-mers in a string.
    #   Input: A string text, an integer k, and a number of occurrences t
    #   Output: All most frequent k-mers in text.
    words = {}
    frequentPatterns = []
    # create a dictionary of all patterns of length k in text with their respective frequencies
    for i in range(len(text) - k + 1):  # iterate over the valid length of the string
        a = text[i:i+k]
        if a in words:
            words[a] = words[a] + 1
        else:
            words[a] = 1
    if numberOfOccurances is None:  # gives only the most frequent strings of length k
        largestValue = 0
        for word, v in words.iteritems():
            if v > largestValue:
                frequentPatterns = [word]
                largestValue = v
            elif v == largestValue:
                frequentPatterns.append(word)
    else:  # gives all strings of length k that occur at least numberOfOccurances times
        for word, v in words.iteritems():
            if v >= numberOfOccurances:
                frequentPatterns.append(word)
    return frequentPatterns
```

The main code calls the above functions as follows:
```python
# Clump Finding Problem: Find patterns forming clumps in a string.
#   Input: A long string Genome, and integers k, L, and t.
#     k is the length of the pattern we wish to match.
#     t is the minimum number of times we wish to find this pattern in a portion of length L in the Genome.
#   Output: All distinct k-mers forming (L, t)-clumps in Genome.
f = open("E-coli.txt", 'r')  # input from a large file, 4.2MB
genome = f.read()
k = 9
L = 500
t = 3
# find all patterns of length k in genome that occur at least t times
frequentPatterns = ex213.frequentPatterns(text=genome, k=k, numberOfOccurances=t)
# for each pattern in frequentPatterns, find the locations of the pattern in the genome,
# then check whether any 3 of these occurrences are within distance L of each other
for pattern in frequentPatterns:
    locations = ex322.patternOccurances2(genome, pattern)
    for location in range(len(locations) - 2):
        if locations[location + 1] - locations[location] <= L:
            if locations[location + 2] - locations[location] <= L:
                print pattern  # 3 occurrences of the pattern are sufficiently close together
                break
f.close()
```

Measurements show that the frequentPatterns call completes quickly. That call happens only once, so optimizing it would not significantly speed up the code.
However, the outer for loop takes a very long time to complete (on the order of several hours). In particular, the patternOccurances2 calls are especially slow. That said, I am sure there are other inefficiencies in the code as well.
Other programmers have completed the same task in under 40 seconds.
For those interested, the large dataset (4.2 MB) on which these algorithms should work efficiently can be found here.
Posted on 2013-12-23 04:44:01
"Occurances" should be spelled "occurrences".
The core of your strategy is this loop from frequentPatterns():
```python
for i in range(len(text) - k + 1):  # iterate over the valid length of the string
    a = text[i:i+k]
    if a in words:
        words[a] = words[a] + 1
    else:
        words[a] = 1
```

I suggest using collections.Counter to keep track of the word counts, since that is exactly what it is for.

```python
words = collections.Counter([text[i:i+k] for i in range(len(text) - k + 1)])
```

Taking all of those substrings could be a performance concern. Furthermore, you have to find those substrings again later in patternOccurances(). If k ≤ 16, you could probably improve performance by converting the words to numbers. Since there are four nucleobases, each one requires only 2 bits of storage. You should be able to store a 16-base string in a 32-bit number.
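To make the packing concrete, here is a minimal standalone sketch of the 2-bit encoding (the `encode` helper is illustrative, using the same 'ACTG' ordering as the attempt that follows):

```python
DNA_NUCLEOBASES = 'ACTG'

def encode(kmer):
    # Pack a k-mer into an integer, using 2 bits per base:
    # A -> 00, C -> 01, T -> 10, G -> 11.
    v = 0
    for base in kmer:
        v = (v << 2) | DNA_NUCLEOBASES.index(base)
    return v

print(encode('ACTG'))    # 0b00011011 == 27
print(encode('G' * 16))  # 4294967295, the largest 16-mer value, fits in 32 bits
```

Comparing these integers is much cheaper than comparing strings, and they make compact dictionary keys.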
Here is my attempt:
```python
from array import array
from collections import Counter

DNA_NUCLEOBASES = 'ACTG'

def numerify(s, k):
    '''
    Summarizes consecutive k-length substrings of s as a list of numbers.
    s is a string consisting of letters from DNA_NUCLEOBASES.
    k is an integer, 1 <= k <= 16.
    '''
    TRANS = dict(map(lambda c: (c, DNA_NUCLEOBASES.index(c)), DNA_NUCLEOBASES))
    mask = 4 ** k - 1
    # Depending on k, we might need a byte, short, or long.
    array_type = '!BBBBHHHHLLLLLLLL'[k]
    result = array(array_type, [0] * (len(s) - (k - 1)))
    v = 0
    for i in range(len(s)):
        v = (v << 2 & mask) | TRANS[s[i]]
        if i >= k - 1:
            result[i - (k - 1)] = v
    return result.tolist()

def stringify(n, k):
    '''
    Converts a number from numerify() back into a k-length string of
    DNA_NUCLEOBASES.
    '''
    result = [DNA_NUCLEOBASES[(n >> 2 * i) & 3] for i in range(k)][::-1]
    return ''.join(result)

def find_clumps(s, k):
    '''
    Lists all repeated k-length substrings of s in descending frequency.
    Returns a list of tuples of the substring and a list of positions at
    which they occur.
    '''
    clumps = numerify(s, k)
    result = []
    for clump, occurrences in Counter(clumps).most_common():
        if occurrences <= 1:
            break
        positions = []
        i = -1
        for _ in range(occurrences):
            i = clumps.index(clump, i + 1)
            positions.append(i)
        result.append((stringify(clump, k), positions))
    return result

s = 'CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA'
k = 5
for clump, positions in find_clumps(s, k):
    print("%s occurs at %s" % (clump, positions))
```

I'll leave implementing the window of length L to you.
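One way to finish that exercise (my own sketch, not part of the answer above): given the sorted position list for one k-mer from find_clumps, check whether any t consecutive occurrences fit inside a window of length L. I use the convention that the whole k-mer must lie within the window; drop the `+ k` if only start positions should count.

```python
def has_clump(positions, t, L, k):
    # positions: sorted start indices of one k-mer in the genome.
    # An (L, t)-clump exists if some run of t consecutive occurrences
    # fits entirely inside a window of length L.
    for i in range(len(positions) - t + 1):
        if positions[i + t - 1] + k - positions[i] <= L:
            return True
    return False

print(has_clump([10, 200, 480], t=3, L=500, k=9))  # True:  480 + 9 - 10 = 479 <= 500
print(has_clump([10, 200, 600], t=3, L=500, k=9))  # False: 600 + 9 - 10 = 599 >  500
```

Because the positions are already sorted, this check is linear in the number of occurrences, so running it once per repeated k-mer is cheap.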
Posted on 2014-10-30 14:26:27
Here is my version, which takes a little over 6 seconds to run in my testing.
The main goal of this version is to avoid repeatedly re-reading the same text. In a more naive implementation, you would start with the first window of length L in the genome and check all of its contents before moving on to the next L-sized window.
However, that window really only changes by two characters: one is removed from the front and another is added at the end. This means we can keep the substring counts from the first window, decrement the counter for the first substring (since it is no longer there), and increment the counter for the new final substring. This proves much faster than iterating and counting over and over.
A further improvement might be to drop the idea of storing the L-sized window altogether, in case a future analysis involves a very large window where storing an extra copy would eat up memory.
My initial version used a Counter, but that was slower than a defaultdict, so I removed it. The list of relevant substrings is also more intuitively stored as a list, but `... in list` lookups perform poorly once the set of relevant matches grows large, so it is stored as a dictionary for better performance.
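The incremental update can be sketched in isolation (a hypothetical `slide` helper, assuming k is no larger than the window):

```python
from collections import defaultdict

def slide(counts, window, new_char, k):
    # Drop the k-mer leaving at the front, then shift the window by one
    # character and count the k-mer entering at the back.
    counts[window[:k]] -= 1
    window = window[1:] + new_char
    counts[window[-k:]] += 1
    return counts, window

k = 2
window = "ACGTA"
counts = defaultdict(int)
for i in range(len(window) - k + 1):  # initial full count of the first window
    counts[window[i:i + k]] += 1

counts, window = slide(counts, window, "C", k)
print(window)        # CGTAC
print(counts["AC"])  # 1: one "AC" left the front, a new one entered at the back
```

Only two counter entries are touched per step, instead of re-counting all L - k + 1 substrings of the window.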
```python
#!/usr/bin/env python

from collections import defaultdict

with open('E-coli.txt', 'r') as f:
    genome = f.read().rstrip()

kmer = 9           # Length of the k-mer
windowsize = 500   # genome substring length to register clumps in
min_clumpsize = 3  # minimum number of repetitions of the k-mer

# find substrings at least kmer long that occur at least min_clumpsize
# times in a window of windowsize in genome

def get_substrings(g, k):
    """
    Take the input genome window 'g', and produce a count of each
    substring of length 'k' contained within it.
    """
    substrings = list()
    # Start from the first character, split into 'k' size chunks.
    # Move along one character and repeat. No sense carrying on beyond
    # a starting point of 'k' since that will be the first iteration again.
    for i in range(k):
        line = g[i:]
        substrings += [line[i:i + k]
                       for i in range(0, len(line), k) if i + k <= len(line)]
    # Using collections.Counter increased the runtime by about 3 seconds
    # during testing.
    results = defaultdict(int)
    for s in substrings:
        results[s] += 1
    return results

def find_clumps(genome, kmer, windowsize, clumpsize):
    """
    In a given genome, examines each windowsize section for strings of
    length kmer that occur at least clumpsize times.

    Input:
        genome: text string to search
        kmer: length of string to search for
        windowsize: size of the genome section to consider for clumping
        clumpsize: the kmer length strings must occur at least this many times

    Returns: a list of the strings that clump
    """
    window = genome[0:windowsize]
    # Initialise our counter, because the main algorithm can't start from
    # scratch.
    patterns = get_substrings(window, kmer)
    # Using a dictionary, not a list, because the lookups are faster once
    # the size of the object becomes large
    relevant = {p: 1 for p in patterns if patterns[p] >= clumpsize}
    starting_string = genome[0:kmer]
    for i in range(windowsize, len(genome)):
        # Move the window along one character
        window = window[1:]
        window += genome[i]
        # This is the only string whose count can decrease when we've
        # moved one character
        patterns[starting_string] -= 1
        starting_string = window[0:kmer]
        # This is the only string whose count can increase when we've
        # moved one character
        ending_string = window[-kmer:]
        patterns[ending_string] += 1
        # if there are enough matches of the string at the end, add it to
        # matches.
        if patterns[ending_string] >= clumpsize and ending_string not in relevant:
            relevant[ending_string] = 1
    return list(relevant)

if __name__ == "__main__":
    clumps = find_clumps(genome, kmer, windowsize, min_clumpsize)
    print("Total: {}".format(len(clumps)))
```

Posted on 2013-12-23 08:21:20
Finding frequently occurring patterns in one string is a special case of finding all patterns that occur more than a specified number of times across multiple strings. (Just set the specified number of occurrences to the frequency threshold, and the number of search strings to 1.)
Finding all patterns that occur more than a specified number of times across multiple strings is addressed here. That link contains C++ solutions of increasing sophistication and speed, which find all the repeated patterns I needed in the strings I was processing. I think all of those solutions can easily be adapted to your specific problem. If they cannot...
I have heard that bioinformaticians use compressed suffix trees to solve this kind of problem. You may find some even faster solutions by investigating along that route.
https://codereview.stackexchange.com/questions/37932