我是Python的新手,已经花了很多时间来解决这个问题,希望有人能帮助我。我需要找到两个序列之间的重叠。重叠在第一个序列的左端和第二个序列的右端。我想让函数找到重叠部分,并返回它。
我的序列是:
s1 = "CGATTCCAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC"
s2 = "GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGC"我的函数应该命名为
def getOverlap(left, right)其中s1是左序列,s2是右序列。
结果应该是
‘GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC’任何帮助我们都将不胜感激
发布于 2013-01-03 04:46:54
看一看difflib库,更准确地说是find_longest_match()
import difflib
def get_overlap(s1, s2):
s = difflib.SequenceMatcher(None, s1, s2)
pos_a, pos_b, size = s.find_longest_match(0, len(s1), 0, len(s2))
return s1[pos_a:pos_a+size]
s1 = "CGATTCCAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC"
s2 = "GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGC"
print(get_overlap(s1, s2)) # GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC发布于 2013-01-03 04:45:31
您可以使用difflib.SequenceMatcher
d = difflib.SequenceMatcher(None,s1,s2)
>>> match = max(d.get_matching_blocks(),key=lambda x:x[2])
>>> match
Match(a=8, b=0, size=39)
>>> i,j,k = match
>>> d.a[i:i+k]
'GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC'
>>> d.a[i:i+k] == d.b[j:j+k]
True
>>> d.a == s1
True
>>> d.b == s2
True发布于 2013-01-03 04:43:50
Knuth-Morris-Pratt算法是一种很好的方法,可以在一个字符串中找到另一个字符串(自从我看到DNA以来,我猜你会希望这个算法能扩展到...数十亿?)。
# Knuth-Morris-Pratt string matching
# David Eppstein, UC Irvine, 1 Mar 2002
from __future__ import generators
def KnuthMorrisPratt(text, pattern):
'''Yields all starting positions of copies of the pattern in the text.
Calling conventions are similar to string.find, but its arguments can be
lists or iterators, not just strings, it returns all matches, not just
the first one, and it does not need the whole text in memory at once.
Whenever it yields, it will have read the text exactly up to and including
the match that caused the yield.'''
# allow indexing into pattern and protect against change during yield
pattern = list(pattern)
# build table of shift amounts
shifts = [1] * (len(pattern) + 1)
shift = 1
for pos in range(len(pattern)):
while shift <= pos and pattern[pos] != pattern[pos-shift]:
shift += shifts[pos-shift]
shifts[pos+1] = shift
# do the actual search
startPos = 0
matchLen = 0
for c in text:
while matchLen == len(pattern) or \
matchLen >= 0 and pattern[matchLen] != c:
startPos += shifts[matchLen]
matchLen -= shifts[matchLen]
matchLen += 1
if matchLen == len(pattern):
yield startPosThe link where I got the KMP python code (和一个内置的,由于运行时常量,对于小问题会更快)。
为了获得最先进的性能,请使用前缀表和字符串的散列窗口作为基数4的整数(在生物学中,您可以将它们称为k-mer或oligos)。;)
祝好运!
编辑:还有一个很好的技巧,可以对包含第一个字符串中的每个前缀(总共n个)和第二个字符串中的每个前缀(总共n个)的列表进行排序。如果它们共享最大的公共子序列,那么它们在排序列表中必须是相邻的,所以从排序列表中最接近的另一个字符串中找到元素,然后取完全匹配的最长前缀。:)
https://stackoverflow.com/questions/14128763
复制相似问题