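I have been stuck on this for quite a while and would appreciate some hints.
The problem boils down to finding the longest run of consecutive occurrences of a pattern in a string. With the pattern AATG and a string such as ATAATGAATGAATGGAATG, the correct result is 3. I tried using re.compile() to count occurrences of the pattern. From the documentation I gathered that to match consecutive repeats I need the special character +. For a pattern like AATG, that means re.compile(r'(AATG)+') rather than re.compile(r'AATG'); otherwise every single occurrence is counted separately and the result is too high. In this program, however, the pattern is not a fixed string; it is held in a variable. I tried many ways of putting it into re.compile() without success. Can anyone tell me the correct way to format it (in the function countSTR below)?
After that, I expect finditer(the_string_to_be_analysis) to return an iterator over all the matches found. I then use match.end() - match.start() to get the length of each match and compare the lengths to find the longest consecutive run of the pattern. Maybe something is wrong there?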
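To make the idea concrete, here is a small sketch using the fixed pattern and the example string from above. It only illustrates the difference between the two patterns and the end() - start() arithmetic; the variable names are made up for this example.

```python
import re

sequence = "ATAATGAATGAATGGAATG"

# Without "+", each single AATG is its own match, so this counts every occurrence.
single = re.compile(r"AATG")
total = len(single.findall(sequence))

# With "(...)+", each run of back-to-back repeats is reported as one match.
runs = re.compile(r"(AATG)+")
run_lengths = [m.end() - m.start() for m in runs.finditer(sequence)]

# The longest run's length in characters, divided by the pattern length,
# gives the number of consecutive repeats.
longest = max(run_lengths, default=0) // len("AATG")

print(total, run_lengths, longest)  # prints: 4 [12, 4] 3
```

The string contains four occurrences in total, but only three of them are adjacent, which is why the plain pattern overcounts.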
The code is attached below. Any advice is much appreciated!
from sys import argv, exit
import csv
import re


def main():
    if len(argv) != 3:
        print("Usage: python dna.py data.csv sequence.txt")
        exit(1)

    # read DNA sequence
    with open(argv[2], "r") as file:
        if file.mode != 'r':
            print(f"database {argv[2]} can not be read")
            exit(1)
        sequence = file.read()

    # read database.csv
    with open(argv[1], newline='') as file:
        if file.mode != 'r':
            print(f"database {argv[1]} can not be read")
            exit(1)
        # get the heading of the csv file in order to obtain STRs
        csv_reader = csv.reader(file)
        headings = next(csv_reader)

    # dictionary to store STRs match result of DNA-sequence
    STR_counter = {}
    for STR in headings[1::]:
        # entry result accounting to the STR keys
        STR_counter[STR] = countSTR(STR, sequence)

    # read csv file as a dictionary
    with open(argv[1], newline='') as file:
        database = csv.DictReader(file)
        for row in database:
            count = 0
            for STR in STR_counter:
                # print("row in database ", row[STR], "STR in STR_counter", STR_counter[STR])
                if int(row[STR]) == int(STR_counter[STR]):
                    count += 1
            if count == len(STR_counter):
                print(row['name'])
                exit(0)
        else:
            print("No match")
# find non-overlapping occurrences of STR in DNA-sequence
def countSTR(STR, sequence):
    count = 0
    maxcount = 0
    # in order to match repeat STR. for example: "('AATG')+" as pattern
    # into re.compile() to match repeat STR
    # rewrite STR to "(STR)+"
    STR = "(" + STR + ")+"
    pattern = re.compile(r'STR')
    # matches should be an iterator object
    matches = pattern.finditer(sequence)
    # go through every repeat and find the longest one
    # by match.end() - match.start()
    for match in matches:
        count = match.end() - match.start()
        if count > maxcount:
            maxcount = count
    # return repeat times of the longest repeat
    return maxcount/len(STR)


main()

Posted on 2020-06-23 16:18:24
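I found a way that gives the desired result. Posting it here in case anyone else is confused by the same thing. As far as I can tell, to match a pattern held in a variable called var_pattern, you can use re.compile(rf'{var_pattern}'). Then, to search for consecutive occurrences of var_pattern, you can use re.compile(rf'({var_pattern})+'); the braces are required, since without them the regex matches the literal text var_pattern. There may be smarter ways to do this, but this got it working for me.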
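For reference, a minimal sketch of how the counting function might look with this fix applied. The name longest_run and the use of re.escape are assumptions added for illustration and are not part of the original code. The sketch also divides by the length of the original pattern: in the question's code, len(STR) is taken after STR has been rewritten to "(...)+", which makes it three characters too long.

```python
import re


def longest_run(unit, sequence):
    """Return the largest number of back-to-back repeats of unit in sequence.

    Hypothetical helper, sketched for illustration only.
    """
    if not unit:
        return 0
    # re.escape is an added precaution in case unit ever contains regex
    # metacharacters; for plain DNA letters it changes nothing.
    pattern = re.compile(rf"({re.escape(unit)})+")
    # Length in characters of the longest run, or 0 if there is no match.
    longest = max((m.end() - m.start() for m in pattern.finditer(sequence)), default=0)
    # Divide by the length of the unit itself, not of the "(...)+" wrapper.
    return longest // len(unit)


print(longest_run("AATG", "ATAATGAATGAATGGAATG"))  # prints: 3
```

Integer division keeps the result an int, so it can be compared directly with the counts read from the CSV file.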
https://stackoverflow.com/questions/62513515