我有两个fasta文件,我想要匹配较短的序列,在FileB.fasta和原始序列是在FileA.fasta中,以获得它的坐标或位置。但我的输出格式不正确。有谁可以帮我?
FileA.fasta
>chr1:2000-2019
ACGTCGATCGGTCGACGTGCFileB.fasta
>chr1:2000-2019
GATCGGFileC.bed
chr1:2000-2019 6 11码
from Bio import SeqIO
output_file = open('fileC.bed','w')
for long_sequence_record in SeqIO.parse(open('fileA.fasta'), 'fasta'):
long_sequence = str(long_sequence_record.seq)
for short_sequence_record in SeqIO.parse(open('fileB.fasta'), 'fasta'):
short_sequence = str(short_sequence_record.seq)
if short_sequence in long_sequence:
start = long_sequence.index(short_sequence) + 1
stop = start + len(short_sequence) - 1
# print short_sequence_record.id, start, stop
output_line ='%s\t%i\t%i\n' % \
(short_sequence_record.id,start,stop)
output_file.write(output_line )
output_file.close()所需的FileC.bed输出
chr1 2005 2011发布于 2015-07-06 06:24:40
好吧,你在索引中加了一个1,在索引中找到了较短的序列-
start = long_sequence.index(short_sequence) + 1 <--- notice the +1别那么做,应该没事的。也不要对-1变量执行stop操作。
相反,您应该从id中添加起始序列号。
例子-
start = long_sequence.index(short_sequence) + int((short_sequence_record.id.split(':')[1].split('-')[0]))
stop = start + len(short_sequence)对于记录的id,如果在:之前不需要任何内容,则应该在:处拆分id并取左部分(拆分后为0索引字符串)。
例子-
output_line ='%s\t%i\t%i\n' % \
((short_sequence_record.id.split(':')[0]),start,stop)
output_file.write(output_line )发布于 2015-07-06 06:30:46
一个更普遍的解决办法:
https://stackoverflow.com/questions/31238793
复制相似问题