我正在使用Biopython在fasta文件中查找与包含选定ID的.txt文件中的ID匹配的序列。当手动搜索fasta文件中的ID名称时,我确实得到了匹配结果,但以下脚本没有找到/提取任何序列:
#!/usr/bin/env python3
from Bio import SeqIO
wanted_ids = "transcript.orthogroup7.txt"
input_filename = "hq_isoseq_transcripts.fasta"
output_filename = "wanted_hq_isoseq_transcripts.fasta"
count = 0
total = 0
output_handle = open(output_filename, "w")
for record in SeqIO.parse(input_filename, "fasta"):
total = total + 1
if record.id in wanted_ids:
count = count + 1
SeqIO.write(record, output_handle, "fasta")
output_handle.close()
print(str(count) + " records selected out of " + str(total))下面是.fasta文件的摘录
>20190807-P1_PBI19-RO01_8pM_HQ_transcript/0
GGTACGTATAGGAATTATAAGGGCTCATTGAGACAGTCTGAGCAAAAAAGTCTCGTTCAG
GCTCGCTAAAAGCCTTTGGTGACGCATCTGTCCGTGAGCGAGTCACATTTTGTGAGCGAG
TCACGAATCTCGGCGCCGGCGAATCATGAATCTCCGTAACAGTGAGTCGCGAATCTCCTA
AGCCGTGTGTTTTGCGCACGTCACTTCATTCCTCGGTACAGGGCTTCCTCTTAACAGGAA ... and so on
>20190807-P1_PBI19-RO01_8pM_HQ_transcript/1
GACTATATAGGAATTATAAGGGCTCATTGAGACAGTCTGAGCAAAAAAGTCTCGTTCAGG
CTCGCTAAAAGCCTTTGGTGACGCATCTGTCCGTGAGCGAGTCACATTTTGTGAGCGAGT
CACGAATCTCGGCGCCGGCGAATCATGAATCTCCGTAACAGTGAGTCGCGAATCTCCTAA
GCCGTGTGTTTTTGCGCACGTCACTTCATTCCTCGGTACAGGGCTTCCTCTCAGCAGGAA ... and so on 和我的ID .txt文件
>20190807-P1_PBI19-RO01_8pM_HQ_transcript/0
>20190807-P1_PBI19-RO01_8pM_HQ_transcript/1我在conda环境中执行该脚本,得到以下意外消息
(biopython_transcripts) $ python3 script.py
0 records selected out of 35629例如,当将"count“和"total”更改为1000时,我得到
(biopython_transcripts) $ python3 script.py
1000 records selected out of 35629所以脚本运行,但不提取任何序列,尽管我确信序列存在,并且.txt和.fasta文件中的名称完全匹配。
我对Biopython以外的其他解决方案持开放态度,但如果能帮助我理解为什么这不起作用,以及可能的解决方案,我会非常感激。
发布于 2021-11-17 10:45:18
您需要解析具有所需ids的文件,并构建要保留的ids列表。目前,您的代码根据所需ids的文件名检查ids (即,只有当ids是transcript.orthogroup7.txt的子字符串时才匹配)
以下是您的代码的修改:
from Bio import SeqIO
wanted_filename = "transcript.orthogroup7.txt" ## changed name
input_filename = "hq_isoseq_transcripts.fasta"
output_filename = "wanted_hq_isoseq_transcripts.fasta"
count = 0
total = 0
output_handle = open(output_filename, "w")
with open(wanted_filename) as f: ## added lines to
wanted_ids = [l[1:-1] for l in f] ## build list of ids
## now wanted_ids = ['20190807-P1_PBI19-RO01_8pM_HQ_transcript/0', '20190807-P1_PBI19-RO01_8pM_HQ_transcript/1']
for record in SeqIO.parse(input_filename, "fasta"):
total = total + 1
if record.id in wanted_ids:
count = count + 1
SeqIO.write(record, output_handle, "fasta")
output_handle.close()
print(str(count) + " records selected out of " + str(total))https://stackoverflow.com/questions/70002399
复制相似问题