我刚接触Biopython (和一般的编码),正在尝试编写一种方法,在一个单独的FASTA文件中将一系列DNA序列(超过80个)转换为蛋白质序列。我还想在正确的阅读框架中找到序列。
这是我到目前为止所知道的:
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
for record in SeqIO.parse("dnaseq.fasta", "fasta"):
protein_id = record.id
protein1 = record.seq.translate(to_stop=True)
protein2 = record.seq[1:].translate(to_stop=True)
protein3 = record.seq[2:].translate(to_stop=True)
if len(protein1) > len(protein2) and len(protein1) > len(protein3):
protein = protein1
elif len(protein2) > len(protein1) and len(protein2) > len(protein3):
protein = protein2
else:
protein = protein3
def prot_record(record):
return SeqRecord(seq = protein, \
id = ">" + protein_id, \
description = "translated sequence")
records = map(prot_record, SeqIO.parse("dnaseq.fasta", "fasta"))
SeqIO.write(records, "AAseq.fasta", "fasta")我当前代码的问题是,虽然它似乎可以工作,但它只给出了输入文件的最后一个序列。有人能帮我弄清楚怎么写所有的序列吗?
谢谢!
发布于 2018-03-15 15:39:55
正如其他人所提到的,在尝试写入结果之前,您的代码将遍历整个输入。我想建议如何使用流媒体方法来实现这一点:
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
with open("AAseq.fasta", 'w') as aa_fa:
for dna_record in SeqIO.parse("dnaseq.fasta", 'fasta'):
# use both fwd and rev sequences
dna_seqs = [dna_record.seq, dna_record.seq.reverse_complement()]
# generate all translation frames
aa_seqs = (s[i:].translate(to_stop=True) for i in range(3) for s in dna_seqs)
# select the longest one
max_aa = max(aa_seqs, key=len)
# write new record
aa_record = SeqRecord(max_aa, id=dna_record.id, description="translated sequence")
SeqIO.write(aa_record, aa_fa, 'fasta')这里的主要改进是:
帧的支持,这些帧是通过生成器理解创建的,并且只存储最长长度的帧。
if...elif...else来避免max结构。发布于 2018-03-05 21:05:36
您的if在for循环之外,因此它只应用一次,使用变量和它们在上一次循环迭代结束时的值。如果您希望if在每次迭代时都发生,则需要将其缩进到与之前的代码相同的级别:
for record in SeqIO.parse("dnaseq.fasta", "fasta"):
protein_id = record.id
protein1 = record.seq.translate(to_stop=True)
protein2 = record.seq[1:].translate(to_stop=True)
protein3 = record.seq[2:].translate(to_stop=True)
# Same indentation level, still in the loop
if len(protein1) > len(protein2) and len(protein1) > len(protein3):
protein = protein1
elif len(protein2) > len(protein1) and len(protein2) > len(protein3):
protein = protein2
else:
protein = protein3函数prot_record使用protein和protein_id的当前值,这也是它们在for循环的最后一次迭代结束时的值。
如果我猜对了你想要什么,一种可能是把这个函数声明也放在循环中,以便函数有一个特定的行为,取决于循环的当前迭代,并将函数保存在一个列表中,供以后再次迭代记录时使用。不过,我不确定这是否有效:
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
# List of functions:
record_makers = []
for record in SeqIO.parse("dnaseq.fasta", "fasta"):
protein_id = record.id
protein1 = record.seq.translate(to_stop=True)
protein2 = record.seq[1:].translate(to_stop=True)
protein3 = record.seq[2:].translate(to_stop=True)
# still in the loop
if len(protein1) > len(protein2) and len(protein1) > len(protein3):
protein = protein1
elif len(protein2) > len(protein1) and len(protein2) > len(protein3):
protein = protein2
else:
protein = protein3
# still in the loop
def prot_record(record):
return SeqRecord(seq = protein, \
id = ">" + protein_id, \
description = "translated sequence")
record_makers.append(prot_record)
# zip the functions and the records together instead of
# mapping one single function to all the records
records = [record_maker(record) for (
record_maker, record) in zip(
record_makers, SeqIO.parse("dnaseq.fasta", "fasta"))
SeqIO.write(records, "AAseq.fasta", "fasta")]另一种可能的方法是将翻译逻辑放入记录制作函数中:
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
def find_translation(record):
protein1 = record.seq.translate(to_stop=True)
protein2 = record.seq[1:].translate(to_stop=True)
protein3 = record.seq[2:].translate(to_stop=True)
if len(protein1) > len(protein2) and len(protein1) > len(protein3):
protein = protein1
elif len(protein2) > len(protein1) and len(protein2) > len(protein3):
protein = protein2
else:
protein = protein3
return protein
def prot_record(record):
protein = find_translation(record)
# By the way: no need for backslashes here
return SeqRecord(seq = protein,
id = ">" + record.id,
description = "translated sequence")
records = map(prot_record, SeqIO.parse("dnaseq.fasta", "fasta"))
SeqIO.write(records, "AAseq.fasta", "fasta")]这可能是更干净的。我还没测试过。
https://stackoverflow.com/questions/49073217
复制相似问题