文章/答案/技术大牛

发布

社区首页 >问答首页 >如何使用Biopython翻译FASTA文件中的一系列DNA序列，并将蛋白质序列提取到单独的字段中？

问如何使用Biopython翻译FASTA文件中的一系列DNA序列，并将蛋白质序列提取到单独的字段中？
EN

Stack Overflow用户

提问于 2018-03-03 00:22:50

回答 2查看 3.5K关注 0票数 2

我刚接触Biopython (和一般的编码)，正在尝试编写一种方法，在一个单独的FASTA文件中将一系列DNA序列(超过80个)转换为蛋白质序列。我还想在正确的阅读框架中找到序列。

这是我到目前为止所知道的：

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

for record in SeqIO.parse("dnaseq.fasta", "fasta"):
    protein_id = record.id
    protein1 = record.seq.translate(to_stop=True)
    protein2 = record.seq[1:].translate(to_stop=True)
    protein3 = record.seq[2:].translate(to_stop=True)

if len(protein1) > len(protein2) and len(protein1) > len(protein3):
    protein = protein1
elif len(protein2) > len(protein1) and len(protein2) > len(protein3):
    protein = protein2
else:
    protein = protein3

def prot_record(record):
    return SeqRecord(seq = protein, \
             id = ">" + protein_id, \
             description = "translated sequence")

records = map(prot_record, SeqIO.parse("dnaseq.fasta", "fasta"))
SeqIO.write(records, "AAseq.fasta", "fasta")

我当前代码的问题是，虽然它似乎可以工作，但它只给出了输入文件的最后一个序列。有人能帮我弄清楚怎么写所有的序列吗？

谢谢!

python

parsing

bioinformatics

biopython

fasta

回答 2

Stack Overflow用户

发布于 2018-03-15 15:39:55

正如其他人所提到的，在尝试写入结果之前，您的代码将遍历整个输入。我想建议如何使用流媒体方法来实现这一点：

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

with open("AAseq.fasta", 'w') as aa_fa:
    for dna_record in SeqIO.parse("dnaseq.fasta", 'fasta'):
        # use both fwd and rev sequences
        dna_seqs = [dna_record.seq, dna_record.seq.reverse_complement()]

        # generate all translation frames
        aa_seqs = (s[i:].translate(to_stop=True) for i in range(3) for s in dna_seqs)

        # select the longest one
        max_aa = max(aa_seqs, key=len)

        # write new record
        aa_record = SeqRecord(max_aa, id=dna_record.id, description="translated sequence")
        SeqIO.write(aa_record, aa_fa, 'fasta')

这里的主要改进是：

在每次迭代中转换并输出单个记录，从而最大限度地减少内存使用。
添加了对反向

帧的支持，这些帧是通过生成器理解创建的，并且只存储最长长度的帧。

通过使用带键的if...elif...else来避免max结构。

票数 6

Stack Overflow用户

发布于 2018-03-05 21:05:36

您的if在for循环之外，因此它只应用一次，使用变量和它们在上一次循环迭代结束时的值。如果您希望if在每次迭代时都发生，则需要将其缩进到与之前的代码相同的级别：

for record in SeqIO.parse("dnaseq.fasta", "fasta"):
    protein_id = record.id
    protein1 = record.seq.translate(to_stop=True)
    protein2 = record.seq[1:].translate(to_stop=True)
    protein3 = record.seq[2:].translate(to_stop=True)
    # Same indentation level, still in the loop
    if len(protein1) > len(protein2) and len(protein1) > len(protein3):
        protein = protein1
    elif len(protein2) > len(protein1) and len(protein2) > len(protein3):
        protein = protein2
    else:
        protein = protein3

函数prot_record使用protein和protein_id的当前值，这也是它们在for循环的最后一次迭代结束时的值。

如果我猜对了你想要什么，一种可能是把这个函数声明也放在循环中，以便函数有一个特定的行为，取决于循环的当前迭代，并将函数保存在一个列表中，供以后再次迭代记录时使用。不过，我不确定这是否有效：

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

# List of functions:
record_makers = []
for record in SeqIO.parse("dnaseq.fasta", "fasta"):
    protein_id = record.id
    protein1 = record.seq.translate(to_stop=True)
    protein2 = record.seq[1:].translate(to_stop=True)
    protein3 = record.seq[2:].translate(to_stop=True)
    # still in the loop
    if len(protein1) > len(protein2) and len(protein1) > len(protein3):
        protein = protein1
    elif len(protein2) > len(protein1) and len(protein2) > len(protein3):
        protein = protein2
    else:
        protein = protein3
    # still in the loop
    def prot_record(record):
        return SeqRecord(seq = protein, \
                 id = ">" + protein_id, \
                 description = "translated sequence")
    record_makers.append(prot_record)

# zip the functions and the records together instead of
# mapping one single function to all the records
records = [record_maker(record) for (
    record_maker, record) in zip(
        record_makers, SeqIO.parse("dnaseq.fasta", "fasta"))
SeqIO.write(records, "AAseq.fasta", "fasta")]

另一种可能的方法是将翻译逻辑放入记录制作函数中：

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

def find_translation(record):
    protein1 = record.seq.translate(to_stop=True)
    protein2 = record.seq[1:].translate(to_stop=True)
    protein3 = record.seq[2:].translate(to_stop=True)

    if len(protein1) > len(protein2) and len(protein1) > len(protein3):
        protein = protein1
    elif len(protein2) > len(protein1) and len(protein2) > len(protein3):
        protein = protein2
    else:
        protein = protein3
    return protein

def prot_record(record):
    protein = find_translation(record)
    # By the way: no need for backslashes here
    return SeqRecord(seq = protein,
                     id = ">" + record.id,
                     description = "translated sequence")

records = map(prot_record, SeqIO.parse("dnaseq.fasta", "fasta"))
SeqIO.write(records, "AAseq.fasta", "fasta")]

这可能是更干净的。我还没测试过。

票数 2

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/49073217

复制

相似问题

问如何使用Biopython翻译FASTA文件中的一系列DNA序列，并将蛋白质序列提取到单独的字段中？
EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问如何使用Biopython翻译FASTA文件中的一系列DNA序列，并将蛋白质序列提取到单独的字段中？EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问如何使用Biopython翻译FASTA文件中的一系列DNA序列，并将蛋白质序列提取到单独的字段中？
EN