首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >从蛋白质组fasta文件中计算蛋白质的平均长度

从蛋白质组fasta文件中计算蛋白质的平均长度
EN

Code Review用户
提问于 2018-10-29 15:58:54
回答 2查看 450关注 0票数 2

这个代码计算的是一些微生物蛋白质组的平均蛋白质长度(氨基酸数)。

从目前的情况来看,代码需要很长时间才能运行,我认为我在某个地方效率很低,但不能100%确定在哪里运行。

我运行代码的方式是使用bash-循环(对每个特定文件类型的示例(如*.faa)使用find )如下:

代码语言:javascript
复制
for FAA in $(find . -name "*.faa")
do
    python proteinlengthgen.py $FAA
done

My代码:

代码语言:javascript
复制
from Bio import SeqIO
import sys
from collections import Counter

def species_name_function(infile):
    for record in SeqIO.parse(infile, "fasta"):
        speciesname = record.description.split('[', 1)[1].split(']', 1)[0]
    return speciesname

if __name__ == '__main__':
    frequency = []
    infile = sys.argv[1]
    avgproteinlength = 0
    totalproteinlength = 0
    fastacounter = 0

    for fasta in SeqIO.parse(open(infile), "fasta"):
        dna = str(fasta.seq)
        freq = Counter(dna)
        fastacounter = fastacounter + 1
        frequency.append(freq)
        species_name = species_name_function(infile)

        genomesize = freq['G'] + freq['A'] + freq['L'] + freq['M'] + freq['F'] + freq['W'] + freq['K'] + freq['Q'] + freq['E'] + freq['S'] + freq['P'] + freq['V'] + freq['I'] + freq['C'] + freq['Y'] + freq['H'] + freq['R'] + freq['N'] + freq['D'] + freq['T']

        listofbases = ['G', 'A', 'L', 'M', 'F', 'W', 'K', 'Q', 'E', 'S', 'P', 'V', 'I', 'C', 'Y', 'H', 'R', 'N', 'D', 'T']

        totalproteinlength = genomesize + totalproteinlength

    avgproteinlength = totalproteinlength / fastacounter

    towrite = str(avgproteinlength) + '\t' + species_name + '\t' + '\n'

    with open("proteinlength.csv", "a") as myfile:
        myfile.write(towrite)

输入文件子集的Example (我不能使用整个文件,因为它们非常大):

代码语言:javascript
复制
>WP_013179448.1 DNA-directed RNA polymerase [Methanococcus voltae]
MYKILTIEDTIRIPPKMFGNPLKDNVQKVLMEKYEGILDKDLGFILAIEDIDQISEGDIIYGDGAAYHDTTFNILTYEPE
VHEMIEGEIVDIVEFGAFIRLGPLDGLIHISQVMDDYVAFDPQREAIIGKETGKVLEKGDKVRARIVAVSLKEDRKRGSK
IALTMRQPALGKLEWLEDEKLETMENAEF
>WP_013179449.1 DNA-directed RNA polymerase subunit E'' [Methanococcus voltae]
MARKGLKACTKCNYITHDDFCPICQHETSENIRGLLIILDPVNSEVAKIAQKDIKGKYALSVK
>WP_013179451.1 30S ribosomal protein S24e [Methanococcus voltae]
MDIKVVSEKNNPLLGRKEVKFALKYEGATPAVKDVKMKLVAILNANKELLVIDELAQEFGKMEANGYAKIYESEEAMNSI
EKKSIIEKNKIVEEAEEAQE
>WP_013179452.1 30S ribosomal protein S27ae [Methanococcus voltae]
MAQKTKKSDYYKIDGDKVERLKKSCPKCGEGVFMAEHLNRFACGKCGYMEYKKNEKAEKEE
>WP_013179453.1 hypothetical protein [Methanococcus voltae]
MNELNELKNPDKIDGKNNNTKNNNNNNNKDSNTENSITEIIKADNETQDNLSDLCVLEDIKTLKSKYKVYKTSKYLTKKD
INNIIEKDYDEIIMPQSIYKLLNEKNKSSMEKLRLCGIIVKTTDNVGRPKKITKYDKDKIKELLVDGKSVRKTAEIMDMK
KTTVWENIKDCMNEIKIEKFRKMIYEYKELLIMQERYGSYVESLFLELDIYINNEDMENALEILNKIIIYVKSEDKKD
>WP_013179454.1 integrase [Methanococcus voltae]
MKNKRINNNQKSKWETMRTDVINTQRNQNINSKNKQYRVKKHYCKEWLTKEELKVFIETIEYSEHKLFFKMLYGMALRVS
ELLKIKVQDLQLKEGVCKLWDTKTDYFQVCLIPDWLINDIKEYIALKSLDSSQELFKFNNRKYVWELAKKYSKMAELDKD
ISTHTFRRSRALHLLNDGVPLEKVSKYLRHKSIGTTMSYIRITVVDLKQELDKIDDWYEL
>WP_013179455.1 hypothetical protein [Methanococcus voltae]
MNTQNAIKKTLKTSKVNKNISNVIIGYSAILLDTYSNNKNLLLVKYDKLFKGFLNSSSITEKQYNKLYDTVLNSLF
>WP_013179456.1 hypothetical protein [Methanococcus voltae]
MVVKLVKISNGGYVSSLELKRINDIILSQLTNEFTIKDIVNMYSNKYDDCNNNAIAQKTRRLLNNHIESGVFTVRNALKN
KKIYKFKDVFVPASAGDTNTSLLFYSTSMKNSNHIEKQKKNNNKYNTNVNKPTITPDQIRVMAGIVNNPQIKSLKKERFK
SILHLNCKHMLNEEDRTELLENFKEYIIKASSQNLVLERTRYHKNKPKYITFPYLTRFTNSKQLKRQLAQYNCIFEQKAI
KYNRGVHLTLTTDPKLFRNIYEANKHFSKAFNRFMSFLSKRNKDVNGKSRRPVYLAAYEYTKSGLCHAHILIFGKKYLLD
QRVITQEWSRCGQGQIAHIYSIRRDGINWIYGRARPEEIQTGEYKNANEYLKKYLVKSLYSELDGSMYWAMNKRFFTFSK
SLKPAMMKRPKPEKYYKFVGIWTDEEFTDEINQAFKTGIPYDEIKKIHWKNLNNGLSCG
>WP_013179457.1 hypothetical protein [Methanococcus voltae]
MVRGRYPVFSGFKKFNKINLGKEKRNEGVYKYYNQDKTLLYVGVSNEVKLRLLSAYYGRSDYAVLENKKKLRQNIAYYKV
KYCGLDQARKIEHRIKKQCKYNLN

代码的<#>Output:

代码语言:javascript
复制
302.7408088235294   Methanococcus voltae
EN

回答 2

Code Review用户

回答已采纳

发布于 2018-10-30 08:46:45

加速处理的第一件事是只读取文件一次。species_name_function读取整个文件,并被多次调用。提取解析它的行并将其插入主循环中:

代码语言:javascript
复制
    for fasta in SeqIO.parse(open(infile), "fasta"):
        species_name = fasta.description.split('[', 1)[1].split(']', 1)[0]

现在,

dna = str(fasta.seq) freq = Counter(dna)

做不必要的工作。没有必要在计数前转换成字符串。

listofbases是未使用的,而genomesize是由一个很大的显式和计算的。我最好的猜测是你在做实验,试着想出

代码语言:javascript
复制
        genomesize = sum(freq[base] for base in "GALMFWKQESPVICYHRNDT")

使用这些更改并应用一些我认为不需要单独解释的进一步简化,我得到了以下信息:

代码语言:javascript
复制
from Bio import SeqIO
import sys
from collections import Counter

if __name__ == '__main__':
    infile = sys.argv[1]
    genomesizes = []

    for fasta in SeqIO.parse(open(infile), "fasta"):
        species_name = fasta.description.split('[', 1)[1].split(']', 1)[0]
        freq = Counter(fasta.seq)
        genomesizes.append(sum(freq[base] for base in "GALMFWKQESPVICYHRNDT"))

    avgproteinlength = sum(genomesizes) / len(genomesizes)
    towrite = str(avgproteinlength) + '\t' + species_name + '\t' + '\n'

    with open("proteinlength.csv", "a") as myfile:
        myfile.write(towrite)

但是,我不太相信显式输出文件名。此外,我倾向于通过支持多个文件来简化使用:

代码语言:javascript
复制
from Bio import SeqIO
import sys
from collections import Counter

if __name__ == '__main__':
    for infile in sys.argv[1:]:
        genomesizes = []

        for fasta in SeqIO.parse(open(infile), "fasta"):
            species_name = fasta.description.split('[', 1)[1].split(']', 1)[0]
            freq = Counter(fasta.seq)
            genomesizes.append(sum(freq[base] for base in "GALMFWKQESPVICYHRNDT"))

        avgproteinlength = sum(genomesizes) / len(genomesizes)
        print(str(avgproteinlength) + '\t' + species_name)

被称为

代码语言:javascript
复制
find . -name "*.faa" | xargs python proteinlengthgen.py >>proteinlength.csv
票数 4
EN

Code Review用户

发布于 2018-10-29 17:12:26

在Python中,不需要在使用变量之前初始化它们。因此,不需要avgproteinlength = 0

Python还有一个正式的样式指南,PEP8,它推荐使用lower_case_with_underscores作为变量和函数名,因此应该将其命名为average_protein_length

当我们重命名事物时,species_name_function可能更好地命名为get_species_name

现在,Bio.Seq对象实际上支持获取它们的长度,因此可以大大降低主块的复杂性(这也会减少运行时的复杂度)。你也可以从循环中提取物种的名字(它应该保持不变,如果没有,你现在也没有考虑到这一点)。您也不使用frequency变量,也不使用listofbases

代码语言:javascript
复制
from statistics import mean

if __name__ == '__main__':
    infile = sys.argv[1]
    species_name = get_species_name(infile)
    genomes = (fasta.seq for fasta in SeqIO.parse(open(infile), "fasta")) # a generator
    average_protein_length = mean(len(seq) for seq in genomes)

    with open("proteinlength.csv", "a") as myfile:
        myfile.write(f"{average_protein_length}\t{species_name}\t\n")

我使用了Python 3.4+ statistics.mean函数和Python 3.6+ F-字符串

当然,这假设你用手算出的基因组长度与基因组序列的长度是一样的。

票数 3
EN
页面原文内容由Code Review提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://codereview.stackexchange.com/questions/206516

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档