这个代码计算的是一些微生物蛋白质组的平均蛋白质长度(氨基酸数)。
从目前的情况来看,代码需要很长时间才能运行,我认为我在某个地方效率很低,但不能100%确定在哪里运行。
我运行代码的方式是使用bash-循环(对每个特定文件类型的示例(如*.faa)使用find )如下:
for FAA in $(find . -name "*.faa")
do
python proteinlengthgen.py $FAA
doneMy代码:
from Bio import SeqIO
import sys
from collections import Counter
def species_name_function(infile):
for record in SeqIO.parse(infile, "fasta"):
speciesname = record.description.split('[', 1)[1].split(']', 1)[0]
return speciesname
if __name__ == '__main__':
frequency = []
infile = sys.argv[1]
avgproteinlength = 0
totalproteinlength = 0
fastacounter = 0
for fasta in SeqIO.parse(open(infile), "fasta"):
dna = str(fasta.seq)
freq = Counter(dna)
fastacounter = fastacounter + 1
frequency.append(freq)
species_name = species_name_function(infile)
genomesize = freq['G'] + freq['A'] + freq['L'] + freq['M'] + freq['F'] + freq['W'] + freq['K'] + freq['Q'] + freq['E'] + freq['S'] + freq['P'] + freq['V'] + freq['I'] + freq['C'] + freq['Y'] + freq['H'] + freq['R'] + freq['N'] + freq['D'] + freq['T']
listofbases = ['G', 'A', 'L', 'M', 'F', 'W', 'K', 'Q', 'E', 'S', 'P', 'V', 'I', 'C', 'Y', 'H', 'R', 'N', 'D', 'T']
totalproteinlength = genomesize + totalproteinlength
avgproteinlength = totalproteinlength / fastacounter
towrite = str(avgproteinlength) + '\t' + species_name + '\t' + '\n'
with open("proteinlength.csv", "a") as myfile:
myfile.write(towrite)输入文件子集的Example (我不能使用整个文件,因为它们非常大):
>WP_013179448.1 DNA-directed RNA polymerase [Methanococcus voltae]
MYKILTIEDTIRIPPKMFGNPLKDNVQKVLMEKYEGILDKDLGFILAIEDIDQISEGDIIYGDGAAYHDTTFNILTYEPE
VHEMIEGEIVDIVEFGAFIRLGPLDGLIHISQVMDDYVAFDPQREAIIGKETGKVLEKGDKVRARIVAVSLKEDRKRGSK
IALTMRQPALGKLEWLEDEKLETMENAEF
>WP_013179449.1 DNA-directed RNA polymerase subunit E'' [Methanococcus voltae]
MARKGLKACTKCNYITHDDFCPICQHETSENIRGLLIILDPVNSEVAKIAQKDIKGKYALSVK
>WP_013179451.1 30S ribosomal protein S24e [Methanococcus voltae]
MDIKVVSEKNNPLLGRKEVKFALKYEGATPAVKDVKMKLVAILNANKELLVIDELAQEFGKMEANGYAKIYESEEAMNSI
EKKSIIEKNKIVEEAEEAQE
>WP_013179452.1 30S ribosomal protein S27ae [Methanococcus voltae]
MAQKTKKSDYYKIDGDKVERLKKSCPKCGEGVFMAEHLNRFACGKCGYMEYKKNEKAEKEE
>WP_013179453.1 hypothetical protein [Methanococcus voltae]
MNELNELKNPDKIDGKNNNTKNNNNNNNKDSNTENSITEIIKADNETQDNLSDLCVLEDIKTLKSKYKVYKTSKYLTKKD
INNIIEKDYDEIIMPQSIYKLLNEKNKSSMEKLRLCGIIVKTTDNVGRPKKITKYDKDKIKELLVDGKSVRKTAEIMDMK
KTTVWENIKDCMNEIKIEKFRKMIYEYKELLIMQERYGSYVESLFLELDIYINNEDMENALEILNKIIIYVKSEDKKD
>WP_013179454.1 integrase [Methanococcus voltae]
MKNKRINNNQKSKWETMRTDVINTQRNQNINSKNKQYRVKKHYCKEWLTKEELKVFIETIEYSEHKLFFKMLYGMALRVS
ELLKIKVQDLQLKEGVCKLWDTKTDYFQVCLIPDWLINDIKEYIALKSLDSSQELFKFNNRKYVWELAKKYSKMAELDKD
ISTHTFRRSRALHLLNDGVPLEKVSKYLRHKSIGTTMSYIRITVVDLKQELDKIDDWYEL
>WP_013179455.1 hypothetical protein [Methanococcus voltae]
MNTQNAIKKTLKTSKVNKNISNVIIGYSAILLDTYSNNKNLLLVKYDKLFKGFLNSSSITEKQYNKLYDTVLNSLF
>WP_013179456.1 hypothetical protein [Methanococcus voltae]
MVVKLVKISNGGYVSSLELKRINDIILSQLTNEFTIKDIVNMYSNKYDDCNNNAIAQKTRRLLNNHIESGVFTVRNALKN
KKIYKFKDVFVPASAGDTNTSLLFYSTSMKNSNHIEKQKKNNNKYNTNVNKPTITPDQIRVMAGIVNNPQIKSLKKERFK
SILHLNCKHMLNEEDRTELLENFKEYIIKASSQNLVLERTRYHKNKPKYITFPYLTRFTNSKQLKRQLAQYNCIFEQKAI
KYNRGVHLTLTTDPKLFRNIYEANKHFSKAFNRFMSFLSKRNKDVNGKSRRPVYLAAYEYTKSGLCHAHILIFGKKYLLD
QRVITQEWSRCGQGQIAHIYSIRRDGINWIYGRARPEEIQTGEYKNANEYLKKYLVKSLYSELDGSMYWAMNKRFFTFSK
SLKPAMMKRPKPEKYYKFVGIWTDEEFTDEINQAFKTGIPYDEIKKIHWKNLNNGLSCG
>WP_013179457.1 hypothetical protein [Methanococcus voltae]
MVRGRYPVFSGFKKFNKINLGKEKRNEGVYKYYNQDKTLLYVGVSNEVKLRLLSAYYGRSDYAVLENKKKLRQNIAYYKV
KYCGLDQARKIEHRIKKQCKYNLN代码的<#>Output:
302.7408088235294 Methanococcus voltae发布于 2018-10-30 08:46:45
加速处理的第一件事是只读取文件一次。species_name_function读取整个文件,并被多次调用。提取解析它的行并将其插入主循环中:
for fasta in SeqIO.parse(open(infile), "fasta"):
species_name = fasta.description.split('[', 1)[1].split(']', 1)[0]现在,
dna = str(fasta.seq) freq = Counter(dna)
做不必要的工作。没有必要在计数前转换成字符串。
listofbases是未使用的,而genomesize是由一个很大的显式和计算的。我最好的猜测是你在做实验,试着想出
genomesize = sum(freq[base] for base in "GALMFWKQESPVICYHRNDT")使用这些更改并应用一些我认为不需要单独解释的进一步简化,我得到了以下信息:
from Bio import SeqIO
import sys
from collections import Counter
if __name__ == '__main__':
infile = sys.argv[1]
genomesizes = []
for fasta in SeqIO.parse(open(infile), "fasta"):
species_name = fasta.description.split('[', 1)[1].split(']', 1)[0]
freq = Counter(fasta.seq)
genomesizes.append(sum(freq[base] for base in "GALMFWKQESPVICYHRNDT"))
avgproteinlength = sum(genomesizes) / len(genomesizes)
towrite = str(avgproteinlength) + '\t' + species_name + '\t' + '\n'
with open("proteinlength.csv", "a") as myfile:
myfile.write(towrite)但是,我不太相信显式输出文件名。此外,我倾向于通过支持多个文件来简化使用:
from Bio import SeqIO
import sys
from collections import Counter
if __name__ == '__main__':
for infile in sys.argv[1:]:
genomesizes = []
for fasta in SeqIO.parse(open(infile), "fasta"):
species_name = fasta.description.split('[', 1)[1].split(']', 1)[0]
freq = Counter(fasta.seq)
genomesizes.append(sum(freq[base] for base in "GALMFWKQESPVICYHRNDT"))
avgproteinlength = sum(genomesizes) / len(genomesizes)
print(str(avgproteinlength) + '\t' + species_name)被称为
find . -name "*.faa" | xargs python proteinlengthgen.py >>proteinlength.csv发布于 2018-10-29 17:12:26
在Python中,不需要在使用变量之前初始化它们。因此,不需要avgproteinlength = 0。
Python还有一个正式的样式指南,PEP8,它推荐使用lower_case_with_underscores作为变量和函数名,因此应该将其命名为average_protein_length。
当我们重命名事物时,species_name_function可能更好地命名为get_species_name。
现在,Bio.Seq对象实际上支持获取它们的长度,因此可以大大降低主块的复杂性(这也会减少运行时的复杂度)。你也可以从循环中提取物种的名字(它应该保持不变,如果没有,你现在也没有考虑到这一点)。您也不使用frequency变量,也不使用listofbases。
from statistics import mean
if __name__ == '__main__':
infile = sys.argv[1]
species_name = get_species_name(infile)
genomes = (fasta.seq for fasta in SeqIO.parse(open(infile), "fasta")) # a generator
average_protein_length = mean(len(seq) for seq in genomes)
with open("proteinlength.csv", "a") as myfile:
myfile.write(f"{average_protein_length}\t{species_name}\t\n")我使用了Python 3.4+ statistics.mean函数和Python 3.6+ F-字符串。
当然,这假设你用手算出的基因组长度与基因组序列的长度是一样的。
https://codereview.stackexchange.com/questions/206516
复制相似问题