文章/答案/技术大牛

发布

社区首页 >问答首页 >如何优化我的FASTA解析器Python脚本以使它在slurm上运行得更快？

问如何优化我的FASTA解析器Python脚本以使它在slurm上运行得更快？
EN

Stack Overflow用户

提问于 2022-06-21 10:24:50

回答 1查看 62关注 0票数 1

我希望我贴在正确的地方？我的剧本在小基因组上运行得很好，但要与哺乳动物基因组一起工作需要几个小时和几天的时间。我尝试了许多不同的事情，但我没有想法。你能告诉我为什么这个剧本这么慢吗？这个脚本解析fasta文件，它与ID一起工作，开头是一个>或不是。它还计算N50和L50等基因组指标，最后返回一个包含seq及其长度的dict。

非常感谢!

from collections import OrderedDict
import argparse

parser = argparse.ArgumentParser(description = "N50 parser")

#parser = argparse.ArgumentParser(prog="N50Parser.py", usage="N50")

parser.add_argument("-i", "--input", action="store", dest="input", required=True,
                    help="Input file with sequences")
parser.add_argument("-o", "--output", action="store", dest="output",
                    help="output file")
parser.add_argument("-o2", "--output2", action="store", dest="output2",
                    help="output file")


args = parser.parse_args()



def read_file(fasta_file):
    "Parse fasta file"
    descriptiondict = OrderedDict()
    dictionnary = OrderedDict()
    with open(fasta_file, 'r') as infile:
        for line in infile:
            record = line.strip()
            if record and record[0] == '>':
                seqid = record.split(" ")[0][1:]
                print(seqid)
                dictionnary[seqid] = ""
                toto = record.split(" ", 1)
                if len(toto) >= 2 :
                    description = toto[1]
                    descriptiondict[seqid] = description
                else:
                    descriptiondict[seqid] = ""
                continue
            dictionnary[seqid] += record

    return dictionnary, descriptiondict

seqdict, descriptdict = read_file(args.input)




lengthdict = OrderedDict()
for sequenceid in seqdict:
    lengthdict[sequenceid] = len(seqdict[sequenceid])



length = sum(lengthdict.values())


N_number = sum([seqdict[seqid].count("N") for seqid in seqdict])

print(N_number)
print(length)
all_len = sorted(lengthdict.values(), reverse=True)
print(all_len)

if length > 0:
    acum = 0

    for y in range (len(all_len)):
        if acum <= length / 2:
            acum = all_len[y] + acum
            n = y #L50
        else:
            break


    n = n + 1
    print("The L50 is", n)
    print("The N50 is", all_len[n-1])

with open(args.output, 'w') as outfile:
    outfile.write("L50\t{}\n".format(n))
    outfile.write("N50\t{}\n".format(all_len[n-1]))


with open(args.output2, "w") as file:
    for key, value in lengthdict.items():
        file.write(f"{key} : {value}\n")

fasta

genome

python

parsing

bioinformatics

回答 1

Stack Overflow用户

发布于 2022-06-21 14:33:38

别再发明轮子了。使用建立良好的开放源码库来执行常见任务。在本例中，使用Biopython解析FASTA文件，例如：

from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))

使用pip或conda安装Biopython

conda create --channel bioconda --name biopython biopython

票数 2

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/72699134

复制

相似问题

问如何优化我的FASTA解析器Python脚本以使它在slurm上运行得更快？
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问如何优化我的FASTA解析器Python脚本以使它在slurm上运行得更快？EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问如何优化我的FASTA解析器Python脚本以使它在slurm上运行得更快？
EN