我希望我贴在正确的地方?我的剧本在小基因组上运行得很好,但要与哺乳动物基因组一起工作需要几个小时和几天的时间。我尝试了许多不同的事情,但我没有想法。你能告诉我为什么这个剧本这么慢吗?这个脚本解析fasta文件,它与ID一起工作,开头是一个>或不是。它还计算N50和L50等基因组指标,最后返回一个包含seq及其长度的dict。
非常感谢!
from collections import OrderedDict
import argparse
parser = argparse.ArgumentParser(description = "N50 parser")
#parser = argparse.ArgumentParser(prog="N50Parser.py", usage="N50")
parser.add_argument("-i", "--input", action="store", dest="input", required=True,
help="Input file with sequences")
parser.add_argument("-o", "--output", action="store", dest="output",
help="output file")
parser.add_argument("-o2", "--output2", action="store", dest="output2",
help="output file")
args = parser.parse_args()
def read_file(fasta_file):
"Parse fasta file"
descriptiondict = OrderedDict()
dictionnary = OrderedDict()
with open(fasta_file, 'r') as infile:
for line in infile:
record = line.strip()
if record and record[0] == '>':
seqid = record.split(" ")[0][1:]
print(seqid)
dictionnary[seqid] = ""
toto = record.split(" ", 1)
if len(toto) >= 2 :
description = toto[1]
descriptiondict[seqid] = description
else:
descriptiondict[seqid] = ""
continue
dictionnary[seqid] += record
return dictionnary, descriptiondict
seqdict, descriptdict = read_file(args.input)
lengthdict = OrderedDict()
for sequenceid in seqdict:
lengthdict[sequenceid] = len(seqdict[sequenceid])
length = sum(lengthdict.values())
N_number = sum([seqdict[seqid].count("N") for seqid in seqdict])
print(N_number)
print(length)
all_len = sorted(lengthdict.values(), reverse=True)
print(all_len)
if length > 0:
acum = 0
for y in range (len(all_len)):
if acum <= length / 2:
acum = all_len[y] + acum
n = y #L50
else:
break
n = n + 1
print("The L50 is", n)
print("The N50 is", all_len[n-1])
with open(args.output, 'w') as outfile:
outfile.write("L50\t{}\n".format(n))
outfile.write("N50\t{}\n".format(all_len[n-1]))
with open(args.output2, "w") as file:
for key, value in lengthdict.items():
file.write(f"{key} : {value}\n")发布于 2022-06-21 14:33:38
别再发明轮子了。使用建立良好的开放源码库来执行常见任务。在本例中,使用Biopython解析FASTA文件,例如:
from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
print(seq_record.id)
print(repr(seq_record.seq))
print(len(seq_record))使用pip或conda安装Biopython
conda create --channel bioconda --name biopython biopythonhttps://stackoverflow.com/questions/72699134
复制相似问题