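I wrote this code to get a large number of BLAST results, but it seems very slow because I use two 'for' loops to iterate over two files. So I wonder whether there is a faster, greedier way to narrow down the iteration.
Here is the code: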

    import os
    from Bio import SeqIO
    from Bio.Blast.Applications import NcbiblastnCommandline

    for tf_line in SeqIO.parse('deneme2.txt', 'fasta'):
        # Split the header into its whitespace-separated fields.
        tf_line.description = tf_line.description.split()
        tempfile = open('tempfile.txt', 'w')
        # The whole cDNA file is re-read for every record of the first file.
        for cd_line in SeqIO.parse('Mus_musculus.GRCm38.74.cdna.all.fa', 'fasta'):
            if cd_line.id == tf_line.description[1]:
                tempfile.write('>' + cd_line.id + '\n' +
                               str(cd_line.seq)[int(tf_line.description[2]) - 100:
                                                int(tf_line.description[3]) + 100])
                tempfile.close()
                os.system('makeblastdb -in tempfile.txt -dbtype nucl '
                          '-out tempfile.db -title \'tempfile\'')
                cline = NcbiblastnCommandline(query='SRR029153.fasta',
                                              db="tempfile.db",
                                              outfmt=7,
                                              out=(tf_line.description[0] + ' ' +
                                                   tf_line.description[1]))
                stdout, stderr = cline()

'deneme.txt' is about 30 MB and looks something like this:
999 029153.93098 ENSMUST00000103567 999 1147 151 029153.83280 ENSMUST00000181483 151 425 CAGGTTGAC ENSMUST00000184883 174 1415 TGGCACCTTTGC .
The 'Mus_musculus.GRCm38.74.cdna.all.fa' file is about 170 MB and looks like this:
ENSMUST00000181483螨虫. ENSMUST00000184883 ATCTTTTTTCTTTCAGGG.
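The 'Mus_musculus.GRCm38.74.cdna.all.fa' file contains sequence IDs (ENSMUST.). I have to find the matches between the 'deneme.txt' file and the 'Mus_musculus.GRCm38.74.cdna.all.fa' file.
It should take 4-5 hours, but with this code it takes at least 10 hours.
Any help would be appreciated, because I have to get rid of this brute-force algorithm and be greedier. Thanks.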
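Posted on 2014-02-19 22:34:30
I think this still produces the same BLAST runs, but it should be faster. Please read the comments in the code for further optimizations: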

    import os
    from Bio import SeqIO
    from Bio.Blast.Applications import NcbiblastnCommandline

    # Build a lookup table once instead of re-reading the cDNA file per record.
    # Note: this unpacking expects exactly three fields per description.
    tf_data = {key: (int(val1), int(val2)) for key, val1, val2 in
               (line.description.split() for line in
                SeqIO.parse('deneme2.txt', 'fasta'))}

    for cd_line in SeqIO.parse('Mus_musculus.GRCm38.74.cdna.all.fa', 'fasta'):
        if cd_line.id in tf_data:
            tempfile = open('tempfile.txt', 'w')
            tf_val1, tf_val2 = tf_data[cd_line.id]
            # If it is likely that the same tf_data record is used many times,
            # move the math to the first line; if, on the other hand, it is
            # very likely that most records in tf_data won't be used, then
            # move the int casts back to the line below.
            tempfile.write('>{0}\n{1}'.format(
                cd_line.id,
                str(cd_line.seq)[tf_val1 - 100: tf_val2 + 100]))
            tempfile.close()
            os.system('makeblastdb -in tempfile.txt -dbtype nucl '
                      '-out tempfile.db -title \'tempfile\'')
            cline = NcbiblastnCommandline(
                query='SRR029153.fasta',
                db="tempfile.db",
                outfmt=7,
                out=("{0} {1}".format(tf_val1, tf_val2)))
            # stdout and stderr are not used, so don't assign them.
            cline()

https://stackoverflow.com/questions/21890860
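
The lookup-table idea can be pushed a little further. What follows is a minimal sketch, not the answerer's code. It assumes each header in the query file has four whitespace-separated fields (read ID, transcript ID, start, end), which is what the question's indexing into `description[0]` through `description[3]` suggests, and it keeps a list of regions per transcript so that several reads pointing at the same transcript do not overwrite each other. The names `build_region_index` and `extract_windows`, the `flank` parameter, and the plain `(id, sequence)` tuples are all illustrative assumptions; with Biopython one would pass in `rec.description` and `(rec.id, str(rec.seq))` from `SeqIO.parse`. The window start is clamped at 0, because a negative slice start in Python counts from the end of the sequence.

```python
from collections import defaultdict


def build_region_index(descriptions):
    """Map transcript ID -> list of (read_id, start, end).

    `descriptions` is an iterable of header strings such as
    '029153.93098 ENSMUST00000103567 999 1147' (assumed four-field layout).
    """
    index = defaultdict(list)
    for description in descriptions:
        read_id, transcript_id, start, end = description.split()
        index[transcript_id].append((read_id, int(start), int(end)))
    return index


def extract_windows(transcripts, index, flank=100):
    """Yield (read_id, transcript_id, subsequence) in one pass over transcripts.

    `transcripts` is an iterable of (transcript_id, sequence) pairs.
    """
    for transcript_id, sequence in transcripts:
        # A dictionary lookup replaces the inner loop over the whole cDNA file.
        for read_id, start, end in index.get(transcript_id, ()):
            lo = max(0, start - flank)  # clamp: a negative start would wrap around
            yield read_id, transcript_id, sequence[lo:end + flank]


if __name__ == "__main__":
    # Tiny made-up data, only to show the call pattern.
    headers = ["r1 T1 150 160", "r2 T1 1 10", "r3 T9 0 4"]
    transcripts = [("T0", "A" * 50), ("T1", "ACGT" * 100)]
    index = build_region_index(headers)
    for hit in extract_windows(transcripts, index, flank=3):
        print(hit)
```

Each transcript file record is visited once, and the work per record is a hash lookup rather than a scan of the other file. Whether the extracted windows should then go into one BLAST database or one database per window depends on what each BLAST run is meant to compare against, so that step is left out of the sketch.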