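I need to download only complete genome sequences from NCBI (GenBank (full) format). I am interested in "complete genome" records, not "whole genome" (shotgun) records.
My script: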
from Bio import Entrez
Entrez.email = "asiakXX@wp.pl"
gatunek='Escherichia[ORGN]'
handle = Entrez.esearch(db='nucleotide',
term=gatunek, property='complete genome' )#title='complete genome[title]')
result = Entrez.read(handle)
As a result I only get small fragments of genomes, about 484 bp in size:
LOCUS NZ_KE350773 484 bp DNA linear CON 23-AUG-2013
DEFINITION Escherichia coli E1777 genomic scaffold scaffold9_G, whole genome
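shotgun sequence.
I know how to do this manually through the NCBI website, but that is very time consuming. The query I use there:
escherichia[orgn] AND complete genome[title]
With it I get multiple genomes, each about 5,154,862 bp in size, and that is what I need to do through Entrez.esearch.
Posted on 2013-10-19 04:36:12
You have already done the hard part and worked out the query,
escherichia[orgn] AND complete genome[title]
so you can use that same string as the search term through Biopython as well!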
from Bio import Entrez
Entrez.email = "asiakXX@wp.pl"
search_term = "escherichia[orgn] AND complete genome[title]"
handle = Entrez.esearch(db='nucleotide', term=search_term)
result = Entrez.read(handle)
handle.close()
print(result['Count'])  # added parentheses
At the moment this gives 140 results, starting with ID 545778205, the same as the website: http://www.ncbi.nlm.nih.gov/nuccore/?term=escherichia%5Borgn%5D+AND+complete+genome%5Btitle%5D
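One caveat: esearch returns only the first 20 IDs by default, so with 140 hits the 'IdList' in the result is much shorter than 'Count'. Below is a minimal sketch of paging through all hits with the retstart and retmax parameters of esearch. The helper names page_offsets and fetch_all_ids and the page size of 100 are my own assumptions for illustration, not part of Biopython.

```python
def page_offsets(total, page_size):
    """Return the retstart offsets needed to cover `total` hits."""
    return list(range(0, total, page_size))


def fetch_all_ids(term, email, db="nucleotide", page_size=100):
    """Collect every ID matching `term` by paging through esearch results.

    Hypothetical helper: needs Biopython and network access when called.
    """
    # Imported lazily so page_offsets can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email

    # The first request only asks for the total hit count.
    handle = Entrez.esearch(db=db, term=term, retmax=0)
    total = int(Entrez.read(handle)["Count"])
    handle.close()

    ids = []
    for start in page_offsets(total, page_size):
        handle = Entrez.esearch(db=db, term=term,
                                retstart=start, retmax=page_size)
        ids.extend(Entrez.read(handle)["IdList"])
        handle.close()
    return ids
```

Calling fetch_all_ids("escherichia[orgn] AND complete genome[title]", "you@example.com") should then return one ID per hit rather than just the first page.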
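Posted on 2013-11-24 07:35:53
Your question is clear, but the full answer is long. The code I provide generates a .fasta file for each E. coli genome sequence you want and, yes, only for the "Complete genome" entries in NCBI.
You will see that NCBI (http://www.ncbi.nlm.nih.gov/genome/167) lists only six complete E. coli reference genomes:

To help you, here are their genome/RefSeq links:
Here is my code for parsing the complete genome sequences into .fasta files...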
# Imports
from Bio import Entrez
from Bio import SeqIO

#############################
# Retrieve NCBI Data Online #
#############################
Entrez.email = "asiak@wp.pl"  # Always tell NCBI who you are
genomeAccessions = ['NC_000913', 'NC_002695', 'NC_011750', 'NC_011751', 'NC_017634', 'NC_018658']
search = " OR ".join(genomeAccessions)  # OR, because space-separated terms are ANDed together
handle = Entrez.esearch(db="nucleotide", term=search, retmode="xml")
genomeIds = Entrez.read(handle)['IdList']
handle.close()

###############################
# Generate Genome Fasta files #
###############################
sequences = []  # store your sequences in a list
headers = []  # store genome names in a list (db_xref ids)
for i, genomeId in enumerate(genomeIds):
    # fetch one genome at a time and store each genome's .gb in a separate file
    record = Entrez.efetch(db="nucleotide", id=genomeId, rettype="gb", retmode="text")
    gb_name = "genBankRecord_" + str(i) + ".gb"
    with open(gb_name, "w") as file_out:
        file_out.write(record.read())
    record.close()
    genome = SeqIO.read(gb_name, "genbank")  # parse the GenBank file back in
    header = genome.features[0].qualifiers['db_xref'][0]  # name the genome using its db_xref ID
    sequence = str(genome.seq)  # obtain genome sequence
    headers.append('>' + header)  # store genome name in list
    sequences.append(sequence)  # store sequence in list
    with open("genome" + str(i) + ".fasta", "w") as fasta_out:  # one .fasta file per genome
        fasta_out.write('>' + header + '\n')  # >header ... followed by:
        fasta_out.write(sequence + '\n')  # sequence ...
Let me know how it goes! Andy
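A FASTA entry conventionally has the '>' header on a line of its own, followed by the sequence wrapped at a fixed width. Here is a minimal sketch of that layout in plain Python. The function name format_fasta and the default width of 60 are assumptions for illustration. If you already have a SeqRecord, SeqIO.write(record, handle, "fasta") produces a FASTA file without any hand formatting.

```python
def format_fasta(header, sequence, width=60):
    """Return one FASTA entry: a '>' header line plus wrapped sequence lines."""
    lines = [">" + header]
    # Slice the sequence into chunks of at most `width` characters.
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"
```

The returned string can be passed straight to fasta_out.write() in place of the two separate writes.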
Posted on 2014-10-14 03:49:51
This worked for me...
import os
from Bio import Entrez

Entrez.email = "asiakXX@wp.pl"  # replace with your own address
os.makedirs('generated', exist_ok=True)  # the output directory must exist
search_term = 'escherichia coli[orgn] AND complete genome[title]'
handle = Entrez.esearch(db='nucleotide', term=search_term)
genome_ids = Entrez.read(handle)['IdList']
handle.close()
for genome_id in genome_ids:
    record = Entrez.efetch(db="nucleotide", id=genome_id, rettype="gb", retmode="text")
    filename = 'generated/genBankRecord_{}.gb'.format(genome_id)
    print('Writing:{}'.format(filename))
    with open(filename, 'w') as f:
        f.write(record.read())
    record.close()
https://stackoverflow.com/questions/18461629