首先,我想说我是python编程的新手。我花了将近20个小时来弄清楚这一点,然而,我的代码仍然令人沮丧。我有一个由ID和DNA序列组成的fasta文件。我想读入FASTA数据并做一些计算工作。
FASTA文件如下所示:
>1111886
AACGAACGCTGGCGGCATGCCTAACACATGCAAGTCGAACGA…
>1111885
AGAGTTTGATCCTGGCTCAGAATGAACGCTGGCGGCGTGCCT…
>1111883
GCTGGCGGCGTGCCTAACACATGTAAGTCGAACGGGACTGGG…我编写了以下代码来读取fasta数据,并进行一些分析,例如根据ID计算gc含量、平均序列长度等。我在docstring中对我想要做的事情进行了详细描述。我很感谢任何人谁可以改进我的代码,特别是如何获得每个ID的gc内容。
class fasta(object):
def __init__(self, filename):
self.filename = filename
self.num_sequences = None
self.sequences = {} #{seq_id} = sequence
def parse_file(self):
**"""Reads in the sequence data contained in the filename associated with this instance of the class.
Stores both the sequence and the optional comment for each ID."""**
with open(self.filename) as f:
return f.read().split('>')[1:]
def get_info(self):
**"""Returns a description of the class in a pretty string format.
The description should include the filename for the instance and the number of sequences."""**
for line in file(self.filename, 'r'):
if line.startswith('>'):
self.num_sequences += 1
return self.num_sequences
def compute_gc_content(self,some_id):
**"""compute the gc conent for sequence ID some_id. If some_id, return an appropriate error values"""**
baseFrequency = {}
for line in file(self.filename, 'r'):
if not line.startswith(">"):
for base in sequence:
baseFrequency[base] = baseFrequency.get(base,0)+1
items = baseFrequency.items()
items.sort()
for i in items:
gc=(baseFrequency['G'] + baseFrequency['C'])/float(len(sequence))
return gc
def sequence_statistics(self):
**"""returns a dictionary containing
The average sequence length
The average gc content"""**
baseFrequency = {}
for line in file(self.filename, 'r'):
if not line.startswith(">"):
for base in sequence:
baseFrequency[base] = baseFrequency.get(base,0)+1
items = baseFrequency.items()
items.sort()
for i in items:
gc=(baseFrequency['G'] + baseFrequency['C'])/float(len(sequence))
aveseq=sum(len(sequence))/float(self.count)
return (gc,aveseq)
def get_all_kmers(self, k=8):
**"""get all kmer counts of size k in fasta file. Returns a dictionary with keys equal to the kmers
and values equal to the counts"""**
t={}
for x in range (len(self.sequence)+1-k):
kmer=self.sequence[x:x+k]
t[kmer]=f.get(kmer,0)+1
kmers = get_all_kmers(k=8)
return(t,len(kmers))
def query_sequence_id(self, some_id):
**"""query sequence ids for some_id. If some_id does not exist in the class, return
a string error message"""**
for line in file(self.filename, 'r'):
if id in line:
print "The id exists"
else:
print "The id does not exist"发布于 2014-02-12 05:16:06
这应该能够读取和解析您的文件。将其保存在字典中,id作为关键字,info\data作为每个id的子字典。
如果你不了解所有这些内容,请询问。
def parse_file(self):
"""Reads in the sequence data contained in the filename associated with this instance of the class.
Stores both the sequence and the optional comment for each ID."""
def parser(filename):
seq = []
with open(filename, 'r') as f:
for line in f:
if line.startswith('>'):
if seq:
yield seqId, seqInfo, ''.join(seq)
line = line.split()
seqId = line[0][1:]
if len(line) > 1:
seqInfo = ' '.join(line[1:])
else:
seqInfo = ''
seq = []
else:
seq.append(line.replace('\n', ''))
if seq:
yield seqId, seqInfo, ''.join(seq)
sequences = parser(self.filename)
self.sequences = {sequenceId: {'info': sequenceInfo, 'data': sequenceData} for (sequenceId, sequenceInfo, sequenceData) in sequences}https://stackoverflow.com/questions/21712279
复制相似问题