我正在研究Rosalind问题,特别是题为“共识和概况”的问题。
数据输入如下:
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT上面是七个带有ID或头部的DNA序列,输出应该是这样的:
ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6 现在,到目前为止,这是我的代码,我想生成上面的矩阵,它按列计算所有A的C,G和T:
import sys
import Bio.SeqIO
count = {}
count=OrderedDict()
list_seq = []
for seq in Bio.SeqIO.parse(sys.stdin, 'fasta'):
sequn = str(seq.seq)
print "sequn",sequn
for i,nuc in enumerate(sequn):
print "nuc", nuc
key = (nuc,i)
try:
count[key] = count[key]+1
except KeyError:
count[key] = 1字典计数的输出如下所示:
([(('A', 0), 5), (('T', 1), 5), (('C', 2), 1), (('C', 3), 4), (('A', 4), 5),
(('G', 5), 1), (('C', 6), 6), (('T', 7), 6), (('G', 0), 1), (('G', 1), 1),
(('G', 2), 6), (('A', 5), 5), (('G', 3), 3), (('T', 5), 1), (('A', 1), 1),
(('C', 7), 1), (('T', 0), 1), (('C', 4), 2), (('T', 6), 1)])我想从上面的输出字典生成输出矩阵,该怎么做呢?
在此之前非常感谢。
发布于 2015-04-08 13:53:56
d = {}
count = ([(('A', 0), 5), (('T', 1), 5), (('C', 2), 1), (('C', 3), 4), (('A', 4), 5),(('G', 5), 1), (('C', 6), 6), (('T', 7), 6), (('G', 0), 1), (('G', 1), 1),
(('G', 2), 6), (('A', 5), 5), (('G', 3), 3), (('T', 5), 1), (('A', 1), 1),
(('C', 7), 1), (('T', 0), 1), (('C', 4), 2), (('T', 6), 1)])
for each in count:
if each[0][0] in d:
li = d[each[0][0]]
spot = each[0][1]
li[spot] = each[1]
d[each[0][0]] = li
else:
li=[0]*8
spot = each[0][1]
li[spot] = each[1]
d[each[0][0]] = li
for each in sorted(d):
print each," ",d[each]
sol=""
for each in range(8):
sol+=max(d, key=lambda x:d[x][each])
print sol我只是迭代了整个字典,并按照你的问题创建了一个新的字典。
但您可以在修改字典计数的同时执行此操作。我假设列表长度为8。如果大于8,则应相应修改以上内容。
如果你能直接编辑问题就好了。
发布于 2015-05-21 15:48:41
这是一个使用BioPython和collections.Counter的解决方案
from Bio import SeqIO
from collections import Counter
def main(fasta_file):
"""
>>> print main(r'./data/CONS_sample.fa')
ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6
"""
with open(fasta_file) as fh:
dna_strings = [str(fasta.seq) for fasta in SeqIO.parse(fh, 'fasta')]
transposed = zip(*dna_strings)
counters = [Counter(column) for column in transposed]
# create consensus
consensus = ''.join([counter.most_common(1)[0][0] for counter in counters])
# create profile matrix
matrix = ''
for base in 'ACGT':
matrix += '{}:'.format(base)
for counter in counters:
matrix += ' {}'.format(counter[base])
matrix += '\n'
matrix = matrix.rstrip()
return '\n'.join([consensus, matrix])
if __name__ == '__main__':
import doctest
doctest.testmod()
print main(r'./data/CONS.txt')https://stackoverflow.com/questions/29506417
复制相似问题