我尝试在一个'>‘字符之前添加多个行到一个列表中,这样我就可以将它转换为字典中的值。例如,我试图使:
> 1
AAA
CCC
> 2成为AAACCC。
代码如下:
def parse_fasta(path):
with open(path) as thefile:
label = []
sequences = []
for k, line in enumerate(thefile):
if line.startswith('>'):
labeler = line.strip('>').strip('\n')
label.append(labeler)
else:
seqfix = ''.join(line.strip('\n'))
sequences.append(seqfix)
dict_version = {k: v for k, v in zip(label, sequences)}
return dict_version
parse_fasta('small.fasta')发布于 2019-08-12 06:33:47
你可以边走边创建字典。这里有一个方法可以做到这一点。
编辑:删除了defaultdict (没有模块)
from pprint import pprint
dict_version = {}
with open('fasta_sample.txt', 'r') as f:
for line in f:
line = line.rstrip()
if line.startswith('>'):
key = line[1:]
else:
if key in dict_version:
dict_version[key] += line
else:
dict_version[key] = line
pprint(dict_version)示例文件:
>1FN3:A|PDBID|CHAIN|SEQUENCE
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNAL
SALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
>5OKT:A|PDBID|CHAIN|SEQUENCE
MGSSHHHHHHSSGLVPRGSHMELRVGNRYRLGRKIGSGSFGDIYLGTDIAAGEEVAIKLECVKTKHPQLHIESKIYKMMQ
GGVGIPTIRWCGAEGDYNVMVMELLGPSLEDLFNFCSRKFSLKTVLLLADQMISRIEYIHSKNFIHRDVKPDNFLMGLGK
KGNLVYIIDFGLAKKYRDARTHQHIPYRENKNLTGTARYASINTHLGIEQSRRDDLESLGYVLMYFNLGSLPWQGLKAAT
KRQKYERISEKKMSTPIEVLCKGYPSEFATYLNFCRSLRFDDKPDYSYLRQLFRNLFHRQGFSYDYVFDWNMLK*
>2PAB:A|PDBID|CHAIN|SEQUENCE
GPTGTGESKCPLMVKVLDAVRGSPAINVAVHVFRKAADDTWEPFASGKTSESGELHGLTTEEQFVEGIYKVEIDTKSYWK
ALGISPFHEHAEVVFTANDSGPRRYTIAALLSPYSYSTTAVVTNPKE*
>3IDP:B|PDBID|CHAIN|SEQUENCE
HHHHHHDRNRMKTLGRRDSSDDWEIPDGQITVGQRIGSGSFGTVYKGKWHGDVAVKMLNVTAPTPQQLQAFKNEVGVLRK
TRHVNILLFMGYSTKPQLAIVTQWCEGSSLYHHLHIIETKFEMIKLIDIARQTAQGMDYLHAKSIIHRDLKSNNIFLHED
LTVKIGDFGLATEKSRWSGSHQFEQLSGSILWMAPEVIRMQDKNPYSFQSDVYAFGIVLYELMTGQLPYSNINNRDQIIF
MVGRGYLSPDLSKVRSNCPKAMKRLMAECLKKKRDERPLFPQILASIELLARSLPKIHRS
>4QUD:A|PDBID|CHAIN|SEQUENCE
MENTENSVDSKSIKNLEPKIIHGSESMDSGISLDNSYKMDYPEMGLCIIINNKNFHKSTGMTSRSGTDVDAANLRETFRN
LKYEVRNKNDLTREEIVELMRDVSKEDHSKRSSFVCVLLSHGEEGIIFGTNGPVDLKKIFNFFRGDRCRSLTGKPKLFII
QACRGTELDCGIETDSGVDDDMACHKIPVEADFLYAYSTAPGYYSWRNSKDGSWFIQSLCAMLKQYADKLEFMHILTRVN
RKVATEFESFSFDATFHAKKQIPCIVSMLTKELYFYH创建的字典的精美印刷体是:
{'1FN3:A|PDBID|CHAIN|SEQUENCE': 'VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR',
'2PAB:A|PDBID|CHAIN|SEQUENCE': 'GPTGTGESKCPLMVKVLDAVRGSPAINVAVHVFRKAADDTWEPFASGKTSESGELHGLTTEEQFVEGIYKVEIDTKSYWKALGISPFHEHAEVVFTANDSGPRRYTIAALLSPYSYSTTAVVTNPKE*',
'3IDP:B|PDBID|CHAIN|SEQUENCE': 'HHHHHHDRNRMKTLGRRDSSDDWEIPDGQITVGQRIGSGSFGTVYKGKWHGDVAVKMLNVTAPTPQQLQAFKNEVGVLRKTRHVNILLFMGYSTKPQLAIVTQWCEGSSLYHHLHIIETKFEMIKLIDIARQTAQGMDYLHAKSIIHRDLKSNNIFLHEDLTVKIGDFGLATEKSRWSGSHQFEQLSGSILWMAPEVIRMQDKNPYSFQSDVYAFGIVLYELMTGQLPYSNINNRDQIIFMVGRGYLSPDLSKVRSNCPKAMKRLMAECLKKKRDERPLFPQILASIELLARSLPKIHRS',
'4QUD:A|PDBID|CHAIN|SEQUENCE': 'MENTENSVDSKSIKNLEPKIIHGSESMDSGISLDNSYKMDYPEMGLCIIINNKNFHKSTGMTSRSGTDVDAANLRETFRNLKYEVRNKNDLTREEIVELMRDVSKEDHSKRSSFVCVLLSHGEEGIIFGTNGPVDLKKIFNFFRGDRCRSLTGKPKLFIIQACRGTELDCGIETDSGVDDDMACHKIPVEADFLYAYSTAPGYYSWRNSKDGSWFIQSLCAMLKQYADKLEFMHILTRVNRKVATEFESFSFDATFHAKKQIPCIVSMLTKELYFYH',
'5OKT:A|PDBID|CHAIN|SEQUENCE': 'MGSSHHHHHHSSGLVPRGSHMELRVGNRYRLGRKIGSGSFGDIYLGTDIAAGEEVAIKLECVKTKHPQLHIESKIYKMMQGGVGIPTIRWCGAEGDYNVMVMELLGPSLEDLFNFCSRKFSLKTVLLLADQMISRIEYIHSKNFIHRDVKPDNFLMGLGKKGNLVYIIDFGLAKKYRDARTHQHIPYRENKNLTGTARYASINTHLGIEQSRRDDLESLGYVLMYFNLGSLPWQGLKAATKRQKYERISEKKMSTPIEVLCKGYPSEFATYLNFCRSLRFDDKPDYSYLRQLFRNLFHRQGFSYDYVFDWNMLK*'}编辑:在您尝试之后执行解决方案:
from pprint import pprint
def parse_fasta(path):
with open(path) as thefile:
label = []
sequences = ''
total_seq = []
for line in thefile:
line = line.strip()
if len(line) == 0:
continue
if line.startswith('>'):
line = line.strip('>')
label.append(line)
if len(sequences) > 0:
total_seq.append(sequences)
sequences = ''
else:
sequences += line
total_seq.append(sequences)
dict_version = {k: v for k, v in zip(label, total_seq)}
return dict_version
d = parse_fasta('fasta_sample.txt')
pprint(d)您将看到我做了一些更改以获得正确的输出。我添加了一个数组total_seq来保存每个序列标头的序列。(您没有这个,并且在您的解决方案中是一个问题)。代码中的joins没有做任何事情。虽然您的想法是正确的,但值只是一个字符串。在修改后的代码中,您将看到join用于将一个头id的累积序列连接到一个fasta字符串中。
我测试了空行,如果该行是空的,则执行continue (len(line) == 0)。
有一个测试if len(sequences) > 0,看看是否已经看到了任何序列。在第一张唱片里他们不会这么做的。它会在看到任何序列之前看到ID。
for循环完成后,需要添加最后一个序列
total_seq.append(sequences)
因为当检测到新的ID时,除了最后的序列之外的所有其它序列都被添加到total_seq。
我希望这个解释是有帮助的,因为它更紧跟您的代码。
https://stackoverflow.com/questions/57453966
复制相似问题