I have a fastq file like this (part of the file):
@A80HNBABXX:4:1:1344:2224#0/1
AAAACATCAGTATCCATCAGGATCAGTTTGGAAAGGGAGAGGCAATTTTTCCTAAACATGTGTTCAAATGGTCTGAGACAGACGTTAAAATGAAAAGGGG
+
\\YYWX\PX^YT[TVYaTY]^\^H\`^`a`\UZU__TTbSbb^\a^^^`[GOVVXLXMV[Y_^a^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@A80HNBABXX:4:1:1515:2211#0/1
TTAGAAACTATGGGATTATTCACTCCCTAGGTACTGAGAATGGAAACTTTCTTTGCCTTAATCGTTGACATCCCCTCTTTTAGGTTCTTGCTTCCTAACA
+
ee^e^\`ad`eeee\dd\ddddYeebdd\ddaYbdcYc`\bac^YX[V^\Ybb]]^bdbaZ]ZZ\^K\^]VPNME][`_``Ubb_bYddZbbbYbbYT^_
@A80HNBABXX:4:1:1538:2220#0/1
CTGAGTAAATCATATACTCAATGATTTTTTTATGTGTGTGCATGTGTGCTGTTGATATTCTTCAGTACCAAAACCCATCATCTTATTTGCATAGGGAAGT
+
fff^fd\c^d^Ycac`dcdcded`effdfedb]beeeeecd^ddccdddddfff`eaeeeffdTecacaLV[QRPa\\a\`]aY]ZZ[XYcccYcZ\\]Y
@A80HNBABXX:4:1:1666:2222#0/1
CTGCCAGCACGCTGTCACCTCTCAATAACAGTGAGTGTAATGGCCATACTCTTGATTTGGTTTTTGCCTTATGAATCAGTGGCTAAAAATATTATTTAAT
+
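deeee`bbcddddad\bbbbeee\ecYZcc^dd^ddd\\`]``L`ccabaVJ`MZ^aaYMbbb__PYWY]RWNUUab`Y`BBBBBBBBBBBBBBBBBBBB
A FASTQ file uses four lines per sequence. Line 1 begins with the '@' character, followed by a sequence identifier. Line 2 holds the DNA sequence letters. Line 3 begins with a '+' character. Line 4 encodes the quality values for the sequence in line 2 (the part after the "+" and before the next "@"), and must contain the same number of symbols as there are letters in the sequence.
I want to read the fastq file into a dictionary like this (the key is the DNA sequence and the value is the quality string; lines beginning with "@" and "+" can be discarded):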
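To make the four-line layout concrete, here is a minimal Python 3 sketch that walks a FASTQ stream one record at a time and checks the rules just described. The function name `iter_fastq_records` is hypothetical, and the sketch assumes that every record occupies exactly four lines (no wrapped sequences, no blank lines).

```python
from itertools import islice


def iter_fastq_records(lines):
    """Yield (identifier, sequence, quality) tuples from an iterable of lines.

    Assumption: each record is exactly four lines, with no blank lines
    and no sequence or quality string wrapped over several lines.
    """
    it = iter(lines)
    while True:
        # Take the next four lines and drop their trailing newlines.
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if not record:
            return  # clean end of input
        if len(record) != 4:
            raise ValueError("truncated record: %r" % (record,))
        header, seq, plus, qual = record
        if not header.startswith("@"):
            raise ValueError("line 1 must start with '@': %r" % header)
        if not plus.startswith("+"):
            raise ValueError("line 3 must start with '+': %r" % plus)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch for %r" % header)
        yield header[1:], seq, qual
```

Because the records are identified by position rather than by their first character, a quality line that happens to begin with '@' or '+' is still read correctly.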
{'AAAACATCAGTATCCATCAGGATCAGTTTGGAAAGGGAGAGGCAATTTTTCCTAAACATGTGTTCAAATGGTCTGAGACAGACGTTAAAATGAAAAGGGG':'\YYWX\PX^YT[TVYaTY]^\^H`^a\UZU__TTbSbb^\a^^^[GOVVXLXMV[Y_^a^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB',
'CTGAGTAAATCATATACTCAATGATTTTTTTATGTGTGTGCATGTGTGCTGTTGATATTCTTCAGTACCAAAACCCATCATCTTATTTGCATAGGGAAGT':'fff^fd\c^d^Ycacdcdcdedeffdfedb]beeeeecd^ddccdddddfffeaeeeffdTecacaLV[QRPa\a`]aY]ZZ[XYcccYcZ\]Y ',
....}
I wrote the following code, but it doesn't give me what I want. Can anyone help me fix/improve my code?
class fastq(object):
    def __init__(self,filename):
        self.filename = filename
        self.__sequences = {}
    def parse_file(self):
        symbol=['@','+']
        """Stores both the sequence and the quality values for the sequence"""
        f = open(self.filename,'rU')
        for lines in self.filename:
            if symbol not in lines.startwith()
                data = f.readlines()
        return data
Answered on 2014-02-12 19:38:01
Here is a quick and compact way to do it:
def parse_file(self):
    with open(self.filename, 'r') as f:
        content = f.readlines()
    # Recreate content without lines that start with @ and +,
    # dropping the trailing newline from each line that is kept
    content = [line.rstrip('\n') for line in content if line[0] not in '@+']
    # Now the lines you want are alternating, so you can make a dict
    # from key/value pairs of lists content[0::2] and content[1::2]
    data = dict(zip(content[0::2], content[1::2]))
    return data
Answered on 2014-02-12 19:47:05
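One caveat about filtering on the first character: in the common Phred+33 quality encoding, both '@' and '+' are valid quality characters, so a quality line can itself begin with one of them and would then be dropped by the filter. The sketch below, written for Python 3, contrasts the prefix filter with selection by line position. The function names and the two-record sample are made up for illustration, and the positional variant assumes exactly four lines per record.

```python
def parse_by_prefix(lines):
    """Drop every line starting with '@' or '+', then pair up what is left."""
    content = [line.rstrip("\n") for line in lines if line[0] not in "@+"]
    return dict(zip(content[0::2], content[1::2]))


def parse_by_position(lines):
    """Take line 2 (sequence) and line 4 (quality) of each four-line record."""
    stripped = [line.rstrip("\n") for line in lines]
    return dict(zip(stripped[1::4], stripped[3::4]))


# Hypothetical sample: the first record's quality string starts with '@'.
sample = [
    "@r1\n", "ACGT\n", "+\n", "@III\n",
    "@r2\n", "TTGA\n", "+\n", "IIII\n",
]

print(parse_by_prefix(sample))    # {'ACGT': 'TTGA'}
print(parse_by_position(sample))  # {'ACGT': '@III', 'TTGA': 'IIII'}
```

With the prefix filter, the quality line "@III" is discarded, so the remaining lines shift and a sequence ends up paired with another sequence. Selecting by position avoids this as long as the file follows the strict four-line layout.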
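I don't think using the read as the key is a good idea: if two reads are exactly identical they collide, and only one quality string is kept. But if you want to do it this way: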
In [9]:
with open('temp.fastq') as f:
    lines=f.readlines()
head=[item[:-1] for item in lines[::4]] #get rid of '\n'
read=[item[:-1] for item in lines[1::4]]
qual=[item[:-1] for item in lines[3::4]]
dict(zip(read, qual))
Out[9]:
{'AAAACATCAGTATCCATCAGGATCAGTTTGGAAAGGGAGAGGCAATTTTTCCTAAACATGTGTTCAAATGGTCTGAGACAGACGTTAAAATGAAAAGGGG': '\\\\YYWX\\PX^YT[TVYaTY]^\\^H\\`^`a`\\UZU__TTbSbb^\\a^^^`[GOVVXLXMV[Y_^a^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB',
'CTGAGTAAATCATATACTCAATGATTTTTTTATGTGTGTGCATGTGTGCTGTTGATATTCTTCAGTACCAAAACCCATCATCTTATTTGCATAGGGAAGT': 'fff^fd\\c^d^Ycac`dcdcded`effdfedb]beeeeecd^ddccdddddfff`eaeeeffdTecacaLV[QRPa\\\\a\\`]aY]ZZ[XYcccYcZ\\\\]Y',
'CTGCCAGCACGCTGTCACCTCTCAATAACAGTGAGTGTAATGGCCATACTCTTGATTTGGTTTTTGCCTTATGAATCAGTGGCTAAAAATATTATTTAAT': 'deeee`bbcddddad\\bbbbeee\\ecYZcc^dd^ddd\\\\`]``L`ccabaVJ`MZ^aaYMbbb__PYWY]RWNUUab`Y`BBBBBBBBBBBBBBBBBBBB',
'TTAGAAACTATGGGATTATTCACTCCCTAGGTACTGAGAATGGAAACTTTCTTTGCCTTAATCGTTGACATCCCCTCTTTTAGGTTCTTGCTTCCTAACA': 'ee^e^\\`ad`eeee\\dd\\ddddYeebdd\\ddaYbdcYc`\\bac^YX[V^\\Ybb]]^bdbaZ]ZZ\\^K\\^]VPNME][`_``Ubb_bYddZbbbYbbYT^_'}
https://stackoverflow.com/questions/21737762