我正在尝试使用Python正则表达式从基因组数据库中提取基因组序列;我已经粘贴了下面数据库的一个片段。
>GSVIVT01031739001 pacid=17837850 polypeptide=GSVIVT01031739001 locus=GSVIVG01031739001 ID=GSVIVT01031739001.Genoscope12X annot-version=Genoscope.12X ATGAAAACGGAACTCTTTCTAGGTCATTTCCTCTTCAAACAAGAAAGAAGTAAAAGTTGCATACCAAATATGGACTCGAT TTGGAGTCGTAGTGCCCTGTCCACAGCTTCGGACTTCCTCACTGCAATCTACTTCGCCTTCATCTTCATCGTCGCCAGGT TTTTCTTGGACAGATTCATCTATCGAAGGTTGGCCATCTGGTTATTGAGCAAGGGAGCTGTTCCATTGAAGAAAAATGAT GCTACACTGGGAAAAATTGTAAAATGTTCGGAGTCTTTGTGGAAACTAACATACTATGCAACTGTTGAAGCATTCATTCT TGCTATTTCCTACCAAGAGCCATGGTTTAGAGATTCAAAGCAGTACTTTAGAGGGTGGCCAAATCAAGAGTTGACGCTTC CCCTCAAGCTTTTCTACATGTGCCAATGTGGGTTCTACATCTACAGCATTGCTGCCCTTCTTACATGGGAAACTCGCAGG AGGGATTTCTCTGTGATGATGTCTCATCATGTAGTCACTGTTATCCTAATTGGGTACTCATACATATCAAGTTTTGTCCG GATCGGCTCAGTTGTCCTTGCCCTGCACGATGCAAGTGATGTCTTCATGGAAGCTGCAAAAGTTTTTAAATATTCTGAGA AGGAGCTTGCAGCAAGTGTGTGCTTTGGATTTTTTGCCATCTCATGGCTTGTCCTACGGTTAATATTCTTTCCCTTTTGG GTTATCAGTGCATCAAGCTATGATATGCAAAATTGCATGAATCTATCGGAGGCCTATCCCATGTTGCTATACTATGTTTT CAATACAATGCTCTTGACACTACTTGTGTTCCATATATACTGGTGGATTCTTATATGCTCAATGATTATGAGACAGCTGA AAAATAGAGGACAAGTTGGAGAAGATATAAGATCTGATTCAGAGGACGATGAATAG
>GSVIVT01031740001 pacid=17837851 polypeptide=GSVIVT01031740001 locus=GSVIVG01031740001 ID=GSVIVT01031740001.Genoscope12X annot-version=Genoscope.12X ATGGGTATTACTACTTCCCTCTCATATCTTTTATTCTTCAACATCATCCTCCCAACCTTAACGGCTTCTCCAATACTGTT TCAGGGGTTCAATTGGGAATCATCCAAAAAGCAAGGAGGGTGGTACAACTTCCTCATCAACTCCATTCCTGAACTATCTG CCTCTGGAATCACTCATGTTTGGCTTCCTCCACCCTCTCAGTCTGCTGCATCTGAAGGGTACCTGCCAGGAAGGCTTTAT GATCTCAATGCATCCCACTATGGTACCCAATATGAACTAAAAGCATTGATAAAGGCATTTCGCAGCAATGGGATCCAGTG CATAGCAGACATAGTTATAAACCACAGGACTGCTGAGAAGAAAGATTCAAGAGGAATATGGGCCATCTTTGAAGGAGGAA CCCCAGATGATCGCCTTGACTGGGGTCCATCTTTTATCTGCAGTGATGACACTCTTTTTTCTGATGGCACAGGAAATCCT GATACTGGAGCAGGCTTCGATCCTGCTCCAGACATTGATCATGTAAACCCCCGGGTCCAGCGAGAGCTATCAGATTGGAT GAATTGGTTAAAGATTGAAATAGGCTTTGCTGGATGGCGATTCGATTTTGCTAGAGGATACTCCCCAGATTTTACCAAGT TGTATATGGAAAACACTTCGCCAAACTTTGCAGTAGGGGAAATATGGAATTCTCTTTCTTATGGAAATGACAGTAAGCCA AACTACAACCAAGATGCTCATCGGCGTGAGCTTGTGGACTGGGTGAAAGCTGCTGGAGGAGCAGTGACTGCATTTGATTT TACAACCAAAGGGATACTCCAAGCTGCAGTGGAAGGGGAATTGTGGAGGCTGAAGGACTCAAATGGAGGGCCTCCAGGAA TGATTGGCTTAATGCCTGAAAATGCTGTGACTTTCATAGATAATCATGACACAGGTTCTACACAAAAAATTTGGCCATTC CCATCAGACAAAGTCATGCAGGGATATGTTTATATCCTCACTCATCCTGGGATTCCATCCATATTCTATGACCACTTCTT TGACTGGGGTCTGAAGGAGGAGATTTCTAAGCTGATCAGTATCAGGACCAGGAACGGGATCAAACCCAACAGTGTGGTGC GTATTCTGGCATCTGACCCAGATCTTTATGTAGCTGCCATAGATGAGAAAATCATTGCTAAGATTGGACCAAGGTATGAT GTTGGGAACCTTGTACCTTCAACCTTCAAACTTGCCACCTCTGGCAACAATTATGCTGTGTGGGAGAAACAGTAA
>GSVIVT01031741001 pacid=17837852 polypeptide=GSVIVT01031741001 locus=GSVIVG01031741001 ID=GSVIVT01031741001.Genoscope12X annot-version=Genoscope.12X ATGTCCAAATTAACTTATTTATTATCTCGGTACATGCCAGGAAGGCTTTATGATCTGAATGCATCCAAATATGGCACCCA AGATGAACTGAAAACACTGATAAAGGTGTTTCACAGCAAGGGGGTCCAGTGCATAGCAGACATAGTTATAAACCACAGAA CTGCAGAGAAGCAAGACGCAAGAGGAATATGGCCATCTTTGAAGGAGGAACCCCAGATGATCGCCTTGACTGGACCCCAT CTTTCCTTTGCAAGGACGACACTCCTTATTCCGACGGCACCGGAAACCCTGATTCTGGAGATGACTACAGTGCCGCACCA GACATCGACCACATCAACCCACGGGTTCAGCAAGAGCTAA我要做的是获得GSVIV01031740001 (中间序列)的基因组(ACGT)序列,而不是其他序列。我当前的正则表达式是
sequence = re.compile('(?<=>GSVIVT01031740001) pacid=.*annot-version=.*\n[ACGT\n]*[^(?<!>GSVIVT01031740001) pacid]’)我的逻辑是找到具有正确生物体的genbank ID的头,给我那一行,然后转到新行,给我所有ACGT和新行,直到我得到一个具有不同genbank ID的生物体的头。这不会给出任何结果。
是的,我知道re.compile实际上并不执行搜索;我正在搜索一个作为“目标”打开的文件,所以我的执行看起来像这样。
>>> for nucl in target:
... if re.search(sequence, nucl):
... print(nucl)有人能告诉我我做错了什么吗,无论是在我的正则表达式中还是在一开始就使用了正则表达式?当我在regex101.com,上尝试时,它可以工作,但当我在Python解释器(2.7.1)中尝试它时,它失败了。
谢谢!
发布于 2015-03-19 03:32:15
如果我没理解错的话,你只想要给定基因座的基因组序列。所以你可以这样做(假设你的数据在一个文件中)。
lines = [line.split(' ') for line in open('results.txt') ]
somedict = {}
for each in lines:
locus = each[3].split('=')[-1]
seq = ''.join(each[6:])
somedict[locus] = seq
print somedict它输出一个字典,其中locus作为键,sequence作为值
{'GSVIVG01031741001': 'ATGTCCAAATTAACTTATTTATTATCTCGGTACATGCCAGGAAGGCTTTATGATCTGAATGCATCCAAATATGGCACCCAAGATGAACTGAAAACACTGATAAAGGTGTTTCACAGCAAGGGGGTCCAGTGCATAGCAGACATAGTTATAAACCACAGAACTGCAGAGAAGCAAGACGCAAGAGGAATATGGCCATCTTTGAAGGAGGAACCCCAGATGATCGCCTTGACTGGACCCCATCTTTCCTTTGCAAGGACGACACTCCTTATTCCGACGGCACCGGAAACCCTGATTCTGGAGATGACTACAGTGCCGCACCAGACATCGACCACATCAACCCACGGGTTCAGCAAGAGCTAA\n', 'GSVIVG01031740001': 'ATGGGTATTACTACTTCCCTCTCATATCTTTTATTCTTCAACATCATCCTCCCAACCTTAACGGCTTCTCCAATACTGTTTCAGGGGTTCAATTGGGAATCATCCAAAAAGCAAGGAGGGTGGTACAACTTCCTCATCAACTCCATTCCTGAACTATCTGCCTCTGGAATCACTCATGTTTGGCTTCCTCCACCCTCTCAGTCTGCTGCATCTGAAGGGTACCTGCCAGGAAGGCTTTATGATCTCAATGCATCCCACTATGGTACCCAATATGAACTAAAAGCATTGATAAAGGCATTTCGCAGCAATGGGATCCAGTGCATAGCAGACATAGTTATAAACCACAGGACTGCTGAGAAGAAAGATTCAAGAGGAATATGGGCCATCTTTGAAGGAGGAACCCCAGATGATCGCCTTGACTGGGGTCCATCTTTTATCTGCAGTGATGACACTCTTTTTTCTGATGGCACAGGAAATCCTGATACTGGAGCAGGCTTCGATCCTGCTCCAGACATTGATCATGTAAACCCCCGGGTCCAGCGAGAGCTATCAGATTGGATGAATTGGTTAAAGATTGAAATAGGCTTTGCTGGATGGCGATTCGATTTTGCTAGAGGATACTCCCCAGATTTTACCAAGTTGTATATGGAAAACACTTCGCCAAACTTTGCAGTAGGGGAAATATGGAATTCTCTTTCTTATGGAAATGACAGTAAGCCAAACTACAACCAAGATGCTCATCGGCGTGAGCTTGTGGACTGGGTGAAAGCTGCTGGAGGAGCAGTGACTGCATTTGATTTTACAACCAAAGGGATACTCCAAGCTGCAGTGGAAGGGGAATTGTGGAGGCTGAAGGACTCAAATGGAGGGCCTCCAGGAATGATTGGCTTAATGCCTGAAAATGCTGTGACTTTCATAGATAATCATGACACAGGTTCTACACAAAAAATTTGGCCATTCCCATCAGACAAAGTCATGCAGGGATATGTTTATATCCTCACTCATCCTGGGATTCCATCCATATTCTATGACCACTTCTTTGACTGGGGTCTGAAGGAGGAGATTTCTAAGCTGATCAGTATCAGGACCAGGAACGGGATCAAACCCAACAGTGTGGTGCGTATTCTGGCATCTGACCCAGATCTTTATGTAGCTGCCATAGATGAGAAAATCATTGCTAAGATTGGACCAAGGTATGATGTTGGGAACCTTGTACCTTCAACCTTCAAACTTGCCACCTCTGGCAACAATTATGCTGTGTGGGAGAAACAGTAA\n', 'GSVIVG01031739001': 'ATGAAAACGGAACTCTTTCTAGGTCATTTCCTCTTCAAACAAGAAAGAAGTAAAAGTTGCATACCAAATATGGACTCGATTTGGAGTCGTAGTGCCCTGTCCACAGCTTCGGACTTCCTCACTGCAATCTACTTCGCCTTCATCTTCATCGTCGCCAGGTTTTTCTTGGACAGATTCATCTATCGAAGGTTGGCCATCTGGTTATTGAGCAAGGGAGCTGTTCCATTGAAGAAAAATGATGCTACACTGGGAAAAATTGTAAAATGTTCGGAGTCTTTGTGGAAACTAACATACTATGCAACTGTTGAAGCATTCATTCTTGCTATTTCCTACCAAGAGCCATGGTTTAGAGATTCAAAGCAGTACTTTAGAGGGTGGCCAAATCAAGAGTTGACGCTTCCCCTCAAGCTTTTCTACATGTGCCAATGTGGGTTCTACATCTACAGCATTGCTGCCCTTCTTACATGGGAAACTCGCAGGAGGGATTTCTCTGTGATGATGTCTCATCATGTAGTCACTGTTATCCTAATTGGGTACTCATACATATCAAGTTTTGTCCGGATCGGCTCAGTTGTCCTTGCCCTGCACGATGCAAGTGATGTCTTCATGGAAGCTGCAAAAGTTTTTAAATATTCTGAGAAGGAGCTTGCAGCAAGTGTGTGCTTTGGATTTTTTGCCATCTCATGGCTTGTCCTACGGTTAATATTCTTTCCCTTTTGGGTTATCAGTGCATCAAGCTATGATATGCAAAATTGCATGAATCTATCGGAGGCCTATCCCATGTTGCTATACTATGTTTTCAATACAATGCTCTTGACACTACTTGTGTTCCATATATACTGGTGGATTCTTATATGCTCAATGATTATGAGACAGCTGAAAAATAGAGGACAAGTTGGAGAAGATATAAGATCTGATTCAGAGGACGATGAATAG\n'}https://stackoverflow.com/questions/29130813
复制相似问题