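I am trying to use regular expressions to capture protein names and their corresponding amino acid sequences from some data. Below is a stripped-down version of my code: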
import re
line=">sp|A0A385XJ53|INSA9_ECOLI Insertion element OS=Escherichia coli (strain K12) PE=3 SV=1 MASVSISCPSCSATDGVVRNGKSTAGHQRYLCSHCRKTWQLQFTYTASQPGTHQKIIDMA >sp|A0A385XJE6|INH21_ECOLI Transposase InsH for insertion sequence element OS=Escherichia coli (strain K12) PE=3 SV=1 MFVIWSHRTGFIMSHQLTFADSEFSSKRRQTRKEIFLSRMEQILPWQNMVEVIEPFYPKA >sp|A0A385XJL4|INSB9_ECOLI Insertion element IS1 9 protein OS=Escherichia coli (strain K12) PE=3 SV=2 MPGNSPHYGRWPQHDFTSLKKLRPQSVTSRIQPGSDVIVCAEMDEQWGYVGAKSRQRWLF"
result1=re.findall(r'SV=\d\s([A-Z]+)', line)
result2=re.findall(r'>sp\|(\w+)\|', line)
result3=re.findall(r'>sp\|(\w+)\|\.\SV=\d\s([A-Z]+)', line)
for item1 in result1:
    print(item1)
for item2 in result2:
    print(item2)
for item3 in result3:
    print(item3)
result1 outputs:
MASVSISCPSCSATDGVVRNGKSTAGHQRYLCSHCRKTWQLQFTYTASQPGTHQKIIDMA
MFVIWSHRTGFIMSHQLTFADSEFSSKRRQTRKEIFLSRMEQILPWQNMVEVIEPFYPKA
MPGNSPHYGRWPQHDFTSLKKLRPQSVTSRIQPGSDVIVCAEMDEQWGYVGAKSRQRWLF
and result2 outputs:
A0A385XJ53
A0A385XJE6
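A0A385XJL4
However, result3 outputs nothing. I was under the impression that "." could be used to stand for a sequence of unspecified characters in a regular expression. What syntax can I use for a run of unspecified characters with no set length? Essentially, I want Python to look for a match to >sp\|(\w+)\| and continue until it finds SV=\d\s([A-Z]+), at which point it resets to looking for the next match of >sp\|(\w+)\|. How can I accomplish this? I would like it to output something like: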
A0A385XJ53 MASVSISCPSCSATDGVVRNGKSTAGHQRYLCSHCRKTWQLQFTYTASQPGTHQKIIDMA
A0A385XJE6 MFVIWSHRTGFIMSHQLTFADSEFSSKRRQTRKEIFLSRMEQILPWQNMVEVIEPFYPKA
A0A385XJL4 MPGNSPHYGRWPQHDFTSLKKLRPQSVTSRIQPGSDVIVCAEMDEQWGYVGAKSRQRWLF我尝试过一些不同的东西,我想也许我不理解".“的用法。由于我的代码已将所有蛋白质转换为单个字符串,因此我认为可以使用"\b+“或"\b*”来代替它,因为没有新的行。我得到了下面的两个错误代码。
error Traceback (most recent call last)
<ipython-input-76-f43b57fdde31> in <module>()
8 result1=re.findall(r'SV=\d\s([A-Z]+)', line)
9 result2=re.findall(r'>sp\|(\w+)\|', line)
---> 10 result3=re.findall(r'>sp\|(\w+)\|\b*\SV=\d\s([A-Z]+)', line)
11 for item1 in result1:
12 print(item1)
~\OneDrive\Documents\Python stuff\Pythonstuff\lib\re.py in findall(pattern, string, flags)
220
221 Empty matches are included in the result."""
--> 222 return _compile(pattern, flags).findall(string)
223
224 def finditer(pattern, string, flags=0):
~\OneDrive\Documents\Python stuff\Pythonstuff\lib\re.py in _compile(pattern, flags)
299 if not sre_compile.isstring(pattern):
300 raise TypeError("first argument must be string or compiled pattern")
--> 301 p = sre_compile.compile(pattern, flags)
302 if not (flags & DEBUG):
303 if len(_cache) >= _MAXCACHE:
~\OneDrive\Documents\Python stuff\Pythonstuff\lib\sre_compile.py in compile(p, flags)
560 if isstring(p):
561 pattern = p
--> 562 p = sre_parse.parse(p, flags)
563 else:
564 pattern = None
~\OneDrive\Documents\Python stuff\Pythonstuff\lib\sre_parse.py in parse(str, flags, pattern)
853
854 try:
--> 855 p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
856 except Verbose:
857 # the VERBOSE flag was switched on inside the pattern. to be
~\OneDrive\Documents\Python stuff\Pythonstuff\lib\sre_parse.py in _parse_sub(source, state, verbose, nested)
414 while True:
415 itemsappend(_parse(source, state, verbose, nested + 1,
--> 416 not nested and not items))
417 if not sourcematch("|"):
418 break
~\OneDrive\Documents\Python stuff\Pythonstuff\lib\sre_parse.py in _parse(source, state, verbose, nested, first)
614 if not item or (_len(item) == 1 and item[0][0] is AT):
615 raise source.error("nothing to repeat",
--> 616 source.tell() - here + len(this))
617 if item[0][0] in _REPEATCODES:
618 raise source.error("multiple repeat",
error: nothing to repeat at position 14
Posted on 2019-12-04 03:23:26
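In the third pattern you escape the S as \S, which means "match one non-whitespace character" rather than matching S literally. (It does still match the S itself, since S is a non-whitespace character.)
When you escape the dot as \., it matches a literal dot, which does not occur in the example data.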
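To make the two escapes concrete, here is a small sketch using made-up strings (the sample inputs are illustrative assumptions, not part of the question's data). It also shows why "\b*" fails: "\b" is a zero-width word-boundary assertion, so the parser has nothing to repeat. The exact error text may differ between Python versions.

```python
import re

# "\." matches only a literal dot; "." matches any character except a newline.
literal_dot = re.findall(r"a\.c", "abc a.c")
any_char = re.findall(r"a.c", "abc a.c")

# "\S" matches any single non-whitespace character, which includes "S" itself.
non_space = re.findall(r"\SV=", "SV=1 xV=2  V=3")

# "\b" consumes no characters, so quantifying it is rejected at compile time.
try:
    re.compile(r"\|\b*SV")
    boundary_error = None
except re.error as exc:
    boundary_error = str(exc)

print(literal_dot, any_char, non_space, boundary_error)
```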
Reading this part of the question: "I essentially want python to look for a match to >sp\|(\w+)\| and continue until it finds SV=\d\s([A-Z]+). At which point, it will reset to looking for >sp\|(\w+)\|'s match."
I think you want a non-greedy dot, .+?, to match whatever lies between the two patterns; the + makes it match at least one character.
>sp\|(\w+)\|.+?SV=\d\s([A-Z]+)
https://stackoverflow.com/questions/59163769
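As a minimal sketch of that pattern in use, the snippet below runs it over the first two records from the question. The variable names pattern, pairs, accession and sequence are illustrative choices, not from the original code.

```python
import re

# First two records from the question, joined into one string as in the original.
line = (
    ">sp|A0A385XJ53|INSA9_ECOLI Insertion element "
    "OS=Escherichia coli (strain K12) PE=3 SV=1 "
    "MASVSISCPSCSATDGVVRNGKSTAGHQRYLCSHCRKTWQLQFTYTASQPGTHQKIIDMA "
    ">sp|A0A385XJE6|INH21_ECOLI Transposase InsH for insertion sequence element "
    "OS=Escherichia coli (strain K12) PE=3 SV=1 "
    "MFVIWSHRTGFIMSHQLTFADSEFSSKRRQTRKEIFLSRMEQILPWQNMVEVIEPFYPKA"
)

# ".+?" lazily consumes the header text up to the nearest "SV=<digit>".
pattern = re.compile(r">sp\|(\w+)\|.+?SV=\d\s([A-Z]+)")

# With two capture groups, findall returns a list of (accession, sequence) tuples.
pairs = pattern.findall(line)
for accession, sequence in pairs:
    print(accession, sequence)
```

Because the dot does not match a newline by default, this relies on the records sitting on one line, as they do in the question. Note also that SV=\d accepts a single digit only; a multi-digit version number would need \d+.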