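Question:
I am looking for a way to match specific identifiers on a given line that starts with a specific word. An ID consists of characters, possibly followed by digits, followed by a dash, and then more digits. An ID should only be matched on lines whose starting word is one of the following: Closes, Fixes, Resolves. If a line contains more than one ID, the IDs are separated by the string " and ". Any number of IDs can appear on a line.
Example test strings: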
Closes PD-1 # Match: PD-1
Related to PD-2 # No match, line doesn't start with an allowed word
Closes
NPD-1 # No match, as the identifier is in a new line
Fixes PD-21 and PD-22 # Match: PD-21, PD-22
Closes PD-31, also PD-32 and PD-33 # Match: PD-31 - the rest is not captured because of ", also"
Resolves PD4-41 and PD4-42 and PD4-43 and PD4-44 # Match: PD4-41, PD4-42, PD4-43, PD4-44
Resolves something related to N-2 # No match, the identifier is not directly after 'Resolves'
What I have tried:
When I use a regex to get all the matches, I always fall short in some respect. For example, one of the regexes I tried is this:
^(?:Closes|Fixes|Resolves) (\w+-\d+)(?:(?: and )(\w+-\d+))*
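The line has to start with one of the allowed words: ^(?:Closes|Fixes|Resolves). The first ID is captured by (\w+-\d+). Any further IDs are separated by " and ", but I only want to capture the IDs here, not the separator: (?:(?: and )(\w+-\d+))*. The result of this regex in Python: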
import re

test_string = """
Closes PD-1 # Match: PD-1
Related to PD-2 # No match, line doesn't start with an allowed word
Closes
NPD-1 # No match, as the identifier is in a new line
Fixes PD-21 and PD-22 # Match: PD-21, PD-22
Closes PD-31, also PD-32 and PD-33 # Match: PD-31 - the rest is not captured because of ", also"
Resolves PD4-41 and PD4-42 and PD4-43 and PD4-44 # Match: PD4-41, PD4-42, PD4-43, PD4-44
Resolves something related to N-2 # No match, the identifier is not directly after 'Resolves'
"""
ids = []
# re.findall returns one tuple of groups per match; keep only the non-empty groups.
for match in re.findall(r"^(?:Closes|Fixes|Resolves) (\w+-\d+)(?:(?: and )(\w+-\d+))*", test_string, re.M):
    for group in match:
        if group:
            ids.append(group)
print(ids)
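['PD-1', 'PD-21', 'PD-22', 'PD-31', 'PD4-41', 'PD4-44']
The result on regex101.com is the same. If more than one ID follows the initial ID, unfortunately only the last of them is captured, not all of them. I have read that a repeated capture group only captures the last iteration, and that I should put a capture group around the repeated group to capture all iterations, but I could not get that to work.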
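Summary
Is there a regex solution, similar to the one I tried, that captures every occurrence of an ID? Or is there a better way to parse the IDs out of the string using Python?
Answered on 2019-11-15 16:10:20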
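You could use a single capture group: inside that group, match the first ID and then repeat the same pattern zero or more times, each repetition preceded by a space, the word and, and another space.
The values are in group 1.
To get the separate values, split group 1 on " and ".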
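A minimal sketch of how this could look in Python, using this answer's pattern. The helper name extract_ids and the shortened sample text (the question's test strings without the trailing comments) are assumptions for illustration, not part of the original answer.

```python
import re

# Pattern from this answer: a single capture group that holds the whole
# "ID and ID and ..." run directly after the allowed starting word.
PATTERN = re.compile(
    r"^(?:Closes|Fixes|Resolves) (\w+-\d+(?: and \w+-\d+)*)",
    re.MULTILINE,
)


def extract_ids(text):
    """Return all IDs found on qualifying lines (hypothetical helper)."""
    ids = []
    # With exactly one capture group, findall yields the group-1 strings.
    for run in PATTERN.findall(text):
        # Split the captured run on the separator to get the individual IDs.
        ids.extend(run.split(" and "))
    return ids


sample = """Closes PD-1
Related to PD-2
Closes
NPD-1
Fixes PD-21 and PD-22
Closes PD-31, also PD-32 and PD-33
Resolves PD4-41 and PD4-42 and PD4-43 and PD4-44
Resolves something related to N-2
"""

print(extract_ids(sample))
# ['PD-1', 'PD-21', 'PD-22', 'PD-31', 'PD4-41', 'PD4-42', 'PD4-43', 'PD4-44']
```

Because the repetition sits inside the capture group rather than around it, the standard re module keeps the whole run, which sidesteps the "only the last iteration is captured" behaviour described in the question.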
^(?:Closes|Fixes|Resolves) (\w+-\d+(?: and \w+-\d+)*)
Answered on 2019-11-15 16:34:44
It may be a little easier to use a two-stage approach, for example:
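import re

def get_matches(test):  # assume test is a list of strings, one per line
    # Stage 1: the line must start with an allowed word directly followed by an ID.
    regex1 = re.compile(r'^(?:Closes|Fixes|Resolves) \w+-\d+')
    # Stage 2: collect every ID on a qualifying line (this includes IDs after other text such as ", also").
    regex2 = re.compile(r'\w+-\d+')
    results = []
    for line in test:
        if regex1.search(line):
            results.extend(regex2.findall(line))
    return results
This gives: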
['PD-1','PD-21','PD-22','PD-31','PD-32',
 'PD-33','PD4-41','PD4-42','PD4-43','PD4-44']
Answered on 2019-11-15 17:46:01
If you need to use repeated capture groups, install the PyPI regex module with pip install regex and use
import regex

test_string = "your string here"
ids = []
for match in regex.finditer(r"^(?:Closes|Fixes|Resolves) (?P<id>\w+-\d+)(?:(?: and )(?P<id>\w+-\d+))*", test_string, regex.M):
    # captures() returns every value the named group matched, not just the last one.
    ids.extend(match.captures("id"))
print(ids)
# With the test string from the question:
# => ['PD-1', 'PD-21', 'PD-22', 'PD-31', 'PD4-41', 'PD4-42', 'PD4-43', 'PD4-44']
The capture stack of each group can be accessed via match.captures(X).
The regex you have can be used as is, but the version here uses a named capture group, which is more user-friendly.
https://stackoverflow.com/questions/58880341