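Suppose I have a string in a text file with a tab between its left and right parts:

The dreams of REM (Geo) sleep The sleep paralysis

I want to match the string above against each line of another file, matching both the left part and the right part:

The pons also contains the sleep paralysis center of the brain as well as generating the dreams of REM sleep.

If the full string cannot be matched, then try to match a substring.

I want to search with the leftmost and rightmost patterns. For example (leftmost case):

The dreams of REM sleep paralysis
The dreams of REM sleep The sleep

For example (rightmost case):

REM sleep The sleep paralysis
The dreams of The sleep paralysis

Thanks again for your help.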
Posted on 2011-07-05 06:57:45
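(OK, you have clarified most of what you want. Let me restate it, then point out the few things that are still unclear, which I list below. Also, take the starter code I show you, adapt it, and post the results back to us.)

You want a line-by-line, case-insensitive search for each of a pair of match patterns, looking for the longest contiguous match. All the patterns appear to be disjoint (a line cannot match both patternX and patternY, because they use different phrases; for example, it cannot match both 'frontal lobe' and 'prefrontal cortex').

Your patterns are supplied as a sequence of pairs ('dom', 'rang'); let's refer to them just by their subscripts (you can get them with string.split('\t')). Importantly, a matching line must match both the dom and the rang pattern (fully or partially). Order is independent, so we can match rang then dom or vice versa => use 2 separate regexes per line, and test for both a d match and an r match.

Patterns have optional parts, in parentheses => so just write or convert them to regex syntax using the (optionaltext)? form, e.g.: re.compile('Frontallobes of (leftside )?the brain', re.IGNORECASE). Keep the separating space inside the optional group; otherwise the regex demands two consecutive spaces whenever the optional text is absent.

The return value should be the string buffer holding the longest substring match so far.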
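The conversion described above (split the pair on the tab, turn each parenthesised word into an optional regex group, compile case-insensitively) could be sketched roughly as follows. The helper names `phrase_to_regex` and `parse_pair` are hypothetical, and the sketch assumes that optional parts are single whitespace-delimited words written as `(word)`; it is one possible reading of the requirements, not a definitive implementation.

```python
import re

def phrase_to_regex(phrase):
    """Compile a phrase such as 'dreams of REM (Geo) sleep' into a
    case-insensitive regex in which each '(word)' is optional.
    Assumption: optional parts are single words wrapped in parentheses."""
    pieces = []
    seen_required = False
    for word in phrase.split():
        optional = word.startswith('(') and word.endswith(')')
        text = re.escape(word[1:-1] if optional else word)
        if optional and not seen_required:
            # optional word before the first required word: owns the space after it
            pieces.append('(?:' + text + r'\s+)?')
        elif optional:
            # optional word later in the phrase: owns the space before it
            pieces.append(r'(?:\s+' + text + ')?')
        elif not seen_required:
            pieces.append(text)
            seen_required = True
        else:
            pieces.append(r'\s+' + text)
    return re.compile(''.join(pieces), re.IGNORECASE)

def parse_pair(pattern_line):
    """Split one tab-separated pattern line into (dom_regex, rang_regex)."""
    dom, rang = pattern_line.rstrip('\n').split('\t', 1)
    return phrase_to_regex(dom), phrase_to_regex(rang)

d, r = parse_pair("The dreams of REM (Geo) sleep\tThe sleep paralysis\n")
line = ("The pons also contains the sleep paralysis center of the brain "
        "as well as generating the dreams of REM sleep.")
print(d.search(line).group(0))  # the dreams of REM sleep
print(r.search(line).group(0))  # the sleep paralysis
```

Non-capturing groups `(?:...)` are used here so that `search()` and `findall()` report only the matched text rather than tuples of sub-groups.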
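Now there are a few things that need clarifying; please edit your question to explain the following:

Each of the questions above affects the solution, so you need to answer them for us. There is no point writing a lot of code to solve the most general case when you only need something simple. In general, this is called 'NLP' (natural language processing). You may end up using an NLP library.

So far, the general structure of the code looks something like this: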
import re

# Normally you would read the input directly from a file, but this lets us test:
text_lines = """The pons also contains the sleep paralysis center of the brain as well as generating the dreams of REM sleep.
The optic tract is a part of the visual system in the brain.
The inferior frontal gyrus is a gyrus of the frontal lobe of the human brain.
The prefrontal cortex (PFC) is the anterior part of the frontallobes of the brain, lying in front of the motor and premotor areas.
There are three possible ways to define the prefrontal cortex as the granular frontal cortex as that part of the frontal cortex whose electrical stimulation does not evoke movements.
This allowed the establishment of homologies despite the lack of a granular frontal cortex in nonprimates.
Modern tracing studies have shown that projections of the mediodorsal nucleus of the thalamus are not restricted to the granular frontal cortex in primates.
""".split('\n')

# Note: '(Geo)? sleep' needs two spaces when 'Geo' is absent, so the first
# input line does not fully match this pattern as written.
patterns = [
    ('(dreams of REM (Geo)? sleep)', '(sleep paralysis)'),
    ('(frontal lobe)', '(inferior frontal gyrus)'),
    ('(prefrontal cortex)', '(frontallobes of (leftside )?(the )?brain)'),
    ('(modern tract)', '(probably mediodorsal nucleus)'),
]

# Compile the patterns as case-insensitive regexes
patterns = [(re.compile(dstr, re.IGNORECASE), re.compile(rstr, re.IGNORECASE))
            for (dstr, rstr) in patterns]

def longest(t):
    """Get the longest item from a sequence of matches (first one on ties)."""
    return max(t, key=len)

def custommatch(line):
    for (d, r) in patterns:
        # If we get a full match to both (d, r), return it immediately...
        (dm, rm) = (d.findall(line), r.findall(line))
        # Slight design problem: with nested groups, findall() returns tuples
        # like [('frontallobes of the brain', '', 'the ')]
        # ...so return the longest match for each of dm, rm
        if dm and rm:  # must match both dom & rang
            return [longest(dm), longest(rm)]
        # else score any partial matches to (d, r) - how exactly?
        # TBD...
    else:
        # We got here because we only have partial matches (or none)
        # TBD: return the 'highest-scoring' partial match
        return 'TBD... partial match'

for line in text_lines:
    print(custommatch(line))

Running it on the 7 lines of input you supplied currently gives the following (the eighth result comes from the empty string that split('\n') leaves after the final newline):
TBD... partial match
TBD... partial match
['frontal lobe', 'inferior frontal gyrus']
['prefrontal cortex', ('frontallobes of the brain', '', 'the ')]
TBD... partial match
TBD... partial match
TBD... partial match
TBD... partial match

https://stackoverflow.com/questions/6576181
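The branch marked TBD in custommatch leaves open how a partial match should be scored. One simple possibility, offered only as a sketch, is to look for the longest contiguous run of words from each side of the pattern that also occurs in the line, and to score a pair by the total number of words matched. The function names `longest_partial` and `score_pair` are hypothetical, the word-count score is an assumption rather than something the question specifies, and the phrases are assumed to be plain words with any optional parenthesised parts already removed.

```python
import re

def longest_partial(phrase, line):
    """Return the longest contiguous run of words from `phrase` that also
    occurs in `line` (case-insensitive), or '' if no word matches.
    Assumption: `phrase` consists of plain words, no regex syntax."""
    words = phrase.split()
    # Try the longest runs first, so the first hit is the longest one.
    for size in range(len(words), 0, -1):
        for start in range(len(words) - size + 1):
            run = words[start:start + size]
            pattern = r'\b' + r'\s+'.join(re.escape(w) for w in run) + r'\b'
            m = re.search(pattern, line, re.IGNORECASE)
            if m:
                return m.group(0)
    return ''

def score_pair(dom, rang, line):
    """Score a (dom, rang) pair against a line by total matched words.
    Both sides must match at least one word, otherwise the score is 0."""
    d = longest_partial(dom, line)
    r = longest_partial(rang, line)
    score = len(d.split()) + len(r.split()) if d and r else 0
    return (score, d, r)

line = ("The pons also contains the sleep paralysis center of the brain "
        "as well as generating the dreams of REM sleep.")
print(score_pair("The dreams of REM Geo sleep", "The sleep paralysis", line))
# (7, 'the dreams of REM', 'the sleep paralysis')
```

Picking the pair with the highest score across all patterns would then give a candidate for the 'highest-scoring' partial match. Whether single common words such as "the" should count at all is one of the open questions that affects the design.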