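I have a Python list of about 700 terms that I want to use as metadata for some database entries in Django. I want to match the terms in the list against the entry descriptions to see whether any of them occur, but there are a few problems. The first is that the list contains multi-word terms that include words from other list entries. Here is an example: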
Intrusion
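Intrusion Detection

I have not gotten very far with re.findall, because it matches both Intrusion and Intrusion Detection in the example above. I only want to match Intrusion Detection, not Intrusion.

Is there a better way to do this type of matching? I thought about trying NLTK, but it does not look like it helps with this type of matching.

Edit:

To make this clearer: I have a list of 700 terms, such as firewall or intrusion detection. I want to match the terms in the list against the descriptions I have stored in the database, and use any terms that match as metadata. So if I have the following string:

There are many types of intrusion detection devices in production today.

and a list containing the following terms: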
Intrusion
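Intrusion Detection

I want to match "Intrusion Detection" but not "Intrusion". Ideally I would also like to match singular and plural forms, but I may be getting ahead of myself. The idea behind all of this is to collect all the matches in a list and then process them.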
Posted on 2011-05-24 16:26:51
If you need more flexibility when matching the entry descriptions, you can combine nltk and re:
from nltk.stem import PorterStemmer
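import re

Suppose you have different descriptions of the same event, for example a rewrite of the system. You can use nltk.stem to capture the variants of "rewrite", singular and plural forms, and so on.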
master_list = [
    'There are many types of intrusion detection devices in production today.',
    'The CTO approved a rewrite of the system',
    'The CTO is about to approve a complete rewrite of the system',
    'The CTO approved a rewriting',
    'Breaching of Firewalls'
]
terms = [
    'Intrusion Detection',
    'Approved rewrite',
    'Firewall'
]
stemmer = PorterStemmer()
# for each term, split it into words (could be just one word) and stem each word
stemmed_terms = ((stemmer.stem(word) for word in s.split()) for s in terms)
# add 'match anything after it' expression to each of the stemmed words
# join result into a pattern string
regex_patterns = [''.join(stem + '.*' for stem in term) for term in stemmed_terms]
print(regex_patterns)
print('')
for sentence in master_list:
    # search each sentence with every pattern; non-matching patterns yield None
    match_obs = (re.search(pattern, sentence, flags=re.IGNORECASE) for pattern in regex_patterns)
    matches = [m.group(0) for m in match_obs if m]
    print(matches)

Output:
['Intrus.*Detect.*', 'Approv.*rewrit.*', 'Firewal.*']
['intrusion detection devices in production today.']
['approved a rewrite of the system']
['approve a complete rewrite of the system']
['approved a rewriting']
['Firewalls']

Edit:

To see which of the terms caused the match:
for sentence in master_list:
    # regex_patterns maps directly onto terms (strictly speaking it's one-to-one and onto)
    for term, pattern in zip(terms, regex_patterns):
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            # process term (put it in the db)
            print('TERM: {0} FOUND IN: {1}'.format(term, sentence))

Output:
TERM: Intrusion Detection FOUND IN: There are many types of intrusion detection devices in production today.
TERM: Approved rewrite FOUND IN: The CTO approved a rewrite of the system
TERM: Approved rewrite FOUND IN: The CTO is about to approve a complete rewrite of the system
TERM: Approved rewrite FOUND IN: The CTO approved a rewriting
TERM: Firewall FOUND IN: Breaching of Firewalls

https://stackoverflow.com/questions/6104576
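
The stemmed patterns above do not by themselves prevent Intrusion from matching when both Intrusion and Intrusion Detection are in the term list. One possible way to handle that with re alone is sketched below, assuming that simple "s"/"es" plurals are enough in place of stemming. The helper names build_term_regex and find_terms are hypothetical, not part of the original answer. The idea is to put longer terms first in a single alternation, so the longer term is tried before any shorter term it contains.

```python
import re


def build_term_regex(terms):
    """Compile one alternation that tries longer terms before shorter ones."""
    # Longest first: at a given position the regex engine takes the first
    # alternative that matches, so 'Intrusion Detection' wins over 'Intrusion'.
    ordered = sorted(terms, key=len, reverse=True)
    # Escape each word, allow any whitespace between words, and allow a simple
    # 's'/'es' plural after the last word (an assumed stand-in for stemming).
    alternatives = [
        r"\s+".join(re.escape(word) for word in term.split()) + r"(?:es|s)?"
        for term in ordered
    ]
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def find_terms(description, terms):
    """Return the list terms found in description, in order of appearance."""
    # Map a normalized (lowercase, single-spaced) form back to the listed term.
    canonical = {" ".join(t.lower().split()): t for t in terms}
    found = []
    for match in build_term_regex(terms).finditer(description):
        key = " ".join(match.group(0).lower().split())
        # Try the text as matched, then with a plural suffix stripped.
        for suffix in ("", "s", "es"):
            candidate = key[: len(key) - len(suffix)] if suffix else key
            if key.endswith(suffix) and candidate in canonical:
                found.append(canonical[candidate])
                break
    return found


terms = ["Intrusion", "Intrusion Detection", "Firewall"]
print(find_terms("There are many types of intrusion detection devices in production today.", terms))
# ['Intrusion Detection']
print(find_terms("Breaching of Firewalls after an intrusion", terms))
# ['Firewall', 'Intrusion']
```

Because finditer returns non-overlapping matches, the text consumed by Intrusion Detection is not scanned again for Intrusion. The returned list can then be processed or stored as metadata. With 700 terms a single compiled alternation should still be workable, but that has not been measured here.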