Q: Best way to match a large list against strings in Python

Stack Overflow user

Asked 2011-05-24 08:33:48

1 answer, 1.6K views, 0 followers, 2 votes

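I have a Python list of about 700 terms that I would like to use as metadata for some database entries in Django. I want to match the terms in the list against the entry descriptions to see whether any of them occur, but there are a few issues. The first is that the list contains some multi-word terms that include words from other list entries. For example: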

Intrusion
Intrusion Detection

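I'm not getting very far with re.findall, because it matches both Intrusion and Intrusion Detection in the example above. I only want to match Intrusion Detection, not Intrusion.

Is there a better way to do this kind of matching? I thought about trying NLTK, but it doesn't look like it helps with this type of matching.

Edit:

So, to be clearer: I have a list of 700 terms such as Firewall or Intrusion Detection. I want to match these terms against the descriptions stored in my database to see whether any of them occur, and then use the matching terms as metadata. So if I have the following string: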

There are many types of intrusion detection devices in production today.

and a list containing the following terms:

Intrusion
Intrusion Detection

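I want to match "Intrusion Detection" but not "Intrusion". Ideally I would also like to match singular/plural forms, but I may be getting ahead of myself. The idea behind all of this is to put all the matches in a list and then process them.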

python

list

pattern-matching

1 Answer

Stack Overflow user

Accepted answer

Posted 2011-05-24 16:26:51

If you need more flexibility in matching entry descriptions, you can combine nltk and re:

from nltk.stem import PorterStemmer
import re

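Say you have different descriptions of the same event, e.g. a rewrite of the system. You can use nltk.stem to capture rewrite, rewriting, singular and plural forms, and so on.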

master_list = [
    'There are many types of intrusion detection devices in production today.',
    'The CTO approved a rewrite of the system',
    'The CTO is about to approve a complete rewrite of the system',
    'The CTO approved a rewriting',
    'Breaching of Firewalls'
]

terms = [
    'Intrusion Detection',
    'Approved rewrite',
    'Firewall'
]

stemmer = PorterStemmer()

# for each term, split it into words (could be just one word) and stem each word
stemmed_terms = ((stemmer.stem(word) for word in s.split()) for s in terms)

# add 'match anything after it' expression to each of the stemmed words
# join result into a pattern string
regex_patterns = [''.join(stem + '.*' for stem in term) for term in stemmed_terms]
print(regex_patterns)
print('')

for sentence in master_list:
    match_obs = (re.search(pattern, sentence, flags=re.IGNORECASE) for pattern in regex_patterns)
    matches = [m.group(0) for m in match_obs if m]
    print(matches)

Output:

['Intrus.*Detect.*', 'Approv.*rewrit.*', 'Firewal.*']

['intrusion detection devices in production today.']
['approved a rewrite of the system']
['approve a complete rewrite of the system']
['approved a rewriting']
['Firewalls']
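The patterns above do not by themselves settle the question's original concern, which was preferring "Intrusion Detection" over "Intrusion" when both are in the term list. One possible approach, sketched here with the standard library only, is a single alternation that lists longer terms first and is anchored on word boundaries. The helper names `build_term_regex` and `find_terms` are hypothetical, and the `(?:es|s)?` suffix is a crude plural allowance assumed for illustration, not a substitute for stemming.

```python
import re

def build_term_regex(terms):
    """Compile one alternation that tries longer terms before shorter ones."""
    # Longest first, so 'Intrusion Detection' is tried before 'Intrusion'
    ordered = sorted(terms, key=len, reverse=True)
    # Escape each word and allow any run of whitespace between words
    alternatives = [r'\s+'.join(re.escape(w) for w in t.split()) for t in ordered]
    # Word boundaries stop matches inside longer words; (?:es|s)? is a rough plural
    return re.compile(r'\b(' + '|'.join(alternatives) + r')(?:es|s)?\b', re.IGNORECASE)

def find_terms(terms, text):
    """Return the canonical terms found in text, in order of first appearance."""
    # Map a normalized (lowercase, single-spaced) form back to the listed term
    lookup = {' '.join(t.lower().split()): t for t in terms}
    found = []
    for m in build_term_regex(terms).finditer(text):
        term = lookup[' '.join(m.group(1).lower().split())]
        if term not in found:
            found.append(term)
    return found

terms = ['Intrusion', 'Intrusion Detection', 'Firewall']
sentence = 'There are many types of intrusion detection devices in production today.'
print(find_terms(terms, sentence))
```

Because the regex engine takes the first alternative that matches at a given position and matches do not overlap, the shorter term is only reported where the longer one does not apply.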

Edit:

To see which of the terms caused the match:

for sentence in master_list:
    # regex_patterns maps directly onto terms (strictly speaking it's one-to-one and onto)
    for term, pattern in zip(terms, regex_patterns):
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            # process term (put it in the db)
            print('TERM: {0} FOUND IN: {1}'.format(term, sentence))

Output:

TERM: Intrusion Detection FOUND IN: There are many types of intrusion detection devices in production today.
TERM: Approved rewrite FOUND IN: The CTO approved a rewrite of the system
TERM: Approved rewrite FOUND IN: The CTO is about to approve a complete rewrite of the system
TERM: Approved rewrite FOUND IN: The CTO approved a rewriting
TERM: Firewall FOUND IN: Breaching of Firewalls
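Since the stated goal is to gather all matches and then process them, the loop above could be adapted to return a mapping instead of printing. The following is a minimal sketch under stated assumptions: `collect_matches` is a hypothetical helper, and the pattern strings are copied from the printed output earlier rather than recomputed with nltk. Compiling each pattern once is a small saving that may matter with about 700 terms.

```python
import re
from collections import defaultdict

# Assumed input: term -> pattern string, copied from the printed patterns above
term_patterns = {
    'Intrusion Detection': 'Intrus.*Detect.*',
    'Approved rewrite': 'Approv.*rewrit.*',
    'Firewall': 'Firewal.*',
}

def collect_matches(descriptions, term_patterns):
    """Map each description to the list of terms whose pattern matches it."""
    # Compile once up front instead of on every search
    compiled = {term: re.compile(p, re.IGNORECASE) for term, p in term_patterns.items()}
    matches = defaultdict(list)
    for text in descriptions:
        for term, rx in compiled.items():
            if rx.search(text):
                matches[text].append(term)
    # Descriptions with no matching term are simply absent from the result
    return dict(matches)

result = collect_matches(['Breaching of Firewalls', 'Nothing here'], term_patterns)
print(result)
```

The resulting lists can then be written to the database as metadata in one pass per entry.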

2 votes

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/6104576
