Using an algorithm like Levenshtein (e.g. the Levenshtein module or difflib), it is easy to find approximate matches, e.g.:
>>> import difflib
>>> difflib.SequenceMatcher(None,"amazing","amaging").ratio()
0.8571428571428571
Fuzzy matches can be detected by deciding on a threshold as needed.
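For instance, a minimal sketch of such threshold detection (the 0.8 cutoff and the helper name is_fuzzy_match are illustrative choices, not part of difflib):

```python
import difflib

def is_fuzzy_match(a, b, threshold=0.8):
    # ratio() returns a similarity score in [0, 1]:
    # twice the number of matching characters over the total length.
    return difflib.SequenceMatcher(None, a, b).ratio() >= threshold

print(is_fuzzy_match("amazing", "amaging"))  # ratio ~0.857 -> True
print(is_fuzzy_match("amazing", "banana"))   # low ratio -> False
```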
The current requirement: find fuzzy substrings within a larger string, based on a threshold.
For example:
large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"
#result = "manhatan","manhattin" and their indexes in large_string
One brute-force solution is to generate all substrings of length N-1 to N+1 (or other matching lengths), where N is the length of query_string, run Levenshtein on each one, and check the threshold.
Is there a better solution in Python, preferably a module included in Python 2.7, or an externally available module?
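For reference, the brute force described in the question can be sketched with difflib's ratio standing in for a Levenshtein score (the function name brute_force_fuzzy_substrings and the 0.88 threshold are illustrative assumptions):

```python
import difflib

def brute_force_fuzzy_substrings(large_string, query_string, threshold):
    """Try every substring whose length is within 1 of the query's
    length, and keep those scoring at or above the threshold."""
    n = len(query_string)
    results = []
    for length in range(n - 1, n + 2):
        for start in range(len(large_string) - length + 1):
            candidate = large_string[start:start + length]
            score = difflib.SequenceMatcher(None, candidate, query_string).ratio()
            if score >= threshold:
                results.append((candidate, start, score))
    return results

large_string = "thelargemanhatanproject is a great project in themanhattincity"
hits = brute_force_fuzzy_substrings(large_string, "manhattan", 0.88)
print(sorted({candidate for candidate, _, _ in hits}))
```

This is O(len(large_string)) comparisons per window length, each costing a full SequenceMatcher pass, which is exactly why the question asks for something better.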
The regex module works well. Although it is slightly slower than the built-in re module for the fuzzy-substring case (an expected result, given the extra operations), the desired output is produced and the degree of fuzziness can be easily controlled.
>>> import regex
>>> input = "Monalisa was painted by Leonrdo da Vinchi"
>>> regex.search(r'\b(leonardo){e<3}\s+(da)\s+(vinci){e<2}\b',input,flags=regex.IGNORECASE)
<regex.Match object; span=(23, 41), match=' Leonrdo da Vinchi', fuzzy_counts=(0, 2, 1)>
Posted on 2013-10-30 23:59:37
The new regex library, slated to replace re, includes fuzzy matching.
https://pypi.python.org/pypi/regex/
The fuzzy matching syntax looks fairly expressive, but this will match with one or fewer insertions/additions/deletions:
import regex
regex.match('(amazing){e<=1}', 'amaging')
Posted on 2013-07-19 08:11:13
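A small sketch of what that match object reports, assuming the regex module is installed; as in the Leonardo example above, fuzzy_counts is a (substitutions, insertions, deletions) tuple:

```python
import regex

# {e<=1} permits at most one error of any kind.
m = regex.match('(amazing){e<=1}', 'amaging')
print(m.fuzzy_counts)  # one substitution (z -> g): (1, 0, 0)

# Two or more edits exceed the {e<=1} budget, so there is no match.
print(regex.match('(amazing){e<=1}', 'amzgng'))  # None
```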
How about using difflib.SequenceMatcher.get_matching_blocks?
>>> import difflib
>>> large_string = "thelargemanhatanproject"
>>> query_string = "manhattan"
>>> s = difflib.SequenceMatcher(None, large_string, query_string)
>>> sum(n for i,j,n in s.get_matching_blocks()) / float(len(query_string))
0.8888888888888888
>>> query_string = "banana"
>>> s = difflib.SequenceMatcher(None, large_string, query_string)
>>> sum(n for i,j,n in s.get_matching_blocks()) / float(len(query_string))
0.6666666666666666
Update
import difflib

def matches(large_string, query_string, threshold):
    words = large_string.split()
    for word in words:
        s = difflib.SequenceMatcher(None, word, query_string)
        match = ''.join(word[i:i+n] for i, j, n in s.get_matching_blocks() if n)
        if len(match) / float(len(query_string)) >= threshold:
            yield match

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"
print list(matches(large_string, query_string, 0.8))
The code above prints: ['manhatan', 'manhattn']
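The generator above reports only the matched text, while the question also asks for indexes in large_string. A hedged Python 3 variant (matches_with_index is a hypothetical name, not from the original answer) that also yields the offset of the first matching block:

```python
import difflib

def matches_with_index(large_string, query_string, threshold):
    # Walk the words while tracking where each one starts, so a
    # repeated word is located at its own position, not an earlier copy.
    pos = 0
    for word in large_string.split():
        start = large_string.find(word, pos)
        pos = start + len(word)
        s = difflib.SequenceMatcher(None, word, query_string)
        blocks = [b for b in s.get_matching_blocks() if b.size]
        match = ''.join(word[b.a:b.a + b.size] for b in blocks)
        if blocks and len(match) / len(query_string) >= threshold:
            # Offset of the first matching block within large_string.
            yield match, start + blocks[0].a

large_string = "thelargemanhatanproject is a great project in themanhattincity"
print(list(matches_with_index(large_string, "manhattan", 0.8)))
# [('manhatan', 8), ('manhattn', 49)]
```

Note that the matched characters need not be contiguous in the word (e.g. 'manhattn' inside 'themanhattincity'), so the index points at the start of the first block rather than at a literal occurrence of the joined string.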
Posted on 2015-06-04 21:00:24
Use fuzzywuzzy for threshold-based fuzzy matching, and use fuzzysearch to fuzzily extract the word from the match.
process.extractBests takes a query, a list of words, and a cutoff score, and returns a list of tuples of matches and scores above the cutoff.
find_near_matches takes the result of process.extractBests and returns the start and end indexes of the word. I use those indexes to build the word, and use the built word to find its index in the large string. The max_l_dist parameter of find_near_matches is the Levenshtein distance, which needs to be adjusted as required.
from fuzzysearch import find_near_matches
from fuzzywuzzy import process

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"

def fuzzy_extract(qs, ls, threshold):
    '''fuzzy matches 'qs' in 'ls' and returns list of
    tuples of (word,index)
    '''
    for word, _ in process.extractBests(qs, (ls,), score_cutoff=threshold):
        print('word {}'.format(word))
        for match in find_near_matches(qs, word, max_l_dist=1):
            match = word[match.start:match.end]
            print('match {}'.format(match))
            index = ls.find(match)
            yield (match, index)
Test:
query_string = "manhattan"
print('query: {}\nstring: {}'.format(query_string, large_string))
for match, index in fuzzy_extract(query_string, large_string, 70):
    print('match: {}\nindex: {}'.format(match, index))

query_string = "citi"
print('query: {}\nstring: {}'.format(query_string, large_string))
for match, index in fuzzy_extract(query_string, large_string, 30):
    print('match: {}\nindex: {}'.format(match, index))

query_string = "greet"
print('query: {}\nstring: {}'.format(query_string, large_string))
for match, index in fuzzy_extract(query_string, large_string, 30):
    print('match: {}\nindex: {}'.format(match, index))
Output:
query: manhattan
string: thelargemanhatanproject is a great project in themanhattincity
match: manhatan
index: 8
match: manhattin
index: 49
query: citi
string: thelargemanhatanproject is a great project in themanhattincity
match: city
index: 58
query: greet
string: thelargemanhatanproject is a great project in themanhattincity
match: great
index: 29
https://stackoverflow.com/questions/17740833