I have two strings:

a = 'Test - 4567: Controlling_robotic_hand_with_Arduino_uno'
b = 'Controlling robotic hand'

I need to check whether they match and print the corresponding result. Since b is the string I want to find inside a, the result should print 'Match' or 'Mis-Match'. The code should not depend on the '_' characters in a, since those could just as well be '-' or spaces. I tried the fuzzywuzzy library with fuzz.token_set_ratio to compute a ratio; from observation, a threshold of 95 seemed convincing. I would like to know whether there is another way to check this without fuzzywuzzy, perhaps with difflib. I tried difflib's SequenceMatcher, but all I got was a word-by-word comparison and I could not combine the results accurately.
I tried the following code.
from fuzzywuzzy import fuzz

a = 'Test - 4567: robotic_hand_with_Arduino_uno_controlling_pos0_pos1'
b = 'Controlling from pos0 to pos1'

ratio = fuzz.token_set_ratio(a.lower(), b.lower())
if ratio >= 95:
    print('Match')
else:
    print('Mis-Match')

Output:

'Mis-Match'

This gives a score of 64, even though 'controlling', 'pos0' and 'pos1' all appear in both a and b, so it should be a match.
I tried it this way because the comparison should not depend on '_', '-' or spaces.
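One fuzzywuzzy-free approach is to normalize the separators yourself and check what fraction of b's content words appear in a. The following is only a sketch: the stopword list and the threshold are illustrative assumptions, not part of any library, and would need tuning for real data.

```python
import re

# Illustrative stopword list -- an assumption, tune it for your data.
STOPWORDS = {'from', 'to', 'the', 'and', 'a', 'with'}

def tokens(s):
    # Split on '_', '-', spaces, ':' or any other non-alphanumeric run.
    return {t for t in re.split(r'[\W_]+', s.lower()) if t}

def is_match(a, b, threshold=1.0):
    # Fraction of b's content tokens that also occur in a.
    b_tokens = tokens(b) - STOPWORDS
    if not b_tokens:
        return False
    overlap = len(b_tokens & tokens(a)) / len(b_tokens)
    return overlap >= threshold

a = 'Test - 4567: Controlling_robotic_hand_with_Arduino_uno'
b = 'Controlling robotic hand'
print('Match' if is_match(a, b) else 'Mis-Match')  # prints: Match
```

With the same normalization, the second pair from the question ('robotic_hand_with_Arduino_uno_controlling_pos0_pos1' vs 'Controlling from pos0 to pos1') also matches once 'from' and 'to' are treated as stopwords. If typos must be tolerated as well, difflib.get_close_matches could replace the exact set-membership test.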
Posted on 2022-03-23 06:18:30
You can implement semantic matching with the gensim library, written as a function like this:
Initialization
The first time you run the code, a progress bar will go from 0% to 100% while gensim downloads the glove-wiki-gigaword-50 model; after that, everything is set up and you can simply run the code.
Code
from re import sub

import numpy as np
import gensim.downloader as api
from gensim.utils import simple_preprocess
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex, SoftCosineSimilarity


def MatchSemantic(query_string, documents):
    stopwords = ['the', 'and', 'are', 'a']

    if len(documents) == 1:
        documents.append('')

    def preprocess(doc):
        # Tokenize, clean up input document string
        doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
        doc = sub(r'<[^<>]+(>|$)', " ", doc)
        doc = sub(r'\[img_assist[^]]*?\]', " ", doc)
        doc = sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', " url_token ", doc)
        return [token for token in simple_preprocess(doc, min_len=0, max_len=float("inf")) if token not in stopwords]

    # Preprocess the documents, including the query string
    corpus = [preprocess(document) for document in documents]
    query = preprocess(query_string)

    # Load the model: this is a big file, can take a while to download and open
    glove = api.load("glove-wiki-gigaword-50")
    similarity_index = WordEmbeddingSimilarityIndex(glove)

    # Build the term dictionary, TF-IDF model
    dictionary = Dictionary(corpus + [query])
    tfidf = TfidfModel(dictionary=dictionary)

    # Create the term similarity matrix
    similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf)

    query_tf = tfidf[dictionary.doc2bow(query)]

    index = SoftCosineSimilarity(
        tfidf[[dictionary.doc2bow(document) for document in corpus]],
        similarity_matrix)

    doc_similarity_scores = index[query_tf]

    # Output the sorted similarity scores and documents
    sorted_indexes = np.argsort(doc_similarity_scores)[::-1]
    for idx in sorted_indexes:
        if documents[idx] != '':
            if doc_similarity_scores[idx] > 0.0:
                print('Match')
            else:
                print('Mis-Match')

Usage
For example, suppose we want to see whether Fruit and Vegetables matches any of the sentences or items in documents.

Test:
query_string = 'Fruit and Vegetables'
documents = ['I have an apple in my basket', 'I have a car in my house']

MatchSemantic(query_string, documents)

As expected, the first item, I have an apple in my basket, is semantically related to Fruit and Vegetables, so it prints Match; no relation is found for the second item, so it prints Mis-Match.
Output:
Match
Mis-Match

https://stackoverflow.com/questions/71582103