I have two strings:

a = 'Test - 4567: Controlling_robotic_hand_with_Arduino_uno'
b = 'Controlling robotic hand'

I need to check whether they match and print the corresponding result. Since b is the string I want to find inside a, the result should print 'Match' or 'Mis-Match'. The code should not depend on the '_' characters in a, since those could just as well be '-' or spaces. I tried the fuzzywuzzy library with fuzz.token_set_ratio to compute a ratio; from observation, a threshold of 95 seemed convincing. I would like to know whether there is another way to check this without fuzzywuzzy, perhaps with difflib. I tried difflib's SequenceMatcher, but all I got was a word-by-word comparison and I could not combine the results accurately.
I tried the following code.
from fuzzywuzzy import fuzz

a = 'Test - 4567: robotic_hand_with_Arduino_uno_controlling_pos0_pos1'
b = 'Controlling from pos0 to pos1'

ratio = fuzz.token_set_ratio(a.lower(), b.lower())
if ratio >= 95:
    print('Match')
else:
    print('Mis-Match')

Output:

'Mis-Match'

This gives a score of 64, even though 'controlling', 'pos0' and 'pos1' all appear in both a and b, so it should be a match.
I tried it this way because the comparison should not depend on '_', '-' or spaces.
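One fuzzywuzzy-free approach is to normalize the separators yourself and check what fraction of b's content words appear in a. The following is only a sketch: the stopword list and the threshold are illustrative assumptions, not part of any library, and would need tuning for real data.

```python
import re

# Illustrative stopword list -- an assumption, tune it for your data.
STOPWORDS = {'from', 'to', 'the', 'and', 'a', 'with'}

def tokens(s):
    # Split on '_', '-', spaces, ':' or any other non-alphanumeric run.
    return {t for t in re.split(r'[\W_]+', s.lower()) if t}

def is_match(a, b, threshold=1.0):
    # Fraction of b's content tokens that also occur in a.
    b_tokens = tokens(b) - STOPWORDS
    if not b_tokens:
        return False
    overlap = len(b_tokens & tokens(a)) / len(b_tokens)
    return overlap >= threshold

a = 'Test - 4567: Controlling_robotic_hand_with_Arduino_uno'
b = 'Controlling robotic hand'
print('Match' if is_match(a, b) else 'Mis-Match')  # prints: Match
```

With the same normalization, the second pair from the question ('robotic_hand_with_Arduino_uno_controlling_pos0_pos1' vs 'Controlling from pos0 to pos1') also matches once 'from' and 'to' are treated as stopwords. If typos must be tolerated as well, difflib.get_close_matches could replace the exact set-membership test.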
Posted on 2022-03-23 06:18:30
You can implement semantic matching with the gensim library, written as a function like this:
Initialization
The first time you run the code, a progress bar will go from 0% to 100% while gensim downloads the glove-wiki-gigaword-50 model; after that, everything is set up and you can simply run the code.
Code
from re import sub

import numpy as np
import gensim.downloader as api
from gensim.utils import simple_preprocess
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex, SoftCosineSimilarity


def MatchSemantic(query_string, documents):
    stopwords = ['the', 'and', 'are', 'a']

    if len(documents) == 1:
        documents.append('')

    def preprocess(doc):
        # Tokenize, clean up input document string
        doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
        doc = sub(r'<[^<>]+(>|$)', " ", doc)
        doc = sub(r'\[img_assist[^]]*?\]', " ", doc)
        doc = sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', " url_token ", doc)
        return [token for token in simple_preprocess(doc, min_len=0, max_len=float("inf")) if token not in stopwords]

    # Preprocess the documents, including the query string
    corpus = [preprocess(document) for document in documents]
    query = preprocess(query_string)

    # Load the model: this is a big file, can take a while to download and open
    glove = api.load("glove-wiki-gigaword-50")
    similarity_index = WordEmbeddingSimilarityIndex(glove)

    # Build the term dictionary, TF-IDF model
    dictionary = Dictionary(corpus + [query])
    tfidf = TfidfModel(dictionary=dictionary)

    # Create the term similarity matrix
    similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf)

    query_tf = tfidf[dictionary.doc2bow(query)]

    index = SoftCosineSimilarity(
        tfidf[[dictionary.doc2bow(document) for document in corpus]],
        similarity_matrix)

    doc_similarity_scores = index[query_tf]

    # Output the sorted similarity scores and documents
    sorted_indexes = np.argsort(doc_similarity_scores)[::-1]
    for idx in sorted_indexes:
        if documents[idx] != '':
            if doc_similarity_scores[idx] > 0.0:
                print('Match')
            else:
                print('Mis-Match')

Usage
For example, suppose we want to see whether Fruit and Vegetables matches any of the sentences or items in documents.

Test:
query_string = 'Fruit and Vegetables'
documents = ['I have an apple in my basket', 'I have a car in my house']

MatchSemantic(query_string, documents)

As expected, the first item, I have an apple in my basket, is semantically related to Fruit and Vegetables, so it prints Match; no relation is found for the second item, so it prints Mis-Match.
Output:
Match
Mis-Match

https://stackoverflow.com/questions/71582103