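I have two DataFrames. The column in df1 has 70k unique text values. The column in df2 has 20 unique text values.
For each of the 70k values, I want to find the closest synonym among the 20 values in the df2 column, and add a column to df1 that holds the best synonym for each value.
I found a piece of code that performs a semantic search and prints the top 5 synonyms with their scores.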
from sentence_transformers import SentenceTransformer, util
import torch

embedder = SentenceTransformer('paraphrase-distilroberta-base-v1')

# Corpus with example sentences
corpus = ind_type_new['Industry_type_new_list'].to_list()
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = df_test['industry_types_test'][df_test['industry_types_test'] != ''].head(50)

# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest 5 scores
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))

The output of the code is as follows:
======================
Query: Farming
Top 5 most similar sentences in corpus:
Agriculture (Score: 0.4851)
Construction (Score: 0.4436)
Manufacturing (Score: 0.4099)
Property (Score: 0.3876)
Importer (Score: 0.3616)
======================
Query: Shopping Centre
Top 5 most similar sentences in corpus:
Consumer Services (Score: 0.4105)
Hospitality (Score: 0.4089)
Business Services (Score: 0.3898)
Wholesale / Distribution (Score: 0.3863)
Retail (Score: 0.3625)
======================
Query: Retail Food
Top 5 most similar sentences in corpus:
Retail (Score: 0.7708)
Consumer Services (Score: 0.4168)
Accommodation and Food Services (Score: 0.4085)
Business Services (Score: 0.3977)
Insurance (Score: 0.3870)

All I want is one more column in the first DataFrame, holding the best-matching synonym for that column's value when compared against the second DataFrame.
For example, the result would look like a table with an industry type column and a "Match" column next to it.
Could you suggest what changes I should make to the code to get the desired result?
Posted on 2021-06-02 14:21:14
Suppose we want to add a column named "Match" to df_test:
matches = dict()  # dictionary to save the query -> best match mappings
top_k = 1  # because we only want the top match
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)
    # top_results[1] holds the index of the best-scoring corpus entry
    matches[query] = corpus[int(top_results[1])]

# add column to df based on dictionary values
df_test["Match"] = df_test["industry_types_test"].map(matches)

Sample output:
>>> df_test["industry_types_test"]
0            Farming
1    Shopping Centre
Name: industry_types_test, dtype: object
>>> df_test[["industry_types_test", "Match"]]
  industry_types_test                         Match
0             Farming                   Agriculture
1     Shopping Centre  Arts and Recreation Services
https://stackoverflow.com/questions/67805950
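The loop above encodes and scores one query at a time, which may be slow for 70k values. Below is a minimal sketch of a batched alternative. It assumes the embeddings for all queries and all corpus entries have already been computed as 2-D arrays, for instance with `embedder.encode(list_of_texts)`, which returns a NumPy array by default. It normalises the rows, takes one matrix product to get every cosine similarity, and picks the highest-scoring corpus entry per query. The helper name `best_matches` and the toy 2-D vectors are made up for illustration and are not part of the original answer.

```python
import numpy as np


def best_matches(query_emb, corpus_emb, corpus):
    """Return (labels, scores): the most similar corpus entry for each query row.

    Assumes both inputs are 2-D arrays with no all-zero rows.
    """
    q = np.asarray(query_emb, dtype=float)
    c = np.asarray(corpus_emb, dtype=float)
    # L2-normalise rows so that a dot product equals cosine similarity
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    sims = q @ c.T                              # shape: (n_queries, n_corpus)
    best = sims.argmax(axis=1)                  # top corpus index per query
    labels = [corpus[i] for i in best]
    scores = sims[np.arange(len(best)), best]   # similarity of each top match
    return labels, scores


# Toy 2-D vectors standing in for real sentence embeddings (illustrative only).
corpus = ["Agriculture", "Retail"]
corpus_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
query_emb = np.array([[0.9, 0.1], [0.2, 0.8]])

labels, scores = best_matches(query_emb, corpus_emb, corpus)
print(labels)  # ['Agriculture', 'Retail']
```

The returned list has one entry per query, in order, so if the queries were taken from the DataFrame column in row order, it could be assigned directly as the new "Match" column. Keeping the scores as well makes it possible to flag weak matches against a threshold of your choosing.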