Q: Looking for an effective NLP phrase embedding model

Stack Overflow user

Asked 2020-09-11 08:53:05

Answers: 1, Views: 1.2K, Followers: 0, Votes: 3

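My goal is to find a good word-and-phrase embedding model that (1) has embeddings for the words and phrases I am interested in, and (2) lets me use those embeddings to compare the similarity between two items, each of which can be a word or a phrase.

So far I have tried two approaches:

1: Pre-trained models loaded through Gensim, for example: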

import gensim.downloader as api
# Download the pre-trained vectors and return them as an object ready for use
model_fasttext = api.load("fasttext-wiki-news-subwords-300")
# Raises KeyError when either token is missing from the vocabulary
model_fasttext.similarity('computer-science', 'machine-learning')

The problem with this approach is that I cannot tell in advance whether a phrase has an embedding. For this example, I get an error:

KeyError: "word 'computer-science' not in vocabulary"

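I then have to try other pre-trained models, such as word2vec-google-news-300, glove-wiki-gigaword-300, glove-twitter-200, and so on. The results are similar: there are always phrases of interest that have no embedding.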
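One common workaround for the missing-phrase problem described above is to check membership first and, when the phrase itself has no vector, fall back to the mean of its component words. The sketch below illustrates that idea only; `TOY_VECTORS` is a hypothetical stand-in for a loaded model, and `phrase_vector` is an illustrative helper, not part of Gensim.

```python
import math

# Hypothetical toy vocabulary standing in for a loaded embedding model.
TOY_VECTORS = {
    "computer": [0.9, 0.1, 0.0],
    "science": [0.7, 0.3, 0.1],
    "machine": [0.8, 0.2, 0.1],
    "learning": [0.6, 0.4, 0.0],
    "baby": [0.0, 0.1, 0.9],
}


def phrase_vector(phrase, vectors):
    """Return the phrase's own vector if present, otherwise the mean of the
    vectors of its in-vocabulary words, or None if no word is known."""
    if phrase in vectors:
        return vectors[phrase]
    # Split on hyphens and whitespace, keeping only words that have a vector.
    words = [w for w in phrase.replace("-", " ").split() if w in vectors]
    if not words:
        return None
    dim = len(vectors[words[0]])
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors, returned as a float."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


cs = phrase_vector("computer-science", TOY_VECTORS)
ml = phrase_vector("machine-learning", TOY_VECTORS)
print(cosine_similarity(cs, ml))
```

With a Gensim KeyedVectors object the membership test can be written the same way (`phrase in model`), which avoids the KeyError. Averaging word vectors ignores word order, so it is a rough baseline rather than a true phrase embedding.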

2: Then I tried a BERT-based sentence embedding method: https://github.com/UKPLab/sentence-transformers

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('distilbert-base-nli-mean-tokens')

from scipy.spatial.distance import cosine

def cosine_similarity(embedding_1, embedding_2):
    # Calculate the cosine similarity of the two embeddings.
    sim = 1 - cosine(embedding_1, embedding_2)
    print('Cosine similarity: {:.2}'.format(sim))

phrase_1 = 'baby girl'
phrase_2 = 'annual report'
embedding_1 = model.encode(phrase_1)
embedding_2 = model.encode(phrase_2)
cosine_similarity(embedding_1[0], embedding_2[0])

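With this approach I can get embeddings for my phrases, but the similarity score is 0.93, which does not seem reasonable.

So what else can I try to achieve the two goals above?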
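The `[0]` index in the snippet above is worth double-checking. As an assumption, not something the source confirms: depending on the library version, `encode` called with a bare string may return a single vector (so `[0]` picks out one number) or may treat the string as a sequence of inputs. Either way, the comparison would not be made on the full phrase vectors. The sketch below passes the phrases as a list and compares whole vectors; `phrase_similarity` and `toy_encode` are hypothetical names, and `toy_encode` is only a stand-in so the example runs without downloading a model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors, returned as a float."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def phrase_similarity(encode, phrase_1, phrase_2):
    """Compare two phrases using an encoder that maps a list of strings
    to one vector per string."""
    # Passing a list makes each phrase a single input; the full vectors
    # (not their first elements) are then compared.
    vec_1, vec_2 = encode([phrase_1, phrase_2])
    return cosine_similarity(vec_1, vec_2)


def toy_encode(phrases):
    """Hypothetical stand-in encoder: letter counts over a-z."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    return [[p.lower().count(c) for c in alphabet] for p in phrases]


print(phrase_similarity(toy_encode, "baby girl", "annual report"))
```

With the model from the question, the call would presumably look like `phrase_similarity(model.encode, 'baby girl', 'annual report')`, since `encode` accepts a list of sentences and returns one embedding per sentence.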

Tags: word2vec, fasttext, nlp, gensim

1 Answer

Stack Overflow user

Accepted answer

Posted 2020-09-14 08:40:43

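The problem with your first approach is that you are loading the fastText embeddings as if they were word2vec embeddings, and word2vec cannot handle out-of-vocabulary (OOV) words.

The good news is that fastText can handle OOV words. You can use either Facebook's original implementation (pip install fasttext) or the Gensim implementation.

For example, with the Facebook implementation you can do: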

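import fasttext
import fasttext.util
from scipy.spatial.distance import cosine

def cosine_similarity(embedding_1, embedding_2):
    # Same helper as in the question: print the cosine similarity of two vectors.
    sim = 1 - cosine(embedding_1, embedding_2)
    print('Cosine similarity: {:.2}'.format(sim))

# Download an English model (skipped if the file already exists)
fasttext.util.download_model('en', if_exists='ignore')
model = fasttext.load_model('cc.en.300.bin')

# Get word embeddings; this also works for tokens that are not in the vocabulary
# (if you want sentence embeddings instead, use the get_sentence_vector method)
word_1 = 'computer-science'
word_2 = 'machine-learning'
embedding_1 = model.get_word_vector(word_1)
embedding_2 = model.get_word_vector(word_2)

# Compare the embeddings
cosine_similarity(embedding_1, embedding_2)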
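fastText can return a vector for an unseen token such as 'computer-science' because it represents a word through its character n-grams and combines the vectors of those n-grams. The sketch below shows only the n-gram extraction step, assuming the `<` and `>` boundary markers and the 3 to 6 length range used by default for fastText's unsupervised models. The function names are hypothetical, and this is an illustration of the idea, not the library's code.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    marked = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the marked word.
        for start in range(len(marked) - n + 1):
            grams.append(marked[start:start + n])
    return grams


def shared_ngram_ratio(word_a, word_b):
    """Jaccard overlap between the n-gram sets of two words."""
    grams_a = set(char_ngrams(word_a))
    grams_b = set(char_ngrams(word_b))
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(char_ngrams("where", 3, 3))
print(shared_ngram_ratio("computer-science", "computer"))
```

Because 'computer-science' shares many n-grams with 'computer' and 'science', its composed vector tends to land near theirs even though the hyphenated token itself was never seen during training.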

Votes: 3

Original content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/63843793
