When I use the most_similar method provided by sense2vec, it prints out what looks like the entire vocabulary. I don't think this is working correctly. As an example, I passed n=50000000 just to test "decrease|VERB", and I got a list of 188325 entries:
import spacy
from sense2vec import Sense2Vec
from sense2vec import Sense2VecComponent
nlp = spacy.load("en_core_web_sm")
s2v = Sense2Vec().from_disk("./s2v_old/")
most_similar = s2v.most_similar("decrease|VERB", n=50000000)
# strip the "|POS" tag, turn underscores into spaces, keep alphabetic entries only
words = [' '.join(key.split('|')[0].split('_')) for key, score in most_similar]
j = sorted({w.lower() for w in words if w.isalpha()})
print(len(j)) # 188325
print(j[:100])
['a',
'aa',
'aaa',
'aaaa',
'aaaaa',
'aaaaaa',
'aaaaaaa',
'aaaaaaaa',
'aaaaaaaaa',
'aaaaaaaaaaa',
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
'aaaaaaaaaaaaaaand', ...]
These results have nothing to do with "decrease". I think the similarity computation is ignoring this, and it looks like a bug.
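The key-cleaning step in the snippet above can be checked in isolation on a small mock result list (no vectors needed; the tuples below are invented and stand in for whatever s2v.most_similar returns):

```python
# Invented stand-in for s2v.most_similar output: (key, score) pairs.
most_similar = [
    ('increase|VERB', 0.961),
    ('global_warming|NOUN', 0.7),
    ('aaaa|NOUN', 0.3),
    ('42|NUM', 0.2),
]

# strip the "|POS" tag, turn underscores into spaces, keep alphabetic entries only
words = [' '.join(key.split('|')[0].split('_')) for key, score in most_similar]
j = sorted({w.lower() for w in words if w.isalpha()})
print(j)  # → ['aaaa', 'increase']
```

Note that this filter silently drops multi-word keys such as "global warming", because str.isalpha() is False for strings containing spaces.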
Posted on 2019-12-18 15:55:55
If you ask for the 50000000 most similar entries, you get back the entire vocabulary. Try n=3 or n=10 instead:
s2v.most_similar("decrease|VERB", n=10)
# [('increase|VERB', 0.961), ('decreasing|VERB', 0.9295),
#  ('increasing|VERB', 0.9273), ('decreases|VERB', 0.9251),
#  ('increases|VERB', 0.9062), ('reducing|VERB', 0.904),
#  ('increases|NOUN', 0.8928), ('decrease|NOUN', 0.8826),
#  ('decreases|NOUN', 0.8751), ('reduce|VERB', 0.87)]
Note that the results are already sorted by decreasing similarity score.
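Because the results are sorted by score, a similarity cutoff can replace a huge n. A minimal sketch on mock output (the tuples and the 0.5 threshold below are invented for illustration):

```python
from itertools import takewhile

# Invented sorted output in the same (key, score) shape most_similar returns.
results = [('increase|VERB', 0.961), ('decreasing|VERB', 0.9295),
           ('reduce|VERB', 0.87), ('aaaa|NOUN', 0.12)]

# The list is sorted by decreasing score, so takewhile stops at the
# first entry below the threshold and the junk tail never appears.
strong = list(takewhile(lambda kv: kv[1] >= 0.5, results))
print(strong)  # → [('increase|VERB', 0.961), ('decreasing|VERB', 0.9295), ('reduce|VERB', 0.87)]
```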
If you were trying to compare the string "method" with the string "50000000", that is not the right approach. Here is a link to the usage guide: https://github.com/explosion/sense2vec#-quickstart
https://stackoverflow.com/questions/59377013