So I computed a word-embedding matrix by hand with Keras, like this:
>>> word_embeddings
0 1 2 3
movie 0.007964 0.004251 -0.049078 0.032954 ...
film -0.006703 0.045888 -0.020975 0.012483 ...
one -0.011733 0.003348 -0.022017 -0.006476 ...
make 0.045888 -0.011219 0.037796 -0.041868 ...
1000 rows × 25 columns
What I want now is the n words most similar to a given input word, e.g. input='movie' -> output=['film', 'cinema', ...]
I computed a Euclidean distance matrix, but how do I get the result above from it?
>>> from sklearn.metrics.pairwise import euclidean_distances
>>> distance_matrix = euclidean_distances(word_embeddings)
array([[0. , 2.4705646, 2.363872 , ..., 3.1345532, 2.9737253,
2.791427 ],
[2.4705646, 0. , 2.3540049, ..., 3.6580865, 3.4589343,
3.494087 ],
[2.363872 , 2.3540049, 0. , ..., 3.9583569, 3.692863 ,
3.5237448],
...,
[3.1345532, 3.6580865, 3.9583569, ..., 0. , 4.0572405,
4.0648513],
[2.9737253, 3.4589343, 3.692863 , ..., 4.0572405, 0. ,
4.156624 ],
[2.791427 , 3.494087 , 3.5237448, ..., 4.0648513, 4.156624 ,
0. ]], dtype=float32)
1000 rows × 1000 columns
Posted on 2022-07-14 09:45:28
Try this:
top_k_similar_indexes = np.argsort(distance_matrix, axis=1)[:, :k]
Then, for each row, you get the indices of the k most similar words. If instead you want the indices of the k most dissimilar words, it would be np.argsort(distance_matrix, axis=1)[:, -k:]
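As a minimal runnable sketch of this argsort approach, assuming word_embeddings is a pandas DataFrame indexed by words as shown in the question (here shrunk to a random 4×25 stand-in), the resulting index positions can be mapped back to words via the DataFrame index. Note that each row's nearest neighbor is the word itself, since its self-distance is 0:

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances

# Small stand-in for the 1000x25 word_embeddings DataFrame from the question
rng = np.random.default_rng(0)
words = ['movie', 'film', 'one', 'make']
word_embeddings = pd.DataFrame(rng.normal(size=(4, 25)), index=words)

# Pairwise Euclidean distances between all rows
distance_matrix = euclidean_distances(word_embeddings)

k = 2
# Indices of the k smallest distances per row
# (column 0 is always the word itself, because its self-distance is 0)
top_k_similar_indexes = np.argsort(distance_matrix, axis=1)[:, :k]

# Map index positions back to words
top_k_words = word_embeddings.index.values[top_k_similar_indexes]
print(top_k_words)
```

Because the word itself always lands in the first column, in practice you would slice it off, which is exactly what the second answer below handles with `[:, 1:k+1]`.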
Posted on 2022-07-14 11:08:28
Here is an end-to-end example. You already have this code:
import numpy as np
import pandas as pd
df = pd.DataFrame({0:[0.9, 0.5, 0.3], 1: [0.2, 0.3, 0.1], 2:[0.4, 0.6, 0.1]}, index=['a', 'b', 'c'])
# Computing distance
from sklearn.metrics.pairwise import euclidean_distances
distance_matrix = euclidean_distances(df)
Then do the following:
k=2
# Note the range of 1:k+1
# You need to discard the first column
# as that would have the index of the same input word
# because distance between the same word is the minimum (i.e. 0)
top_k_ids = np.argsort(distance_matrix, axis=-1)[:, 1:k+1]
# The input words
inputs = df.index.values
# We go through each column of top_k_ids and index inputs
# and stack those results on columns axis
outputs = np.stack([inputs[top_k_ids[:, i]] for i in range(k)], axis=-1)
This gives the following as outputs:
array([['b', 'c'],
['a', 'c'],
['b', 'a']], dtype=object)
https://stackoverflow.com/questions/72978354
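The pattern above can also be wrapped for a single query word. The helper below, most_similar (the name is my own, not from the answer), reuses the toy df and computes distances only from the query row, excluding the word itself:

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances

# Same toy embeddings as in the answer above
df = pd.DataFrame({0: [0.9, 0.5, 0.3], 1: [0.2, 0.3, 0.1], 2: [0.4, 0.6, 0.1]},
                  index=['a', 'b', 'c'])

def most_similar(word, embeddings, k=2):
    """Return the k nearest words to `word` by Euclidean distance,
    excluding the word itself."""
    # Distances from the query row to every row: shape (1, n) -> (n,)
    distances = euclidean_distances(embeddings.loc[[word]], embeddings)[0]
    order = np.argsort(distances)   # nearest first; position 0 is the word itself
    neighbor_ids = order[1:k + 1]   # drop the self-match
    return list(embeddings.index.values[neighbor_ids])

print(most_similar('a', df))  # -> ['b', 'c']
```

This reproduces the rows of the `outputs` array above one word at a time, and avoids building the full n×n distance matrix when you only need neighbors for a single word.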