So I computed a word-embedding matrix by hand with Keras, like this:
>>> word_embeddings
0 1 2 3
movie 0.007964 0.004251 -0.049078 0.032954 ...
film -0.006703 0.045888 -0.020975 0.012483 ...
one -0.011733 0.003348 -0.022017 -0.006476 ...
make 0.045888 -0.011219 0.037796 -0.041868 ...
1000 rows × 25 columns
What I want now is the n words most similar to a given input word, e.g. input='movie' -> output=['film', 'cinema', ...]
I computed a Euclidean distance matrix, but how do I get the result above from it?
>>> from sklearn.metrics.pairwise import euclidean_distances
>>> distance_matrix = euclidean_distances(word_embeddings)
array([[0. , 2.4705646, 2.363872 , ..., 3.1345532, 2.9737253,
2.791427 ],
[2.4705646, 0. , 2.3540049, ..., 3.6580865, 3.4589343,
3.494087 ],
[2.363872 , 2.3540049, 0. , ..., 3.9583569, 3.692863 ,
3.5237448],
...,
[3.1345532, 3.6580865, 3.9583569, ..., 0. , 4.0572405,
4.0648513],
[2.9737253, 3.4589343, 3.692863 , ..., 4.0572405, 0. ,
4.156624 ],
[2.791427 , 3.494087 , 3.5237448, ..., 4.0648513, 4.156624 ,
0. ]], dtype=float32)
1000 rows × 1000 columns
Posted on 2022-07-14 09:45:28
Try this:
top_k_similar_indexes = np.argsort(distance_matrix, axis=1)[:, :k]
Then, for each row, you get the indices of the k most similar words. If instead you want the indices of the k most dissimilar words, it would be np.argsort(distance_matrix, axis=1)[:, -k:]
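As a minimal runnable sketch of this argsort approach, assuming word_embeddings is a pandas DataFrame indexed by words as shown in the question (here shrunk to a random 4×25 stand-in), the resulting index positions can be mapped back to words via the DataFrame index. Note that each row's nearest neighbor is the word itself, since its self-distance is 0:

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances

# Small stand-in for the 1000x25 word_embeddings DataFrame from the question
rng = np.random.default_rng(0)
words = ['movie', 'film', 'one', 'make']
word_embeddings = pd.DataFrame(rng.normal(size=(4, 25)), index=words)

# Pairwise Euclidean distances between all rows
distance_matrix = euclidean_distances(word_embeddings)

k = 2
# Indices of the k smallest distances per row
# (column 0 is always the word itself, because its self-distance is 0)
top_k_similar_indexes = np.argsort(distance_matrix, axis=1)[:, :k]

# Map index positions back to words
top_k_words = word_embeddings.index.values[top_k_similar_indexes]
print(top_k_words)
```

Because the word itself always lands in the first column, in practice you would slice it off, which is exactly what the second answer below handles with `[:, 1:k+1]`.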
Posted on 2022-07-14 11:08:28
Here is an end-to-end example. You already have this code:
import numpy as np
import pandas as pd
df = pd.DataFrame({0:[0.9, 0.5, 0.3], 1: [0.2, 0.3, 0.1], 2:[0.4, 0.6, 0.1]}, index=['a', 'b', 'c'])
# Computing distance
from sklearn.metrics.pairwise import euclidean_distances
distance_matrix = euclidean_distances(df)
Then do the following:
k=2
# Note the range of 1:k+1
# You need to discard the first column
# as that would have the index of the same input word
# because distance between the same word is the minimum (i.e. 0)
top_k_ids = np.argsort(distance_matrix, axis=-1)[:, 1:k+1]
# The input words
inputs = df.index.values
# We go through each column of top_k_ids and index inputs
# and stack those results on columns axis
outputs = np.stack([inputs[top_k_ids[:, i]] for i in range(k)], axis=-1)
This gives the following as outputs:
array([['b', 'c'],
['a', 'c'],
['b', 'a']], dtype=object)
https://stackoverflow.com/questions/72978354
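The pattern above can also be wrapped for a single query word. The helper below, most_similar (the name is my own, not from the answer), reuses the toy df and computes distances only from the query row, excluding the word itself:

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances

# Same toy embeddings as in the answer above
df = pd.DataFrame({0: [0.9, 0.5, 0.3], 1: [0.2, 0.3, 0.1], 2: [0.4, 0.6, 0.1]},
                  index=['a', 'b', 'c'])

def most_similar(word, embeddings, k=2):
    """Return the k nearest words to `word` by Euclidean distance,
    excluding the word itself."""
    # Distances from the query row to every row: shape (1, n) -> (n,)
    distances = euclidean_distances(embeddings.loc[[word]], embeddings)[0]
    order = np.argsort(distances)   # nearest first; position 0 is the word itself
    neighbor_ids = order[1:k + 1]   # drop the self-match
    return list(embeddings.index.values[neighbor_ids])

print(most_similar('a', df))  # -> ['b', 'c']
```

This reproduces the rows of the `outputs` array above one word at a time, and avoids building the full n×n distance matrix when you only need neighbors for a single word.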