
Word2vec on a pandas DataFrame

Stack Overflow user
Asked 2020-10-11 10:53:41
Answers: 1 · Views: 5.8K · Followers: 0 · Votes: 1

I am trying to apply word2vec to check the similarity between two columns in each row of my dataset.

For example:

Sent1                                     Sent2
It is a sunny day                         Today the weather is good. It is warm outside
What people think about democracy         In ancient times, Greeks were the first to propose democracy  
I have never played tennis                I do not know who Roger Feder is 

To apply word2vec, I am considering the following:

import numpy as np

words1 = sentence1.split(' ')
words2 = sentence2.split(' ')

# The meaning of a sentence can be interpreted as the average of its word vectors
sentence1_meaning = word2vec(words1[0])
count = 1
for w in words1[1:]:
    sentence1_meaning = np.add(sentence1_meaning, word2vec(w))
    count += 1
sentence1_meaning /= count

sentence2_meaning = word2vec(words2[0])
count = 1
for w in words2[1:]:
    sentence2_meaning = np.add(sentence2_meaning, word2vec(w))
    count += 1
sentence2_meaning /= count

# Similarity is the cosine of the angle between the two vectors
similarity = np.dot(sentence1_meaning, sentence2_meaning) / (
    np.linalg.norm(sentence1_meaning) * np.linalg.norm(sentence2_meaning)
)

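However, this works for two individual sentences, not for a pandas DataFrame.

Could you tell me what I need to do to apply word2vec to a pandas DataFrame to check the similarity between Sent1 and Sent2? I would like a new column holding the results.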


1 Answer

Stack Overflow user

Accepted answer

Posted 2020-10-11 11:57:36

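I don't have a trained word2vec model available, so I'll show how to do what you want with a fake word2vec, using tf-idf weights to combine word vectors into sentence vectors.

Step 1. Prepare the data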

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({"sentences": ["this is a sentence", "this is another sentence"]})

tfidf = TfidfVectorizer()
# Dense (n_sentences x n_vocab) array of tf-idf weights
tfidf_matrix = tfidf.fit_transform(df.sentences).toarray()
vocab = tfidf.vocabulary_  # maps each word to its column index
vocab
{'this': 3, 'is': 1, 'sentence': 2, 'another': 0}

Step 2. Create a fake word2vec (with one row per word in our vocabulary)

import numpy as np

# One random 300-dimensional vector per vocabulary word
word2vec = np.random.randn(len(vocab), 300)

Step 3. Compute a column of sentence vectors from word2vec:

sent2vec_matrix = np.dot(tfidf_matrix, word2vec) # word2vec here contains vectors in the same order as in vocab
df["sent2vec"] = sent2vec_matrix.tolist()
df

                  sentences                                           sent2vec
0        this is a sentence  [-2.098592110459085, 1.4292324332403232, -1.10...
1  this is another sentence  [-1.7879436822159966, 1.680865619703155, -2.00...
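
The matrix product in step 3 amounts to a tf-idf-weighted sum of word vectors for each sentence. A minimal sketch with made-up numbers (the weights and the 2-dimensional vectors below are illustrative assumptions, not output from the code above):

```python
import numpy as np

# Hypothetical tf-idf weights: 2 sentences x 3 vocabulary words.
weights = np.array([[0.5, 0.0, 0.5],
                    [0.0, 1.0, 0.0]])

# Hypothetical word vectors: one 2-dimensional row per vocabulary word,
# in the same order as the columns of `weights`.
vectors = np.array([[1.0, 0.0],
                    [0.0, 2.0],
                    [3.0, 4.0]])

# Matrix product: row i is the weighted sum of word vectors for sentence i.
sentence_vectors = weights @ vectors

# The same result written out explicitly for the first sentence.
manual_first = sum(w * v for w, v in zip(weights[0], vectors))
```

This is why the rows of word2vec must follow the same order as the columns of the tf-idf matrix: each weight has to multiply the vector of its own word.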

Step 4. Compute the similarity matrix

from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(df["sent2vec"].tolist())
similarity
array([[1.        , 0.76557098],
       [0.76557098, 1.        ]])
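
Note that step 4 compares every sentence with every other sentence. For the layout in the question (two columns, one similarity per row), a row-wise variant could look like the sketch below. The names `toy_vectors`, `embed`, and `row_similarity`, and the tiny 2-dimensional vectors, are hypothetical stand-ins for a trained model; words missing from the lookup are simply skipped, and plain averaging is used instead of tf-idf weighting.

```python
import numpy as np
import pandas as pd

# Hypothetical word vectors standing in for a trained word2vec model.
toy_vectors = {
    "sunny": np.array([1.0, 0.0]),
    "warm": np.array([0.9, 0.1]),
    "tennis": np.array([0.0, 1.0]),
}

def embed(sentence, vectors, dim=2):
    """Average the vectors of the known words; zeros if none are known."""
    known = [vectors[w] for w in sentence.lower().split() if w in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

def row_similarity(row, vectors):
    """Cosine similarity between the Sent1 and Sent2 embeddings of one row."""
    a, b = embed(row["Sent1"], vectors), embed(row["Sent2"], vectors)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else float("nan")

df = pd.DataFrame({
    "Sent1": ["It is a sunny day", "I have never played tennis"],
    "Sent2": ["It is warm outside", "It is a sunny day"],
})
# New column with one similarity score per row.
df["similarity"] = df.apply(row_similarity, axis=1, vectors=toy_vectors)
```

Rows where neither sentence contains a known word get `NaN` rather than a misleading score.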

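To make this work with a real word2vec, you need to adjust step 2 slightly so that word2vec contains all the words in vocab in the same order (the order given by the index values in vocab, which is alphabetical).

In your case, it should be: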

# Order words by column index so the rows line up with the tfidf_matrix columns
sorted_vocab = sorted(vocab, key=vocab.get)
# word2vec here is your trained model, indexable by word
sorted_word2vec = [word2vec[word] for word in sorted_vocab]
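
A trained model will usually not contain every word in the tf-idf vocabulary, so the lookup above can raise a `KeyError`. One way to guard against that, sketched here with a plain dict named `pretrained` as a hypothetical stand-in for the trained vectors, is to fall back to a zero vector so that unknown words contribute nothing to the sentence vector:

```python
import numpy as np

# Vocabulary as produced in step 1 (word -> column index).
vocab = {"this": 3, "is": 1, "sentence": 2, "another": 0}

# Hypothetical pretrained vectors; "another" is deliberately missing.
pretrained = {
    "this": np.array([1.0, 0.0]),
    "is": np.array([0.0, 1.0]),
    "sentence": np.array([1.0, 1.0]),
}
dim = 2  # assumed vector size for this toy example

# Row i of the matrix holds the vector of the word whose tf-idf column is i.
aligned = np.zeros((len(vocab), dim))
for word, index in vocab.items():
    if word in pretrained:
        aligned[index] = pretrained[word]
```

The resulting `aligned` array can then take the place of the fake word2vec matrix in step 3.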

Votes: -1
Original page content provided by Stack Overflow; translation support provided by Tencent Cloud's Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/64303203
