Q: How to embed data with Gensim using an already-trained model (GoogleNews-vectors-negative300.bin)

Stack Overflow user

Asked on 2020-02-20 09:06:25

Answers: 1 | Views: 297 | Followers: 0 | Votes: 0

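I am following this tutorial, which uses a dataset from Quora:

I have already cleaned and tokenized the data in the columns q1_clean and q2_clean.

I then used the pre-trained model to train a Word2Vec model with the code below.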

import pandas as pd
from gensim.models import Word2Vec

# Step 1: concatenate the tokenized Question1 and Question2 columns and
# initialize a Word2Vec model on them (gensim 3.x API; `size` is `vector_size` in gensim 4.x)
nData = pd.concat([data['q1_clean'], data['q2_clean']])
model_w2v = Word2Vec(nData, size=300)

# Step 2: intersect the initialized model with the pre-trained GoogleNews word2vec vectors
model_w2v.intersect_word2vec_format('GoogleNews-vectors-negative300.bin', lockf=1.0, binary=True)

# Step 3: fine-tune the model on the training data (transfer learning)
model_w2v.train(nData, total_examples=model_w2v.corpus_count, epochs=10)

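Now I need to do feature engineering, for which I have the function below. It computes the weighted average of the pairwise distances between two sets of word vectors.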

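import numpy as np
from sklearn.metrics import pairwise_distances

def get_pairwise_distance(word1, word2, weight1, weight2, method='euclidean'):
    # word1, word2: 2-D arrays of word vectors; weight1, weight2: per-word weights
    if word1.size == 0 or word2.size == 0:
        return np.nan
    dist_matrix = pairwise_distances(word1, word2, metric=method)
    # weight each (i, j) distance by weight1[i] * weight2[j]
    return np.average(dist_matrix, weights=np.outer(weight1, weight2))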
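
To make the weighting concrete, here is a small NumPy-only sketch of the same computation for the Euclidean case. The function name `weighted_mean_distance` and the toy vectors and weights are made up for illustration; they are not part of the tutorial.

```python
import numpy as np

def weighted_mean_distance(vecs1, vecs2, w1, w2):
    """Weighted mean of all pairwise Euclidean distances between two sets of word vectors."""
    if vecs1.size == 0 or vecs2.size == 0:
        return np.nan
    # dist[i, j] = Euclidean distance between word i of question 1 and word j of question 2
    diff = vecs1[:, None, :] - vecs2[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # weights[i, j] = w1[i] * w2[j]
    weights = np.outer(w1, w2)
    return np.average(dist, weights=weights)

# Toy data (hypothetical): question 1 has two words, question 2 has one word
q1 = np.array([[0.0, 0.0], [3.0, 4.0]])
q2 = np.array([[0.0, 0.0]])
w1 = np.array([1.0, 3.0])
w2 = np.array([2.0])

result = weighted_mean_distance(q1, q2, w1, w2)
print(result)  # (0*2 + 5*6) / (2 + 6) = 3.75
```

Each distance is weighted by the product of the two words' weights, so pairs of high-weight words dominate the average.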

Here I computed the tf-idf values that are used as weights:

X_train_tokens = get_tokenized_questions(data=X_train)

from sklearn.feature_extraction.text import TfidfVectorizer
pass_through = lambda x:x
tfidf = TfidfVectorizer(analyzer=pass_through)
# compute tf-idf weights for the words in the training set questions
X_tfidf = tfidf.fit_transform(X_train_tokens)

# split into two
# X1_tfidf -> tf-idf weights of first question in question pair and 
# X2_tfidf -> tf-idf weights of second question in question pair
X1_tfidf = X_tfidf[:len(X_train)]
X2_tfidf = X_tfidf[len(X_train):]

I call this get_pairwise_distance function in the same way as the function in the tutorial:

#cosine similarities
# here X1 and X2 are the embedded versions of the first and second questions in the question-pair data
# and X1_tfidf and X2_tfidf are the tf-idf weights of the first and second questions in the question-pair data

cosine = compute_pairwise_dist(X1, X2, X1_tfidf, X2_tfidf)

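For this function I need to pass the embedded versions of q1_clean and q2_clean as X1 and X2, with the weights already computed using TF-IDF. I do not know how to use the pre-trained model to embed these two columns into vectors and pass them to the given function.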
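
One possible way to build such an embedded version is sketched below. This is an assumption about what the tutorial expects, not its actual code: each tokenized question becomes a matrix with one row per word, plus a matching vector of tf-idf weights. The helper `embed_question` and the toy data are hypothetical. `vectors` is any word-to-vector lookup that supports `in` and `[]`; a plain dict stands in here for gensim's `model_w2v.wv`, which supports the same two operations.

```python
import numpy as np

def embed_question(tokens, vectors, tfidf_row, vocabulary):
    """Return (word-vector matrix, tf-idf weights) for tokens known to both lookups."""
    rows, weights = [], []
    # tf-idf already accounts for term frequency, so each distinct token is used once
    for tok in dict.fromkeys(tokens):
        if tok in vectors and tok in vocabulary:
            rows.append(vectors[tok])
            weights.append(tfidf_row[vocabulary[tok]])
    if not rows:
        # empty result; its .size is 0, which the distance function maps to NaN
        return np.empty((0, 0)), np.empty(0)
    return np.vstack(rows), np.asarray(weights)

# Toy stand-ins (hypothetical): 2-dimensional "embeddings" and one dense tf-idf row
vectors = {"how": np.array([1.0, 0.0]), "learn": np.array([0.0, 1.0])}
vocabulary = {"how": 0, "learn": 1, "python": 2}
tfidf_row = np.array([0.2, 0.5, 0.8])

mat, w = embed_question(["how", "learn", "python", "how"], vectors, tfidf_row, vocabulary)
print(mat.shape, w)
```

With the objects from the question, `vocabulary` would correspond to `tfidf.vocabulary_` and `tfidf_row` to a densified row such as `X1_tfidf[i].toarray().ravel()`. The resulting matrix and weights for each question in a pair could then be passed to `get_pairwise_distance` one pair at a time.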

machine-learning

scikit-learn

nlp

nltk

gensim

1 Answer

Stack Overflow user

Answered on 2020-02-21 14:10:42

You can use a Keras embedding matrix. Follow the link below: Keras Embedding layer
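
The answer does not show code, so the following is only a sketch of what an embedding matrix usually means in this context: a 2-D array whose row i holds the pre-trained vector for the word with integer index i. The helper `build_embedding_matrix`, the `word_index` mapping, and the toy vectors are assumptions for illustration. Indices are assumed to start at 1, with row 0 left for padding.

```python
import numpy as np

def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the pre-trained vector of the word with index i; row 0 is kept for padding."""
    matrix = np.zeros((len(word_index) + 1, dim))
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
        # words without a pre-trained vector keep an all-zero row
    return matrix

# Toy stand-ins (hypothetical): a 1-based word index and 2-dimensional "embeddings"
word_index = {"how": 1, "learn": 2, "python": 3}
vectors = {"how": np.array([1.0, 0.0]), "learn": np.array([0.0, 1.0])}

embedding_matrix = build_embedding_matrix(word_index, vectors, dim=2)
print(embedding_matrix.shape)  # (4, 2)
```

Such a matrix is typically used to initialize a Keras `Embedding` layer, for example through a constant initializer with `trainable=False`. Note that this targets feeding a neural network; the distance features in the question only need the per-word vectors themselves.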

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/60316242
