Question: Applying text2vec embeddings to new data

Asked by a Stack Overflow user on 2017-02-02 21:20:17

1 answer · 938 views · 0 followers · 2 votes

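I used text2vec to generate custom word embeddings from a corpus of proprietary text data that contains a lot of industry-specific jargon (so stock embeddings, such as those available from Google, won't work). The analogies work great, but I'm having difficulty applying the embeddings to assess new data. I want to use the embeddings I've already trained to understand relationships in new data. The approach I'm using (described below) seems convoluted, and it's painfully slow. Is there a better approach? Perhaps something already built into the package that I've simply missed?

Here's my approach (offered with the closest thing to reproducible code I can provide, given that I'm using a proprietary data source):

d = a list containing the new data. Each element is of class character.

vecs = the word vectors obtained from text2vec's implementation of GloVe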

  new_vecs <- sapply(d, function(y) {
    # for each statement, create an iterator over its word tokens
    it <- itoken(word_tokenizer(y), progressbar = FALSE)
    # for each document, create a vocabulary without stop words
    voc <- create_vocabulary(it, stopwords = tm::stopwords())
    # subset vecs to the words in the new document, then average them
    # (voc$vocab$terms is the 2017-era API; newer text2vec releases use voc$term)
    vecs[rownames(vecs) %in% voc$vocab$terms, , drop = FALSE] %>%
      colMeans
  }) %>% t  # transpose to return a matrix with one row per statement

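For my use case, I need to keep the results separate for each document, so anything that involves pasting together the elements of d won't work, but surely there must be a better way than what I've cobbled together. I feel like I must be missing something rather obvious.

Any help will be greatly appreciated.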

text2vec

1 Answer

Accepted answer, posted by a Stack Overflow user on 2017-02-03 06:38:59

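You need to do this in "batch" mode, using efficient linear algebra matrix operations. The idea is to build a document-term matrix (dtm) for the documents in d. This matrix records how many times each word appears in each document. Then you just multiply the dtm by the matrix of embeddings: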

library(text2vec)
# we are only interested in words that have an embedding
voc = create_vocabulary(rownames(vecs))
# now create the document-term matrix
vectorizer = vocab_vectorizer(voc)
# itoken() tokenizes character vectors; a list is treated as already-tokenized
# input, so flatten d first (assumes each element of d is a single string)
dtm = itoken(unlist(d), tokenizer = word_tokenizer) %>% 
  create_dtm(vectorizer)

# normalize: compute term frequencies, i.e. divide the count of each word
# in a document by the total number of words in that document,
# so that we end up with the average of the word vectors (not their sum!)
dtm = normalize(dtm, "l1")
# document vectors (average of the word vectors) are the matrix product
# of dtm and the embeddings matrix
document_vecs = dtm %*% vecs
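
To see why the matrix product gives per-document averages, here is a minimal sketch on toy data. The three-word `vecs` matrix and the documents in `d` are made up for illustration and merely stand in for the trained GloVe matrix and the new data. The sketch compares the batch result with a plain per-document loop and also reorders the dtm columns explicitly, so the result does not depend on the column order that `create_dtm()` happens to produce.

```r
library(text2vec)
library(Matrix)

# Hypothetical stand-in for the trained GloVe matrix: 3 words, 2 dimensions.
vecs <- matrix(c(1, 0,
                 0, 1,
                 1, 1),
               ncol = 2, byrow = TRUE,
               dimnames = list(c("pump", "valve", "seal"), NULL))

# Hypothetical new documents, one string each ("gasket" has no embedding).
d <- c("pump valve", "seal seal pump", "valve gasket valve seal")

# Batch version: term-frequency-weighted average via one matrix product.
voc <- create_vocabulary(rownames(vecs))
dtm <- create_dtm(itoken(d, tokenizer = word_tokenizer, progressbar = FALSE),
                  vocab_vectorizer(voc))
dtm <- dtm[, rownames(vecs), drop = FALSE]  # align dtm columns with vecs rows
batch <- as.matrix(normalize(dtm, "l1") %*% vecs)

# Per-document version: average the vectors of the known tokens,
# counting repeated words as many times as they occur.
loop <- t(sapply(word_tokenizer(d), function(tokens) {
  known <- tokens[tokens %in% rownames(vecs)]
  colMeans(vecs[known, , drop = FALSE])
}))

# Both routes should agree up to floating-point tolerance.
stopifnot(isTRUE(all.equal(unname(batch), unname(loop))))
```

Note that this average weights each word by how often it occurs in the document, whereas the loop in the question counts each distinct word once. If the latter is what you want, one option (an assumption about intent, not part of the answer) is to binarize the counts, for example with `sign(dtm)`, before normalizing.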

7 votes

Original content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/42012496
