Q: How to add tokens to a gensim dictionary

Stack Overflow user

Asked 2014-06-12 15:33:21

3 answers · 4.8K views · 0 followers · 4 votes

I use gensim to build a dictionary from a set of documents. Each document is a list of tokens. This is my code:

def constructModel(self, docTokens):
    """ Given document tokens, constructs the tf-idf and similarity models"""

    #construct dictionary for the BOW (vector-space) model : Dictionary = a mapping between words and their integer ids = collection of (word_index,word_string) pairs
    #print "dictionary"
    self.dictionary = corpora.Dictionary(docTokens)

    # prune dictionary: remove words that appear too infrequently or too frequently
    print("dictionary size before filter_extremes:", self.dictionary)  # len(self.dictionary.values())
    #self.dictionary.filter_extremes(no_below=1, no_above=0.9, keep_n=100000)
    #self.dictionary.compactify()

    print("dictionary size after filter_extremes:", self.dictionary)

    #construct the corpus bow vectors; bow vector = collection of (word_id,word_frequency) pairs
    corpus_bow = [self.dictionary.doc2bow(doc) for doc in docTokens]


    #construct the tf-idf model 
    self.model = models.TfidfModel(corpus_bow,normalize=True)
    corpus_tfidf = self.model[corpus_bow]   # first transform each raw bow vector in the corpus to the tfidf model's vector space
    self.similarityModel = similarities.MatrixSimilarity(corpus_tfidf)  # construct the term-document index

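My question is how to add a new document (a list of tokens) to this dictionary and update it. I searched the gensim documentation but did not find a solution.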

topic-modeling

topicmodels

python

gensim

3 Answers

Stack Overflow user

Answered 2014-09-23 08:29:08

The gensim documentation describes how to do this.

The approach is to create a second dictionary from the new documents and then merge the two.

from gensim import corpora

dict1 = corpora.Dictionary(firstDocs)
dict2 = corpora.Dictionary(moreDocs)
dict1.merge_with(dict2)

According to the documentation, this will "map the same tokens to the same ids and new tokens to new ids".

7 votes

Stack Overflow user

Answered 2017-07-17 18:24:07

You can use the add_documents method:

from gensim import corpora
text = [["aaa", "aaa"]]
dictionary = corpora.Dictionary(text)
dictionary.add_documents([['bbb','bbb']])
print(dictionary)

Running the code above prints the following:

Dictionary(2 unique tokens: ['aaa', 'bbb'])

For more details, see the documentation.

2 votes

Stack Overflow user

Answered 2020-11-25 00:21:15

Method 1:

You can simply use the keyed vectors from gensim.models.keyedvectors. They are very easy to use.

from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors

w2v = WordEmbeddingsKeyedVectors(50) # 50 = vec length
w2v.add(new_words, their_new_vecs)

Method 2:

If you have already built a model with gensim.models.Word2Vec, you can do the following. Suppose I want to add the token <UNK> with a random vector.

model.wv["<UNK>"] = np.random.rand(100) # 100 is the vectors length

A complete example looks like this:

import numpy as np
import gensim.downloader as api
from gensim.models import Word2Vec

dataset = api.load("text8")  # load dataset as iterable
model = Word2Vec(dataset)

model.wv["<UNK>"] = np.random.rand(100)

0 votes

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/24178843
