CountVectorizer: Vocabulary wasn't fitted

Stack Overflow user

Asked 2015-09-20 07:59:26

1 answer, 22.5K views, 0 followers, 14 votes

I instantiated a sklearn.feature_extraction.text.CountVectorizer object and passed it a vocabulary through the vocabulary argument, but I get the error message sklearn.utils.validation.NotFittedError: CountVectorizer - Vocabulary wasn't fitted. Why?

Example:

import sklearn.feature_extraction
import numpy as np
import pickle

# Save the vocabulary
ngram_size = 1
dictionary_filepath = 'my_unigram_dictionary'
vectorizer = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size,ngram_size), min_df=1)

corpus = ['This is the first document.',
        'This is the second second document.',
        'And the third one.',
        'Is this the first document? This is right.',]

vect = vectorizer.fit(corpus)
print('vect.get_feature_names(): {0}'.format(vect.get_feature_names()))
pickle.dump(vect.vocabulary_, open(dictionary_filepath, 'wb'))  # binary mode, as pickle expects

# Load the vocabulary
vocabulary_to_load = pickle.load(open(dictionary_filepath, 'rb'))
loaded_vectorizer = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size,ngram_size), min_df=1, vocabulary=vocabulary_to_load)
print('loaded_vectorizer.get_feature_names(): {0}'.format(loaded_vectorizer.get_feature_names()))

Output:

vect.get_feature_names(): [u'and', u'document', u'first', u'is', u'one', u'right', u'second', u'the', u'third', u'this']
Traceback (most recent call last):
  File "C:\Users\Francky\Documents\GitHub\adobe\dstc4\test\CountVectorizerSaveDic.py", line 22, in <module>
    print('loaded_vectorizer.get_feature_names(): {0}'.format(loaded_vectorizer.get_feature_names()))
  File "C:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 890, in get_feature_names
    self._check_vocabulary()
  File "C:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 271, in _check_vocabulary
    check_is_fitted(self, 'vocabulary_', msg=msg),
  File "C:\Anaconda\lib\site-packages\sklearn\utils\validation.py", line 627, in check_is_fitted
    raise NotFittedError(msg % {'name': type(estimator).__name__})
sklearn.utils.validation.NotFittedError: CountVectorizer - Vocabulary wasn't fitted.

python

nlp

scikit-learn

1 Answer

Stack Overflow user

Accepted answer

Answered 2015-09-20 08:06:24

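Even though you passed vocabulary=vocabulary_to_load as an argument to sklearn.feature_extraction.text.CountVectorizer(), you still need to call loaded_vectorizer._validate_vocabulary() before you can call loaded_vectorizer.get_feature_names(). The traceback shows why: get_feature_names() checks for the fitted attribute vocabulary_, and passing a vocabulary to the constructor does not set that attribute by itself.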
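To make the mechanism concrete, here is a toy sketch of the pattern as I understand it from the traceback. ToyVectorizer and its NotFittedError are hypothetical stand-ins written for illustration, not scikit-learn code: the constructor only stores the parameter, and the attribute with the trailing underscore appears only after the validation step runs.

```python
class NotFittedError(ValueError):
    """Toy stand-in for the scikit-learn exception of the same name."""


class ToyVectorizer:
    """Hypothetical class that mimics the convention; not scikit-learn code."""

    def __init__(self, vocabulary=None):
        # The constructor only records the parameter; nothing is validated yet.
        self.vocabulary = vocabulary

    def _validate_vocabulary(self):
        # The fitted attribute (note the trailing underscore) is created here.
        if self.vocabulary is not None:
            self.vocabulary_ = dict(self.vocabulary)

    def get_feature_names(self):
        # Like the check in the traceback, only `vocabulary_` is inspected.
        if not hasattr(self, 'vocabulary_'):
            raise NotFittedError("ToyVectorizer - Vocabulary wasn't fitted.")
        # Return the terms ordered by their column index.
        return sorted(self.vocabulary_, key=self.vocabulary_.get)


toy = ToyVectorizer(vocabulary={'and': 0, 'document': 1})

try:
    toy.get_feature_names()
    raised = False
except NotFittedError:
    raised = True  # vocabulary was passed, but vocabulary_ does not exist yet

toy._validate_vocabulary()
names = toy.get_feature_names()
print(raised, names)
```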

So, in your example, when you create the CountVectorizer object with your own vocabulary, you should do the following:

# Open the pickle file in binary mode
with open(dictionary_filepath, 'rb') as f:
    vocabulary_to_load = pickle.load(f)
loaded_vectorizer = sklearn.feature_extraction.text.CountVectorizer(
    ngram_range=(ngram_size, ngram_size), min_df=1, vocabulary=vocabulary_to_load)
# Private method: sets loaded_vectorizer.vocabulary_ from the constructor argument
loaded_vectorizer._validate_vocabulary()
print('loaded_vectorizer.get_feature_names(): {0}'.format(
    loaded_vectorizer.get_feature_names()))
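If you would rather not rely on a private method, the sketch below shows an alternative that stays on the public API. It assumes a reasonably recent scikit-learn release, where transform() accepts a vectorizer built with a fixed vocabulary and sets vocabulary_ as a side effect; the version in the question may behave differently. Newer releases also replace get_feature_names() with get_feature_names_out(), so the sketch reads the names from vocabulary_ instead. The temporary file path is a hypothetical location chosen for the example.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document? This is right.',
]

# Fit once and keep only the learned vocabulary (a dict: term -> column index).
# Convert the indices to plain ints so the pickle holds no NumPy types.
fitted = CountVectorizer(ngram_range=(1, 1), min_df=1).fit(corpus)
vocabulary = {term: int(index) for term, index in fitted.vocabulary_.items()}

# Hypothetical location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), 'my_unigram_dictionary')
with open(path, 'wb') as f:  # pickle files are binary
    pickle.dump(vocabulary, f)

with open(path, 'rb') as f:
    vocabulary_to_load = pickle.load(f)

loaded_vectorizer = CountVectorizer(
    ngram_range=(1, 1), min_df=1, vocabulary=vocabulary_to_load)

# transform() is public and needs no prior fit() when the vocabulary is fixed.
counts = loaded_vectorizer.transform(['This is the first document.'])

# vocabulary_ now exists; list the terms ordered by column index.
feature_names = sorted(loaded_vectorizer.vocabulary_,
                       key=loaded_vectorizer.vocabulary_.get)
print(feature_names)
print(counts.toarray())
```

Calling fit() on any corpus would also work here, since a fixed vocabulary is not relearned from the data.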

17 votes

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/32674380
