Q: CountVectorizer ignores "I"

Stack Overflow user

Asked 2015-10-21 13:22:39

Answers 1 · Views 1.9K · Followers 0 · Votes 10

Why does CountVectorizer in sklearn ignore the pronoun "I"?

from sklearn.feature_extraction.text import CountVectorizer

# Build bigrams (2-word n-grams) with the default tokenizer settings
ngram_vectorizer = CountVectorizer(analyzer="word", ngram_range=(2, 2), min_df=1)
ngram_vectorizer.fit_transform(['HE GAVE IT TO I'])
# <1x3 sparse matrix of type '<class 'numpy.int64'>'
ngram_vectorizer.get_feature_names()
# ['gave it', 'he gave', 'it to']

python

scikit-learn

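1 Answer

Stack Overflow user

Accepted answer

Answered 2015-10-21 16:34:55

The default tokenizer only considers words of two or more characters, so the single-letter token "I" is dropped before any n-grams are built.

You can change this behavior by passing an appropriate token_pattern to CountVectorizer.

The default pattern is (see the signature in the documentation):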

'token_pattern': u'(?u)\\b\\w\\w+\\b'
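To see why this pattern drops "I", here is a minimal sketch that applies the default pattern and a relaxed one directly with Python's re module. It only approximates the tokenization step: the lowercasing mimics CountVectorizer's default lowercase=True, and the variable names are illustrative rather than part of scikit-learn.

```python
import re

# CountVectorizer lowercases input by default, so do the same here
text = "HE GAVE IT TO I".lower()

default_pattern = r"(?u)\b\w\w+\b"  # two or more word characters
relaxed_pattern = r"(?u)\b\w+\b"    # one or more word characters

default_tokens = re.findall(default_pattern, text)
relaxed_tokens = re.findall(relaxed_pattern, text)

print(default_tokens)  # ['he', 'gave', 'it', 'to']
print(relaxed_tokens)  # ['he', 'gave', 'it', 'to', 'i']
```

With the default pattern the token list ends at "to", so no bigram containing "i" can ever be formed.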

For example, by changing the default you can get a CountVectorizer that does not drop one-letter words:

from sklearn.feature_extraction.text import CountVectorizer
ngram_vectorizer = CountVectorizer(analyzer="word", ngram_range=(2,2), 
                                   token_pattern=u"(?u)\\b\\w+\\b",min_df=1)
ngram_vectorizer.fit_transform(['HE GAVE IT TO I'])
print(ngram_vectorizer.get_feature_names())

This gives:

['gave it', 'he gave', 'it to', 'to i']

Votes 12

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/33260505
