Q: How to extract TF using CountVectorizer?

Stack Overflow user

Asked 2018-11-06 08:59:49

Answers: 1 | Views: 776 | Followers: 0 | Votes: 0

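How can I get the term frequency (TF) of each term in the vocabulary created by sklearn.feature_extraction.text.CountVectorizer and put the values into a list or dict?

All the values stored against the keys in the vocabulary appear to be integers smaller than max_features, which I set manually when initializing CountVectorizer. They are not TF values, which should be floats. Can anyone help?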

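from sklearn.feature_extraction.text import CountVectorizer

# ngram_range must be passed as a keyword argument holding a (min_n, max_n) tuple
CV = CountVectorizer(ngram_range=(ngram_min_file_opcode, ngram_max_file_opcode),
                     decode_error="ignore", max_features=max_features_file_re,
                     token_pattern=r'\b\w+\b', min_df=1, max_df=1.0)
x = CV.fit_transform(x).toarray()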

machine-learning

scikit-learn

nlp

tfidfvectorizer

python

1 Answer

Stack Overflow user

Accepted answer

Posted 2018-11-06 09:28:57

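If you are expecting float values, you are probably looking for TF-IDF. In that case, use either sklearn.feature_extraction.text.TfidfVectorizer, or sklearn.feature_extraction.text.CountVectorizer followed by sklearn.feature_extraction.text.TfidfTransformer.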
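The two routes can be compared with a minimal sketch. It assumes scikit-learn's default parameters and reuses the sample corpus from the scikit-learn documentation; the variable names are illustrative and not part of the original answer.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Route 1: tokenize, count and apply TF-IDF weighting in one step.
tfidf_direct = TfidfVectorizer().fit_transform(corpus)

# Route 2: produce raw integer counts first, then reweight them.
counts = CountVectorizer().fit_transform(corpus)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# With default settings both routes should give the same float matrix.
same = np.allclose(tfidf_direct.toarray(), tfidf_two_step.toarray())
print(counts.dtype, tfidf_direct.dtype, same)
```

The two-step route is handy when the raw counts are also needed, since the intermediate count matrix stays available.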

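If you really only need TF, you can still use TfidfVectorizer, or CountVectorizer followed by TfidfTransformer. Just make sure to set the use_idf parameter of the TfidfVectorizer/TfidfTransformer to False and the norm (normalization) parameter to 'l1' or 'l2'. This normalizes the raw term counts within each document.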
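As a rough illustration of the TF-only setup, the sketch below uses norm='l1', so each document's values are its counts divided by the document's total token count and each row sums to 1. The corpus is the scikit-learn documentation sample, and the dict-building step is an assumption about what "put them in a dict" should look like, not something the original answer specifies.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# use_idf=False turns off the IDF weighting; norm="l1" divides every row
# by the sum of its counts, giving relative term frequencies per document.
tf_vectorizer = TfidfVectorizer(use_idf=False, norm="l1")
tf = tf_vectorizer.fit_transform(corpus).toarray()

# vocabulary_ maps each term to its column index in the matrix.
# Build one {term: tf} dict per document, keeping only terms that occur.
tf_per_document = [
    {term: float(row[col]) for term, col in tf_vectorizer.vocabulary_.items() if row[col] > 0}
    for row in tf
]

print(tf_per_document[0])
print(tf.sum(axis=1))
```

With norm='l2' the rows would instead have unit Euclidean length, so the values would no longer read as proportions of the document.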

From the SKLearn docs:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.toarray())  
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]

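The row [0 1 1 1 0 0 1 0 1] corresponds to the first document. The first element is the number of times 'and' occurs in that document, the second is the count for 'document', the third for 'first', and so on.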
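To connect this back to the question: the integers stored in vocabulary_ are column indices into this matrix, not frequencies, which is why they are always below max_features. A minimal sketch of turning the count matrix into a dict of corpus-wide counts follows; the names term_counts and corpus_tf are hypothetical, and "corpus-wide TF" here is assumed to mean a term's count divided by the total number of tokens.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# vocabulary_ maps term -> column index (an int), not a term frequency.
column_of = vectorizer.vocabulary_

# Sum every column over all documents to get total occurrences per term.
totals = np.asarray(X.sum(axis=0)).ravel()

# Raw counts per term, and the same counts as a share of all tokens.
term_counts = {term: int(totals[col]) for term, col in column_of.items()}
corpus_tf = {term: count / totals.sum() for term, count in term_counts.items()}

print(term_counts)
print(corpus_tf["document"])
```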

Votes: 4

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/53168622
