Question: Passing tokens to CountVectorizer

Stack Overflow user

Asked 2016-03-08 12:32:14

3 answers · 14.8K views · 0 followers · 14 votes

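I have a text classification problem with two types of features:

N-gram features (extracted by CountVectorizer)
Other textual features (e.g. the presence of a word from a given lexicon). These features are different from n-grams, since they should be a part of any n-gram extracted from the text.

Both types of features are extracted from the text's tokens. I want to run tokenization only once and then pass the tokens both to CountVectorizer and to the other presence-feature extractor. So I want to pass a list of tokens to CountVectorizer, but it only accepts a string as the representation of a sample. Is there a way to pass an array of tokens?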

scikit-learn

tokenize

3 answers

Stack Overflow user

Answered 2018-10-17 12:45:52

Summarizing the answers of @user126350 and @miroli, and this link:

from sklearn.feature_extraction.text import CountVectorizer

def dummy(doc):
    return doc

cv = CountVectorizer(
        tokenizer=dummy,
        preprocessor=dummy,
    )  

docs = [
    ['hello', 'world', '.'],
    ['hello', 'world'],
    ['again', 'hello', 'world']
]

cv.fit(docs)
cv.get_feature_names_out()  # scikit-learn >= 1.0; get_feature_names() was removed in 1.2
# array(['.', 'again', 'hello', 'world'], dtype=object)

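One thing to keep in mind: wrap a new tokenized document in a list before calling transform(), so that it is handled as a single document instead of each token being interpreted as a separate document: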

new_doc = ['again', 'hello', 'world', '.']
v_1 = cv.transform(new_doc)
v_2 = cv.transform([new_doc])

v_1.shape
# (4, 4)

v_2.shape
# (1, 4)
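
To tie this back to the question, here is a minimal sketch (an addition, not part of the original answer) of tokenizing once and reusing the same tokens for both n-gram counts and a lexicon-presence feature. The whitespace tokenization, the `LEXICON` set, and the helper `lexicon_presence` are hypothetical names chosen for illustration; `token_pattern=None` is only there to silence the warning scikit-learn emits when a custom tokenizer is supplied.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

def identity(tokens):
    # Documents are already token lists, so pass them through unchanged.
    return tokens

raw_docs = ["hello world .", "hello world", "again hello world"]

# Tokenize exactly once (hypothetical whitespace tokenizer).
token_docs = [doc.split() for doc in raw_docs]

# n-gram counts built from the pre-tokenized documents;
# ngram_range still works because only tokenization is bypassed.
cv = CountVectorizer(tokenizer=identity, preprocessor=identity,
                     token_pattern=None, ngram_range=(1, 2))
X_ngrams = cv.fit_transform(token_docs)

# Hypothetical lexicon feature: 1 if any token is in the lexicon, else 0.
LEXICON = {"again", "goodbye"}

def lexicon_presence(tokens):
    return int(any(tok in LEXICON for tok in tokens))

X_lex = csr_matrix(np.array([[lexicon_presence(t)] for t in token_docs]))

# Stack both feature blocks column-wise into one design matrix.
X = hstack([X_ngrams, X_lex]).tocsr()
print(X.shape)
```

Because the vectorizer still runs its own n-gram step on the token lists, bigrams such as "hello world" appear in `cv.vocabulary_` alongside the unigrams.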

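20 votes

Stack Overflow user

Answered 2016-08-17 01:10:47

Generally, you can pass a custom tokenizer parameter to CountVectorizer. The tokenizer should be a function that takes a string and returns an array of its tokens. However, if you already have your tokens in arrays, you can build a dictionary that maps some arbitrary keys to the token arrays and have your tokenizer look the tokens up in that dictionary. Then, when you run CountVectorizer, just pass it the dictionary keys. For example: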

from sklearn.feature_extraction.text import CountVectorizer

# arbitrary token arrays and their keys
custom_tokens = {"hello world": ["here", "is", "world"],
                 "it is possible": ["yes it", "is"]}

CV = CountVectorizer(
    # so we can pass it strings
    input='content',
    # turn off preprocessing of strings to avoid corrupting our keys
    lowercase=False,
    preprocessor=lambda x: x,
    # use our token dictionary
    tokenizer=lambda key: custom_tokens[key])

CV.fit(custom_tokens.keys())
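
As a usage sketch (an addition, not from the original answer), the fitted vectorizer is then called with keys rather than token lists. The snippet repeats the setup so that it runs on its own; `token_pattern=None` is an assumption added only to suppress scikit-learn's unused-parameter warning.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same dictionary as in the answer: key -> pre-computed token list.
custom_tokens = {"hello world": ["here", "is", "world"],
                 "it is possible": ["yes it", "is"]}

cv = CountVectorizer(lowercase=False,
                     preprocessor=lambda x: x,
                     tokenizer=lambda key: custom_tokens[key],
                     token_pattern=None)
cv.fit(list(custom_tokens.keys()))

# The vocabulary is built from the token lists, not from the keys.
print(sorted(cv.vocabulary_))

# Transform by key: the tokenizer looks up the stored tokens.
row = cv.transform(["hello world"])
print(row.toarray())
```

A key that is missing from `custom_tokens` raises `KeyError` at transform time, so every document needs an entry in the dictionary before it is vectorized.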

3 votes

Stack Overflow user

Answered 2018-05-04 13:09:40

Similar to user126350's answer, but even simpler. Here is what I did:

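from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

def do_nothing(tokens):
    return tokens

# MyCustomTokenizer is the answerer's own transformer (not shown here);
# it is expected to turn each document string into a list of tokens.
pipe = Pipeline([
    ('tokenizer', MyCustomTokenizer()),
    ('vect', CountVectorizer(tokenizer=do_nothing,
                             preprocessor=None,
                             lowercase=False))
])

# fit before transforming; an unfitted CountVectorizer raises NotFittedError
doc_vects = pipe.fit_transform(my_docs)  # pass list of documents as strings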
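
The answer does not show `MyCustomTokenizer`. Below is a minimal sketch of what such a pipeline step could look like, assuming a simple whitespace split; the class name `WhitespaceTokenizer` and the sample documents are hypothetical stand-ins, and `token_pattern=None` is added only to silence scikit-learn's unused-parameter warning.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

class WhitespaceTokenizer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for MyCustomTokenizer: splits on whitespace."""

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn.
        return self

    def transform(self, X):
        # One list of tokens per input document string.
        return [doc.split() for doc in X]

def do_nothing(tokens):
    return tokens

pipe = Pipeline([
    ("tokenizer", WhitespaceTokenizer()),
    ("vect", CountVectorizer(tokenizer=do_nothing,
                             preprocessor=None,
                             lowercase=False,
                             token_pattern=None)),
])

my_docs = ["hello world .", "again hello world"]
doc_vects = pipe.fit_transform(my_docs)
print(doc_vects.shape)
```

Keeping the tokenizer as its own step means its output can also be fed to other feature extractors, which is what the question asks for.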

1 vote

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's IT-domain engine.

Original link:

https://stackoverflow.com/questions/35867484
