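I pass in two strings, for example: $1-2$ $3-4$ 5-6$ & $7-8$ $9-10$ $10-11$
In this case, the count_vocab function throws an error:
"empty vocabulary: perhaps the document contains only stop words"
So does it have a problem with the $ symbol?
Doesn't it treat $1-2$ as a token?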
Posted on 2014-07-24 13:42:12
What counts as a token is determined by the `token_pattern` parameter (a regular expression) of the CountVectorizer constructor:
Regular expression denoting what constitutes a "token", only used
if `tokenize == 'word'`. The default regexp select tokens of 2
or more alphanumeric characters (punctuation is completely ignored
and always treated as a token separator).
This clearly does not match what you have, so define a different regular expression for your data.
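To make the mismatch concrete, here is a minimal sketch that applies two patterns to the strings from the question using only the standard `re` module. It assumes the default `token_pattern` is `(?u)\b\w\w+\b` and that the tokenizer simply collects every match of the pattern. The replacement pattern `\S+` (every whitespace-separated chunk is a token) is only an illustrative choice, and `tokenize` is a hypothetical helper, not part of scikit-learn.

```python
import re

# Assumed default token_pattern of CountVectorizer: runs of 2 or more
# word characters, with punctuation acting as a separator.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

# Illustrative replacement (an assumption, not the only option):
# any run of non-whitespace characters is one token.
CUSTOM_PATTERN = r"\S+"

# The two strings from the question.
docs = ["$1-2$ $3-4$ 5-6$", "$7-8$ $9-10$ $10-11$"]


def tokenize(pattern, text):
    """Hypothetical helper: return every non-overlapping match of pattern."""
    return re.findall(pattern, text)


default_tokens = [tokenize(DEFAULT_PATTERN, d) for d in docs]
custom_tokens = [tokenize(CUSTOM_PATTERN, d) for d in docs]

# The default pattern drops "$" and "-" and every single-digit number,
# so a token such as "$1-2$" never appears.
print(default_tokens)

# The custom pattern keeps each whitespace-separated chunk intact.
print(custom_tokens)
```

If a pattern like this suits the data, it can be handed to the vectorizer as `CountVectorizer(token_pattern=r"\S+")`. A stricter pattern may be preferable when the input can contain other kinds of text.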
https://stackoverflow.com/questions/24909481