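I am working with the 20 newsgroups dataset in Python. I run CountVectorizer on it and then use the gensim API to compute augmented term frequency weights. When I try to fit the model, I get an error.
Here is my code: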
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(max_features=2000)
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)
X_train_counts = count_vect.fit_transform(twenty_train.data)
from gensim.sklearn_api import TfIdfTransformer
model = TfIdfTransformer(smartirs='atn')
tfidf_aug = model.fit_transform(X_train_counts)
After running the code above, I get this error:
TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]
After appending getnnz() at the end, as shown below:
tfidf_aug = model.fit_transform(X_train_counts.getnnz())
I get this error instead:
TypeError: 'int' object is not iterable
Answered on 2019-01-19 16:48:52
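As mentioned earlier, the input to TfIdfTransformer must be an iterable of documents, where each document is an iterable of (int, int) pairs, that is, (token_id, count). A SciPy sparse matrix is not in that form, and getnnz() only returns a single integer (the number of stored values), which is why both calls fail. You therefore have to convert the sparse matrix before passing it to the gensim model.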
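To make the expected input concrete, here is a minimal sketch on a toy matrix. The helper name `csr_to_bow` and the toy counts are made up for illustration and are not part of gensim or scikit-learn; the sketch only assumes NumPy and SciPy are installed.

```python
import numpy as np
from scipy.sparse import csr_matrix


def csr_to_bow(matrix):
    """Convert a SciPy sparse count matrix into gensim-style bag-of-words documents.

    Each row becomes a list of (token_id, count) tuples, where token_id is the
    column index of a stored entry in that row. (Hypothetical helper.)
    """
    matrix = csr_matrix(matrix)  # make sure indptr/indices/data are in CSR layout
    corpus = []
    # indptr[k]:indptr[k + 1] is the slice of indices/data that belongs to row k.
    for start, end in zip(matrix.indptr[:-1], matrix.indptr[1:]):
        ids = matrix.indices[start:end]
        counts = matrix.data[start:end]
        corpus.append([(int(i), int(c)) for i, c in zip(ids, counts)])
    return corpus


# Toy count matrix: 2 documents, 4 vocabulary terms (values invented for illustration).
toy_counts = csr_matrix(np.array([[2, 0, 1, 0],
                                  [0, 3, 0, 1]]))
toy_corpus = csr_to_bow(toy_counts)
print(toy_corpus)  # [[(0, 2), (2, 1)], [(1, 3), (3, 1)]]
```

Reading the rows through `indptr` avoids slicing the matrix once per document, which tends to matter on a corpus the size of 20 newsgroups. Note the pair order: the column index comes first and the count second.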
Try this:
# Note: gensim.sklearn_api was removed in gensim 4.0, so this import requires gensim 3.x.
from gensim.sklearn_api import TfIdfTransformer
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

twenty_train = fetch_20newsgroups(subset='train', shuffle=True)
count_vect = CountVectorizer(max_features=2000)
X_train_counts = count_vect.fit_transform(twenty_train.data)

# Turn each sparse row into a list of (token_id, count) tuples, the format gensim expects.
corpus = [list(zip(row.indices.tolist(), row.data.tolist())) for row in X_train_counts]

model = TfIdfTransformer(smartirs='atn')
tfidf_aug = model.fit_transform(corpus)

https://stackoverflow.com/questions/53821303