首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >TfidfVectorizer和SelectKBest出错

TfidfVectorizer和SelectKBest出错
EN

Stack Overflow用户
提问于 2020-09-25 05:29:56
回答 1查看 47关注 0票数 1

我正在尝试按照这个教程做一些情绪分析,我非常确定我的代码到这一点是完全相同的。然而,我的弓的值有很大的不同。

https://www.tensorscience.com/nlp/sentiment-analysis-tutorial-in-python-classifying-reviews-on-movies-and-products

这是我到现在为止的代码。

代码语言:javascript
复制
import nltk
import pandas as pd
import string
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2


def openFile(path):
    #param path: path/to/file.ext (str)
    #Returns contents of file (str)
    with open(path) as file:
        data = file.read()
    return data

imdb_data = openFile('C:/Users/Flengo/Desktop/sentiment/data/imdb_labelled.txt')
amzn_data = openFile('C:/Users/Flengo/Desktop/sentiment/data/amazon_cells_labelled.txt')
yelp_data = openFile('C:/Users/Flengo/Desktop/sentiment/data/yelp_labelled.txt')


datasets = [imdb_data, amzn_data, yelp_data]

combined_dataset = []
# separate samples from each other
for dataset in datasets:
    combined_dataset.extend(dataset.split('\n'))

# separate each label from each sample
dataset = [sample.split('\t') for sample in combined_dataset]


df = pd.DataFrame(data=dataset, columns=['Reviews', 'Labels'])
df = df[df["Labels"].notnull()]
df = df.sample(frac=1)


labels = df['Labels']
vectorizer = TfidfVectorizer(min_df=15)
bow = vectorizer.fit_transform(df['Reviews'])
len(vectorizer.get_feature_names())

selected_features = SelectKBest(chi2, k=200).fit(bow, labels).get_support(indices=True)
vectorizer = TfidfVectorizer(min_df=15, vocabulary=selected_features)
bow = vectorizer.fit_transform(df['Reviews'])

bow

这是我的结果。

这是本教程的结果。

我一直在尝试找出可能是什么问题,但我还没有任何进展。

EN

回答 1

Stack Overflow用户

回答已采纳

发布于 2020-09-25 06:26:59

问题是你提供的是索引,而不是一个真正的单词。

试试这个:

代码语言:javascript
复制
selected_features = SelectKBest(chi2, k=200).fit(bow, labels).get_support(indices=True)
vocabulary = np.array(vectorizer.get_feature_names())[selected_features]

vectorizer = TfidfVectorizer(min_df=15, vocabulary=vocabulary) # you need to supply a real vocab here

bow = vectorizer.fit_transform(df['Reviews'])
bow
<3000x200 sparse matrix of type '<class 'numpy.float64'>'
    with 12916 stored elements in Compressed Sparse Row format>
票数 1
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/64054698

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档