Q: RandomForestClassifier reaches only 50% accuracy on sentiment analysis of good and bad movie reviews

Stack Overflow user

Asked 2020-01-08 11:52:10

Answers: 3  Views: 288  Followers: 0  Votes: 1

I am trying to train a RandomForestClassifier to predict whether a review is good (1) or bad (0) based on word counts.

My training data, named all_train_set, looks like this:

                                                 Reviews  Labels
0      For fans of Chris Farley, this is probably his...       1
1      Fantastic, Madonna at her finest, the film is ...       1
2      From a perspective that it is possible to make...       1
3      What is often neglected about Harold Lloyd is ...       1
4      You'll either love or hate movies such as this...       1
                                              ...     ...
14995  This is perhaps the worst movie I have ever se...       0
14996  I was so looking forward to seeing this film t...       0
14997  It pains me to see an awesome movie turn into ...       0
14998  "Grande Ecole" is not an artful exploration of...       0
14999  I felt like I was watching an example of how n...       0

The test dataset has exactly the same format. The code I use to train my algorithm is as follows:

from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from sklearn.metrics import classification_report, accuracy_score
from sklearn.ensemble import RandomForestClassifier as rfc

stopwords=set(nltk.corpus.stopwords.words('english'))



tfidfconverter = TfidfVectorizer(max_features=1500, min_df=5, stop_words=stopwords)
X = tfidfconverter.fit_transform(all_train_set['Reviews']).toarray()
X_train = X
y_train = all_train_set['Labels']

classifier = rfc(n_estimators=1000, random_state=0)
classifier.fit(X_train, y_train)

tfidfconverter = TfidfVectorizer(max_features=1500, min_df=5, stop_words=stopwords)
X = tfidfconverter.fit_transform(all_test_set['Reviews']).toarray()
X_test = X
y_test = all_test_set['Labels']

#predicting on the test set and printing results
y_pred = classifier.predict(X_test)

print(classification_report(y_test,y_pred))
print(accuracy_score(y_test, y_pred))

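My overall accuracy is 0.5, which seems very poor. After this I tried a grid search to find the best parameters, but the overall accuracy was again exactly 0.5. The results are shown below: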

              precision    recall  f1-score   support
           0       0.50      0.70      0.58      2482
           1       0.50      0.30      0.37      2482
    accuracy                           0.50      4964
   macro avg       0.50      0.50      0.48      4964
weighted avg       0.50      0.50      0.48      4964

0.5

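Could someone explain whether this is a training error or simply a bad result? If it is the latter, how can I improve it?

I am new to machine learning, so I apologize if anything is unclear. I am happy to clarify, edit, or take suggestions on how to improve my question.

Many thanks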

python

scikit-learn

random-forest

3 Answers

Stack Overflow user

Answered 2020-01-08 13:21:53

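You can try a few things:

You should not use fit_transform on the test set, only transform. So don't reinitialize tfidfconverter.
You can try different data cleaning and hyperparameter tuning. On text data, algorithms like LinearSVC work well, but you can experiment with that.
Create new features from the text. Look for related examples on GitHub or in Kaggle kernels.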
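The first two points can be sketched as follows. This is a minimal illustration, not the asker's actual setup: the tiny corpus and the names train_reviews, test_reviews, train_labels and test_labels are made up to stand in for all_train_set and all_test_set, and LinearSVC is used only because the answer mentions it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Made-up stand-ins for all_train_set / all_test_set (assumption).
train_reviews = [
    "a wonderful film with a great cast",
    "great story and wonderful acting",
    "the worst movie I have ever seen",
    "awful plot and terrible acting",
]
train_labels = [1, 1, 0, 0]
test_reviews = ["wonderful acting and a great story", "terrible, the worst plot"]
test_labels = [1, 0]

# Fit the vectorizer once, on the training text only.
tfidfconverter = TfidfVectorizer(stop_words="english")
X_train = tfidfconverter.fit_transform(train_reviews)

classifier = LinearSVC(random_state=0)
classifier.fit(X_train, train_labels)

# Reuse the SAME fitted vectorizer on the test text: transform, not fit_transform.
# The test matrix then has the same columns, in the same order, as X_train.
X_test = tfidfconverter.transform(test_reviews)
y_pred = classifier.predict(X_test)
print(y_pred)
```

Because the vectorizer is never refitted, each column of X_test refers to the same word the classifier saw in that column during training.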

0 votes

Stack Overflow user

Answered 2020-01-08 14:54:02

First of all, as you defined:

X_train = X
X_test = X

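This is wrong: supervised learning algorithms (classification problems) always need a train/test split. You train your algorithm and then test it on unseen data! You can find the scikit-learn function here.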

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

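As Shwetea already mentioned, you also should not fit_transform your test data; you should only use transform. Otherwise you are using information from your test set, which the algorithm is not supposed to know.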
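There is also a more mechanical reason why refitting hurts, which a small sketch may make clearer. A vectorizer fitted on different text builds a different vocabulary, so the same column index can end up meaning a different word. The two-review corpora below are made up for illustration; only TfidfVectorizer and its vocabulary_ attribute come from scikit-learn.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up miniature corpora (assumption, for illustration only).
train_reviews = ["great film", "awful film"]
test_reviews = ["boring plot", "great plot"]

# One vectorizer fitted on the training text, one refitted on the test text.
train_vectorizer = TfidfVectorizer().fit(train_reviews)
refit_vectorizer = TfidfVectorizer().fit(test_reviews)

# vocabulary_ maps each word to its column index in the output matrix.
print(train_vectorizer.vocabulary_)
print(refit_vectorizer.vocabulary_)

# The word "great" lands in a different column in each vocabulary, so a model
# trained on the first matrix would read the second matrix's columns wrongly.
great_col_train = train_vectorizer.vocabulary_["great"]
great_col_refit = refit_vectorizer.vocabulary_["great"]

# Transforming the test text with the training vectorizer keeps the columns aligned;
# words that were not seen during fitting are simply ignored.
X_test_aligned = train_vectorizer.transform(test_reviews)
```

With mismatched columns the classifier is effectively predicting from scrambled features, which is consistent with accuracy sitting at chance level.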

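To judge how good your score is, you can use a Dummy Classifier, which always predicts the majority class. For example, if label 1 makes up only 10% of your labels, the dummy score will be 90% (because it always predicts 0). If your dummy score is close to or above your model's 0.5, your algorithm is performing really badly.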
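A minimal sketch of such a baseline, assuming scikit-learn's DummyClassifier with strategy="most_frequent". The label arrays are made up: one reproduces the 90/10 example above, the other mimics a balanced test set like the one in the question's report.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

def majority_baseline(y):
    """Accuracy of a classifier that always predicts the most frequent label."""
    X = np.zeros((len(y), 1))  # the dummy ignores the features entirely
    dummy = DummyClassifier(strategy="most_frequent")
    dummy.fit(X, y)
    return accuracy_score(y, dummy.predict(X))

# Made-up labels: 90% class 0 and 10% class 1, as in the example above.
imbalanced = np.array([0] * 90 + [1] * 10)
imbalanced_score = majority_baseline(imbalanced)

# Made-up balanced labels, half 0 and half 1.
balanced = np.array([0] * 50 + [1] * 50)
balanced_score = majority_baseline(balanced)

print(imbalanced_score, balanced_score)
```

On balanced labels the baseline is 0.5, so a model that also scores 0.5 has learned nothing usable from its features.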

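0 votes

Stack Overflow user

Answered 2020-01-08 15:54:41

There seems to be a very weak relationship (or none at all) between the features used to train the model and the labels, so your model performs very poorly. This is very common in machine learning experiments.

I suggest you first focus on generating a larger feature set (borrowing from academic papers or other similar projects). You can then use feature selection methods to choose the best features.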
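One possible way to combine a larger feature set with feature selection is sketched below. The choices here are assumptions for illustration, not something this answer prescribes: word bigrams as the extra features, a chi-squared test via SelectKBest as the selection method, k=10, and a six-review made-up corpus.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline

# Made-up reviews and labels (assumption, for illustration only).
reviews = [
    "a wonderful film with a great cast",
    "great story and wonderful acting",
    "one of the best movies of the year",
    "the worst movie I have ever seen",
    "awful plot and terrible acting",
    "a boring film and a waste of time",
]
labels = [1, 1, 1, 0, 0, 0]

pipeline = make_pipeline(
    # Larger feature set: single words plus two-word phrases.
    TfidfVectorizer(ngram_range=(1, 2)),
    # Keep only the 10 features most associated with the labels.
    SelectKBest(chi2, k=10),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
pipeline.fit(reviews, labels)

# make_pipeline names each step after its lowercased class name.
n_all = len(pipeline.named_steps["tfidfvectorizer"].vocabulary_)
n_kept = int(pipeline.named_steps["selectkbest"].get_support().sum())
print(n_all, n_kept)
```

Wrapping the steps in a pipeline also means the vectorizer and the selector are fitted on training data only and merely applied when predicting on new text.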

You can check out the following:

https://medium.com/@MarynaL/analyzing-movie-review-data-with-natural-language-processing-7c5cba6ed922

https://github.com/deepakchaudhari705/Sentimental-Analysis-on-a-Movie-Data-/blob/master/sentiment-checkpoint.ipynb

0 votes

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/59639180
