
Q: Is it better to train your own sentiment analysis model, or to use a pre-trained model such as VADER or TextBlob?

Stack Overflow user

Asked 2020-07-28 05:09:25

Answers: 1 · Views: 382 · Followers: 0 · Votes: 0

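I have a Python script that trains a sentiment analysis model on a dataset, using LogisticRegression with TF-IDF, cross-validation, bigrams and GridSearchCV. The text goes through a preprocessing phase first.

To compare models, I also tried a pre-trained model, VaderSentiment.

The results on real data were:

LogisticRegression accuracy: 64.2%
VaderSentiment accuracy: 85.7%

So where is the error in my trained model? Or is it simply better to use VaderSentiment for analyzing tweet sentiment?

Note that during training I got: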

Accuracy: 91.482%
Best parameters set found on development set:

{'bow__ngram_range': (1, 2), 'tfidf__use_idf': True}

Optimized model achieved an ROC of:  0.9998

LR model:

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, KFold
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

    # shuffle=True is needed for random_state to have any effect
    # (recent scikit-learn versions raise an error otherwise)
    cross_val = KFold(n_splits=3, shuffle=True, random_state=42)

    # create pipeline
    pipeline = Pipeline([
        ('bow', CountVectorizer(strip_accents='ascii',
                                # 'english' selects the built-in English list; ['english'] would only
                                # remove the literal word "english". Pass a custom list (e.g. Arabic
                                # stop words) depending on the content of the tested df.
                                stop_words='english',
                                lowercase=True)),  # strings to token integer counts
        ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
        ('classifier', LogisticRegression(C=15.075475376884423, penalty="l2")),  # logistic regression on the TF-IDF vectors
    ])

    # this is where we define the values for GridSearchCV to iterate over
    parameters = {'bow__ngram_range': [(1, 1), (1, 2)],
                  'tfidf__use_idf': (True, False),
                 }

    # x_train / y_train are the preprocessed training texts and labels (defined elsewhere in the script)
    clf = GridSearchCV(pipeline, param_grid=parameters, cv=cross_val, verbose=1, n_jobs=-1, scoring='roc_auc')
    clf.fit(x_train, y_train)

    # The original snippet called an undefined LR_Model here; the fitted GridSearchCV
    # object predicts with the best pipeline it found, so it is used instead.
    test_twtr_preds = clf.predict(test_twtr['processed_TEXT'])

VaderSentiment:

    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    # The lexicon must be downloaded once: nltk.download('vader_lexicon')

    analyser = SentimentIntensityAnalyzer()

    def print_sentiment_scores(text):
        """Map VADER's compound score to a sentiment label."""
        compound = analyser.polarity_scores(text)["compound"]  # calling the polarity analyzer
        if compound >= 0.05:
            return "positive"
        elif compound <= -0.05:  # the original tested <= 0.05, which only worked as a fall-through
            # Label casing is kept as in the original; it has to match the ground-truth labels.
            return "Negative"
        return "neutral"

    def_test_twtr_preds["Vader_Process"] = def_test_twtr_preds["processed_TEXT"].apply(print_sentiment_scores)

python

logistic-regression

sentiment-analysis

pre-trained-model

1 Answer

Stack Overflow user

Answered 2020-08-30 18:47:50

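One very useful thing to know when evaluating the accuracy of any system is the size of the evaluation set. 70% accuracy on 2000 or 3000 tweets is not the same as 70% accuracy on 50. So never forget to state the size of your data.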
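To make the point about sample size concrete, here is a minimal sketch that puts an approximate 95% confidence interval around the same 70% accuracy measured on 50 and on 3000 tweets. The Wilson score interval is my choice for illustration, not something the answer prescribes, and the counts (35 of 50, 2100 of 3000) are hypothetical.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Approximate Wilson score interval for an accuracy of correct/total.

    z=1.96 corresponds to roughly 95% confidence.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

# Hypothetical counts: the same 70% accuracy on two very different set sizes.
small = wilson_interval(35, 50)
large = wilson_interval(2100, 3000)

print(f"n=50:   {small[0]:.3f} to {small[1]:.3f}")
print(f"n=3000: {large[0]:.3f} to {large[1]:.3f}")
```

On 50 tweets the interval spans roughly 56% to 81%, while on 3000 tweets it narrows to roughly 68% to 72%, so a gap like 64.2% versus 85.7% only means something once the size of the test set is known.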

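On the other hand, you cannot expect the two algorithms to give similar results. Your algorithm belongs to the family of probabilistic classifiers, while VADER is rule-based and lexicon-based. That is the fundamental difference, so they will never give similar results, let alone identical ones.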
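As a rough illustration of what "rule-based and lexicon-based" means, the sketch below scores text from a fixed word list plus one negation rule, with no training step at all. It is not VADER's actual algorithm: `TOY_LEXICON`, `NEGATIONS` and `lexicon_label` are hypothetical names, and the word weights are made up for the example.

```python
# Hypothetical toy lexicon; VADER ships a much larger, human-rated one.
TOY_LEXICON = {
    "good": 1.0, "great": 2.0, "love": 2.0,
    "bad": -1.0, "awful": -2.0, "hate": -2.0,
}
NEGATIONS = {"not", "never", "no"}

def lexicon_label(text):
    """Label text by summing word scores; a negation flips the next word only."""
    score = 0.0
    negate = False
    for token in text.lower().split():
        word = token.strip(".,!?")
        if word in NEGATIONS:
            negate = True
            continue
        if word in TOY_LEXICON:
            value = TOY_LEXICON[word]
            score += -value if negate else value
        negate = False  # the rule only reaches the word right after the negation
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_label("I love this phone!"))
print(lexicon_label("I do not love this"))
print(lexicon_label("The bus arrives at noon"))
```

A scorer like this depends entirely on its word list and rules, whereas a logistic regression depends on what was in its training data, which is one reason the two can disagree on the same tweets.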

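If you want to improve the accuracy of your algorithm, you need very well-curated bags of words. These bags are usually prepared by developers who lack deep knowledge of the linguistic or cultural factors that shape how words are used. I use a simple classifier based on statistical methods (no machine learning) and get up to 87% accuracy on 3000 tweets. Considering that my algorithm runs in under a minute and needs no training, that difference is significant.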

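So your algorithm is not necessarily wrong, but you cannot compare systems that are built on different processes and evaluated on different datasets.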
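One way to make such a comparison fairer is to score both systems against the same labelled tweets and the same label set. The sketch below assumes hand-labelled gold labels and two lists of predictions; all the data is hypothetical, and the helper names (`normalise`, `accuracy`, `confusion_counts`) are mine. In a scikit-learn script, `sklearn.metrics.accuracy_score` and `confusion_matrix` serve the same purpose.

```python
from collections import Counter

def normalise(label):
    """Lower-case and trim a label so 'Negative' and 'negative' compare equal."""
    return str(label).strip().lower()

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    hits = sum(normalise(t) == normalise(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

def confusion_counts(y_true, y_pred):
    """Count (gold, predicted) pairs to see where a system goes wrong."""
    return Counter((normalise(t), normalise(p)) for t, p in zip(y_true, y_pred))

# Hypothetical data: the same five tweets, labelled once, scored by both systems.
gold = ["positive", "negative", "neutral", "negative", "positive"]
lr_preds = ["positive", "positive", "neutral", "negative", "negative"]
vader_preds = ["positive", "Negative", "neutral", "Negative", "neutral"]

print("LR accuracy:   ", accuracy(gold, lr_preds))
print("VADER accuracy:", accuracy(gold, vader_preds))
print(confusion_counts(gold, vader_preds))
```

The label normalisation matters here because the VADER snippet in the question returns "Negative" with a capital letter, and a plain string comparison would count every such prediction as wrong if the gold labels are lower-case. It is also worth checking that both systems predict the same classes: a binary classifier can never output "neutral", so any neutral tweets in the test set count against it.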

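A recommendation: prioritize whatever gives you the best accuracy. But if you want to improve your own method, start by improving the quality of your bag of words. That is the key.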

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/63127392
