
Writing VADER sentiment analysis results to a new CSV column

Stack Overflow user
Asked 2021-07-22 20:45:15
Answers: 2, Views: 784, Followers: 0, Votes: 0

I have a scraped CSV file of Trip Advisor reviews. It has the following columns:

person, title, rating, review, review_date.

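I would like this code to do the following:

  1. In the CSV, create a new column called "tarate".
  2. Populate "tarate" with 'pos', 'neg', or 'neut' by reading the numeric value in "rating": "tarate" == 'pos' if "rating" >= 40; "tarate" == 'neut' if "rating" == 30; "tarate" == 'neg' if "rating" <= 20.
  3. Run the "review" column through SentimentIntensityAnalyzer.
  4. Record the output in a new CSV column called "scores".
  5. Create a separate CSV column for the "compound" value, using a 'pos' and 'neg' classification.
  6. Run the sklearn.metrics tools to compare the Trip Advisor rating ("tarate") with "compound". This can just be printed.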

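Part of the code is based on http://akashsenta.com/blog/sentiment-analysis-with-vader-with-python/

Here is my CSV file: https://github.com/nsusmann/vadersentiment

I am getting some errors. I'm a beginner, and I think I'm getting tripped up on things like pointing to specific columns and the lambda function.

Here is the code: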

# open command prompt
# import nltk
# nltk.download()
# pip3 install pandas
# pip3 installs sci-kitlearn
# pip3 install matplotlib
# pip3 install seaborn
# pip3 install vaderSentiment
#pip3 install openpyxl

import pandas
import nltk
nltk.download([
    "vader_lexicon",
    "stopwords"])
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.sentiment.util import *
from nltk import tokenize
from nltk.corpus import stopwords
import pandas as pd
import numpy as np
from collections import Counter
import re
import math
import html
import sklearn
import sklearn.metrics as metrics
from sklearn.metrics import mutual_info_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS
import matplotlib.pyplot as plt
import seaborn as sns
from pprint import pprint
import openpyxl

# open the file to save the review
import csv
outputfile = open('D:\Documents\Archaeology\Projects\Patmos\TextAnalysis\Sentiment\scraped_cln_sent.csv', 'w', newline='')
df = csv.writer(outputfile)

#open Vader Sentiment Analyzer 
from nltk.sentiment.vader import SentimentIntensityAnalyzer

#make SIA into an object
analyzer = SentimentIntensityAnalyzer()

#create a new column called "tarate"
df['tarate'],
#populate column "tarate". write pos, neut, or neg per row based on column "rating" value
df.loc[df['rating'] >= 40, ['tarate'] == 'Pos',
df.loc[df['rating'] == 30, ['tarate'] == 'Neut',
df.loc[df['rating'] <= 20, ['tarate'] == 'Neg', 

#use polarity_scores() to get sentiment metrics and write them to new column "scores"
df.head['scores'] == df['review'].apply(lambda review: sid.polarity_scores['review'])

#extract the compound value of the polarity scores and write to new column "compound"
df['compound'] = df['scores'].apply(lambda d:d['compound'])

#using column "compound", determine whether the score is <0> and write new column "score" recording positive or negative
df['score'] = df['compound'].apply(lambda score: 'pos' if score >=0 else 'neg')
ta.file()
                                           
#get accuracy metrics. this will compare the trip advisor rating (text version recorded in column "tarate") to the sentiment analysis results in column "score"
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix                                           
accuracy_score(df['tarate'],df['score'])

print(classification_report(df['tarate'],df['score']))

2 Answers

Stack Overflow user

Answered 2021-07-22 21:35:37

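You don't need to create the new column before populating it. Also, you have spurious commas at the ends of your lines. Don't do that; in Python, a trailing comma at the end of an expression turns it into a tuple. Also remember that = is the assignment operator and == is the comparison operator.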
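To illustrate both points, here is a minimal sketch (the variable names are made up for the example): a trailing comma turns a plain value into a one-element tuple, and == compares while = assigns.

```python
# A trailing comma makes the right-hand side a one-element tuple.
label = 'Pos',
print(type(label).__name__)   # tuple, not str

# Without the comma the value stays a plain string.
clean_label = 'Pos'
print(type(clean_label).__name__)

# "=" stores a value; "==" only tests for equality and returns a bool.
x = 5
is_five = (x == 5)
print(is_five)
```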

The pandas "loc" indexer takes a row indexer and a column indexer:

#populate column "tarate". write pos, neut, or neg per row based on column "rating" value
df.loc[df['rating'] >= 40, 'tarate'] = 'Pos'
df.loc[df['rating'] == 30, 'tarate'] = 'Neut'
df.loc[df['rating'] <= 20, 'tarate'] = 'Neg'

Note that this will leave NaN (not a number) in the column for any rating strictly between 20 and 30 or strictly between 30 and 40.
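A small sketch of that gap, using a made-up demo dataframe (the rating values are assumptions chosen to hit each case):

```python
import pandas as pd

# Demo data only: 25 and 35 fall between the three conditions.
df = pd.DataFrame({'rating': [10, 20, 25, 30, 35, 40, 50]})

df.loc[df['rating'] >= 40, 'tarate'] = 'Pos'
df.loc[df['rating'] == 30, 'tarate'] = 'Neut'
df.loc[df['rating'] <= 20, 'tarate'] = 'Neg'

# Rows that matched none of the conditions are left as NaN.
unlabeled = df[df['tarate'].isna()]
print(unlabeled['rating'].tolist())
```

If the real ratings only ever take the values 10, 20, 30, 40, and 50, no row is left unlabeled; checking with isna() as above is a cheap way to confirm that.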

I don't know what you're trying to do here, but this isn't right:

#extract the compound value of the polarity scores and write to new column "compound"
df['compound'] = df['scores'].apply(lambda d:d['compound'])

df['scores'] will not contain a column called "compound", which is what you are asking for in your lambda.
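One caveat, offered as a sketch rather than a verdict: if each cell of df['scores'] holds the dict that polarity_scores() returns (with the keys 'neg', 'neu', 'pos', and 'compound'), then the lambda indexes that dict by key rather than selecting a column. The dicts below are hand-written stand-ins, not real VADER output:

```python
import pandas as pd

# Hand-written stand-ins for polarity_scores() results (not real VADER output).
df = pd.DataFrame({
    'scores': [
        {'neg': 0.0, 'neu': 0.3, 'pos': 0.7, 'compound': 0.8},
        {'neg': 0.6, 'neu': 0.4, 'pos': 0.0, 'compound': -0.5},
    ]
})

# Each cell is a dict, so d['compound'] is a key lookup on that dict.
df['compound'] = df['scores'].apply(lambda d: d['compound'])

# Classify by the sign of the compound value, as in the question.
df['score'] = df['compound'].apply(lambda c: 'pos' if c >= 0 else 'neg')

print(df[['compound', 'score']])
```

Whether this applies depends on how the "scores" column was filled; in the question's code that line itself has errors (df.head[...], ==, sid instead of analyzer), so the column would not exist yet.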

Votes: 1

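Stack Overflow user

Answered 2021-07-22 21:42:51

I suggest looking up list comprehensions, and googling "pandas apply method" and "pandas lambda examples" to get more familiar with them.

Here is some example code: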

import pandas as pd

#create a demo dataframe called 'df'
df = pd.DataFrame({'rating': [12, 42, 40, 30, 31, 56, 8, 88, 39, 79]})

This gives you a dataframe that looks like this (a single column named 'rating' containing integers):

   rating
0      12
1      42
2      40
3      30
4      31
5      56
6       8
7      88
8      39
9      79

To use that column to create another column based on its values, you can do something like this:

#create a new column called 'tarate' and using a list comprehension
#assign a string value of either 'pos', 'neut', or 'neg' based on the 
#numeric value in the 'rating' column (it does this going row by row)
df['tarate'] = ['pos' if x >= 40 else 'neut' if x == 30 else 'neg' for x in df['rating']]

#output the df
print(df)

Output:

   rating tarate
0      12    neg
1      42    pos
2      40    pos
3      30   neut
4      31    neg
5      56    pos
6       8    neg
7      88    pos
8      39    neg
9      79    pos
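
Note that the else branch labels every rating that is neither >= 40 nor exactly 30 as 'neg', including 31 and 39. If the in-between values should be kept apart, one possible alternative is numpy.select with an explicit default. This is a sketch; the 'unrated' label is an assumption, not something from the question:

```python
import numpy as np
import pandas as pd

# Same demo dataframe as above.
df = pd.DataFrame({'rating': [12, 42, 40, 30, 31, 56, 8, 88, 39, 79]})

# One boolean condition per label, checked in order.
conditions = [
    df['rating'] >= 40,
    df['rating'] == 30,
    df['rating'] <= 20,
]
labels = ['pos', 'neut', 'neg']

# Rows matching no condition get the default label (hypothetical name).
df['tarate'] = np.select(conditions, labels, default='unrated')

print(df)
```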
Votes: 1
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/68491294
