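I found some code on this site (https://towardsdatascience.com/creating-the-twitter-sentiment-analysis-program-in-python-with-naive-bayes-classification-672e5589a7ed) for running sentiment analysis on Twitter. I already have the CSV files I need, so instead of building them I defined the variables directly from the files.

When I run the code, it fails on this line:

    preprocessedTrainingSet = tweetProcessor.processTweets(trainingData)

and the traceback points back to this line:

    processedTweets.append((self._processTweet(tweet["text"]),tweet["label"]))

I don't know how to get around this while keeping the core functionality of the code intact.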

    import pandas as pd
    import re
    from nltk.tokenize import word_tokenize
    from string import punctuation
    from nltk.corpus import stopwords
    import twitter
    import csv
    import time
    import nltk
    nltk.download('stopwords')

    testDataSet = pd.read_csv("Twitter data.csv")
    print(testDataSet[0:4])
    trainingData = pd.read_csv("full-corpus.csv")
    print(trainingData[0:4])

    class PreProcessTweets:
        def __init__(self):
            self._stopwords = set(stopwords.words('english') + list(punctuation) + ['AT_USER','URL'])

        def processTweets(self, list_of_tweets):
            processedTweets=[]
            for tweet in list_of_tweets:
                processedTweets.append((self._processTweet(tweet["text"]),tweet["label"]))
            return processedTweets

        def _processTweet(self, tweet):
            tweet = tweet.lower() # convert text to lower-case
            tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet) # replace URLs with the token URL
            tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet) # replace usernames with the token AT_USER
            tweet = re.sub(r'#([^\s]+)', r'\1', tweet) # remove the # in #hashtag
            tweet = word_tokenize(tweet) # split the text into word tokens
            return [word for word in tweet if word not in self._stopwords]

    tweetProcessor = PreProcessTweets()
    preprocessedTrainingSet = tweetProcessor.processTweets(trainingData)
    preprocessedTestSet = tweetProcessor.processTweets(testDataSet)

Posted on 2019-05-16 17:46:08
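It's hard to tell without your actual data, but I think you are mixing up several types. pd.read_csv returns a DataFrame, and looping over a DataFrame directly yields its column labels rather than its rows. In processTweets you then use the keys 'text' and 'label' to access the values of each 'tweet'. However, I doubt you actually have a dictionary there.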
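
To illustrate the point about types, here is a minimal sketch of what the loop would see. It assumes the CSV has columns named text and label (an assumption, since the actual files are not shown); the small DataFrame and the names items and error_name are made up for the demonstration.

```python
import pandas as pd

# Hypothetical stand-in for the CSV contents; the real files may differ.
df = pd.DataFrame({
    "text": ["I love this phone", "Worst service ever"],
    "label": ["positive", "negative"],
})

# Iterating over a DataFrame yields its column labels, not its rows.
items = list(df)
print(items)

# Each item is a plain string, so indexing it with a string key fails.
error_name = None
try:
    items[0]["text"]
except TypeError as exc:
    error_name = type(exc).__name__
    print(error_name, exc)
```

If the real data behaves the same way, each tweet in the loop is a column name such as "text", and tweet["text"] is an attempt to index a string with a string.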
I downloaded some tweets from this site. Using that data, I tested your code and made the following adjustments.

    import pandas as pd
    import re
    from nltk.tokenize import word_tokenize
    from string import punctuation
    from nltk.corpus import stopwords
    import nltk
    # I also had to install 'punkt', which word_tokenize needs
    nltk.download('punkt')
    nltk.download('stopwords')

    testDataSet = pd.read_csv("data.csv")
    # To check that the code works I only used a test data set, and no trainingData.

    class PreProcessTweets:
        def __init__(self):
            self._stopwords = set(stopwords.words('english') + list(punctuation) + ['AT_USER','URL'])

        # To make it clear, I renamed the parameter to df_of_tweets (df = dataframe)
        def processTweets(self, df_of_tweets):
            processedTweets=[]
            # turn the dataframe columns into lists
            # my data did not have a label column, so I used sentiment instead
            list_of_tweets = df_of_tweets.text.tolist()
            list_of_sentiment = df_of_tweets.sentiment.tolist()
            # enumerate keeps track of each tweet's index so it can be used to index the list of sentiment
            for index, tweet in enumerate(list_of_tweets):
                # adjusted the code here so that it takes the values from the lists directly
                processedTweets.append((self._processTweet(tweet), list_of_sentiment[index]))
            return processedTweets

        def _processTweet(self, tweet):
            tweet = tweet.lower() # convert text to lower-case
            tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet) # replace URLs with the token URL
            tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet) # replace usernames with the token AT_USER
            tweet = re.sub(r'#([^\s]+)', r'\1', tweet) # remove the # in #hashtag
            tweet = word_tokenize(tweet) # split the text into word tokens
            return [word for word in tweet if word not in self._stopwords]

    tweetProcessor = PreProcessTweets()
    preprocessedTestSet = tweetProcessor.processTweets(testDataSet)
    print(preprocessedTestSet)

Hope it helps!
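
As a possible variation, the original tweet["text"] / tweet["label"] loop can be left as it is if the DataFrame is first converted to a list of row dictionaries with DataFrame.to_dict("records"). This is only a sketch under the assumption that the columns are named text and label; process_tweets and the str.lower stand-in for _processTweet are hypothetical simplifications so the snippet runs without any NLTK downloads.

```python
import pandas as pd

# Hypothetical stand-in for the CSV contents; column names are assumed.
df = pd.DataFrame({
    "text": ["I love this phone", "Worst service ever"],
    "label": ["positive", "negative"],
})

def process_tweets(list_of_tweets, process=str.lower):
    """Same loop shape as processTweets; `process` stands in for _processTweet."""
    processed = []
    for tweet in list_of_tweets:
        # Each tweet is now a dict, so key access works as the loop expects.
        processed.append((process(tweet["text"]), tweet["label"]))
    return processed

# One dict per row, e.g. {"text": "...", "label": "..."}
records = df.to_dict("records")
result = process_tweets(records)
print(result)
```

With the class from the question, the equivalent call would be tweetProcessor.processTweets(trainingData.to_dict("records")), again assuming those column names exist in the file.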
https://stackoverflow.com/questions/56150680