I have a list:
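['Due to the storms this weekend, we have rescheduled the Blumenfield Bike Ride for February 26th. Hope to see you there.\xe2\x80\xa6', 'With plenty of sun this weekend, take the Beach Bus from Woodland Hills to the beach for just $\xe2\x80\xa6', 'RT @LHansenLA: Yesterday at @LAPPL @EagleandBadge, saw the new platform at the end of the Watch memorial wall. Tribute to fallen @LAPD w/\xe2\x80\xa6', 'Happy to join Art Sherman and Wings Over @Wendys to honor veterans & the 15th anniversary of the weekly meeting hosted by Ron and\xe2\x80\xa6', 'Join me for the 4th Blumenfield Bike Ride. Enjoy the West Valley on 2 wheels. RSVP:']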
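As you can see, the list unfortunately shows the literal UTF-8 escape sequences rather than the characters themselves. At one point in my code I encode the text to UTF-8: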
outtweets = [[str(tweet.text.encode("utf-8"))] for tweet in correct_date_tweet]
outtweets = [[stuff.replace("b\'", "")] for sublist in outtweets for stuff in sublist]
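outtweets = [[stuff.replace('b\"', "")] for sublist in outtweets for stuff in sublist]
All of the code above is needed to remove the b prefixes. They can't appear in my tweets, because I'm doing machine learning analysis and the stray b's would affect it.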
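My Question
How do I replace the UTF-8 escape sequences with the actual characters?
I need to do this in code, because I'm pulling tweets from (3 cities) x (50 officials) x (12 months of tweets each), so replacing them by hand would be impossibly inefficient.
Code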
import tweepy #https://github.com/tweepy/tweepy
#Twitter API credentials
consumer_key = "insert key here"
consumer_secret = "insert key here"
access_key = "insert key here"
access_secret = "insert key here"
#authorize twitter, initialize tweepy
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)
#!/usr/bin/env python
# encoding: utf-8
import tweepy #https://github.com/tweepy/tweepy
import json
import csv
import datetime
from datetime import datetime
import os.path
failed_accounts = []
def get_all_tweets(screen_name,mode):
    #try:
    #Twitter only allows access to a users most recent 3240 tweets with this method
    #initialize a list to hold all the tweepy Tweets
    alltweets = []
    #make initial request for most recent tweets (200 is the maximum allowed count)
    new_tweets = api.user_timeline(screen_name = screen_name,count=200)
    #save most recent tweets
    alltweets.extend(new_tweets)
    #save the id of the oldest tweet less one
    oldest = alltweets[-1].id - 1
    i = 0
    num_req = 0
    #keep grabbing tweets until there are no tweets left to grab
    while len(new_tweets) > 0:
        #all subsiquent requests use the max_id param to prevent duplicates
        new_tweets = api.user_timeline(screen_name = screen_name,count=200,max_id=oldest)
        #save most recent tweets
        alltweets.extend(new_tweets)
        #update the id of the oldest tweet less one
        oldest = alltweets[-1].id - 1
        print ("...%s tweets downloaded so far" % (len(alltweets)))
        num_req = num_req + 1
        # makes further requests only if batch doesn't contain tweets beyond oldest limit
        oldest_limit = datetime(2016, 1, 20,0,0,0)
        x = 0
        for tweet in new_tweets:
            raw_date = tweet.created_at
            if raw_date < oldest_limit:
                x = 1
            else:
                continue
        if x == 1:
            break
        #BSP this script is designed to just keep going. I want it to stop.
        #i = i + 1
        #if i == 10:
        #    break
    print("Number of Tweet Request Rounds: %s" %num_req)
    correct_date_tweet = []
    for tweet in alltweets:
        raw_date = tweet.created_at
        #date = datetime.strptime(raw_date, "%Y-%m-%d %H:%M:%S")
        newest_limit = datetime(2017, 1, 20,0,0,0)
        oldest_limit = datetime(2016, 1, 20,0,0,0)
        if raw_date > oldest_limit and raw_date < newest_limit:
            correct_date_tweet.append(tweet)
        else:
            continue
    #transform the tweepy tweets into a 2D array that will populate the csv
    if mode == "tweets only" or "instance file":
        outtweets = [[str(tweet.text.encode("utf-8"))] for tweet in correct_date_tweet]
        outtweets = [[stuff.replace("b\'", "")] for sublist in outtweets for stuff in sublist]
        outtweets = [[stuff.replace('b\"', "")] for sublist in outtweets for stuff in sublist]
        outtweets = [["1 ",stuff.replace('"', "")] for sublist in outtweets for stuff in sublist]
        #outtweets = [["1 ",stuff] for sublist in outtweets for stuff in sublist]
    else:
        outtweets = [[tweet.id_str, tweet.created_at, tweet.text.encode("utf-8"),tweet.retweet_count,tweet.favorite_count,len(tweet.entities.get("hashtags")),len(tweet.entities.get("urls")),len(tweet.entities.get("user_mentions"))] for tweet in correct_date_tweet]
    #write the csv
    if mode == "instance file":
        with open(os.path.join(save_location,'%s.instance' % screen_name), mode ='w') as f:
            writer = csv.writer(f)
            writer.writerows(outtweets)
    else:
        with open(os.path.join(save_location,'%s.csv' % screen_name), 'w',encoding='utf-8') as f:
            writer = csv.writer(f)
            if mode != "tweets only":
                writer.writerow(["id","created_at","text","retweets","favorites","hashtags","urls"])
            writer.writerows(outtweets)
    pass
    print("Done with %s" % screen_name)
get_all_tweets("BobBlumenfield","instance file")
Update
Based on an answer, I tried changing one of the lines to
outtweets = [[tweet.text] for tweet in correct_date_tweet]
But that didn't work, because it produced
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-12-a864b5efe8af> in <module>()
----> 1 get_all_tweets("BobBlumenfield","instance file")
<ipython-input-9-d0b9b37c7261> in get_all_tweets(screen_name, mode)
104 with open(os.path.join(save_location,'%s.instance' % screen_name), mode ='w') as f:
105 writer = csv.writer(f)
--> 106 writer.writerows(outtweets)
107 else:
108 with open(os.path.join(save_location,'%s.csv' % screen_name), 'w',encoding='utf-8') as f:
C:\Users\Stan Shunpike\Anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
17 class IncrementalEncoder(codecs.IncrementalEncoder):
18 def encode(self, input, final=False):
---> 19 return codecs.charmap_encode(input,self.errors,encoding_table)[0]
20
21 class IncrementalDecoder(codecs.IncrementalDecoder):
UnicodeEncodeError: 'charmap' codec can't encode characters in position 64-65: character maps to <undefined>
Posted on 2017-02-03 18:45:22
Remove the following line:
outtweets = [[str(tweet.text.encode("utf-8"))] for tweet in correct_date_tweet]
Here's why:
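Calling .encode("utf-8") turns the text into a byte string, hence the b.
Calling str on that byte string without an encoding gives you the representation of the bytes object (including its type), hence the b prefix and the UTF-8 escapes.
open() has a built-in encoder, so you rarely need to call .encode() yourself. When you use open() in text mode, always specify the encoding, because the default differs from platform to platform.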
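The two effects described above can be reproduced in a few lines. This is only a minimal sketch for Python 3; the sample text is a stand-in for tweet.text, not data taken from the question.

```python
# Stand-in for tweet.text: a str ending in a horizontal ellipsis (U+2026).
text = "Hope to see you there.\u2026"

encoded = text.encode("utf-8")      # bytes, which is where the b comes from
mangled = str(encoded)              # the repr of the bytes object, as text
decoded = encoded.decode("utf-8")   # decoding gives the original str back

print(type(encoded).__name__)
print(mangled)
print(decoded == text)
```

The mangled value is ordinary text that happens to start with b' and contains backslash escapes, which is why the later replace() calls only hide part of the problem.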
Remove all other uses of .encode() from your code.
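With .encode() gone, the strings can go straight to the csv writer as long as the file is opened with an explicit encoding. The sketch below assumes Python 3; write_tweets, the sample strings and the temporary path are hypothetical names used only for illustration. Passing encoding='utf-8' in every open() call is also what should avoid the cp1252 'charmap' error shown in the traceback.

```python
import csv
import os
import tempfile


def write_tweets(path, texts):
    """Write one tweet text per row, letting open() do the encoding."""
    # newline="" is the setting the csv module documentation recommends.
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerows([text] for text in texts)


# Hypothetical sample data standing in for tweet.text values.
sample = ["Hope to see you there.\u2026", "Enjoy the West Valley on 2 wheels."]
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
write_tweets(path, sample)

# Read the file back with the same encoding to check the round trip.
with open(path, encoding="utf-8", newline="") as f:
    rows = [row[0] for row in csv.reader(f)]
```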
You can now remove the other lines that tried to correct the error.
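If some text was already saved in the mangled form, re-downloading is the safer route. As a rough sketch only, and assuming the b prefix and the surrounding quotes are still intact (the replace() calls in the question strip them, so this would not apply to their output), the repr can be parsed back into bytes and decoded. The helper name recover is hypothetical.

```python
import ast


def recover(mangled):
    """Turn the text of a bytes repr, e.g. "b'...'", back into a str."""
    value = ast.literal_eval(mangled)   # safely parses the literal
    if not isinstance(value, bytes):
        raise ValueError("expected the repr of a bytes object")
    return value.decode("utf-8")


original = "Hope to see you there.\u2026"
mangled = str(original.encode("utf-8"))
recovered = recover(mangled)
print(recovered == original)
```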
https://stackoverflow.com/questions/41971013