Trouble encoding UTF-8 characters in my Python code: they show up as literal UTF-8 escape sequences

Stack Overflow user
Asked 2017-02-01 02:36:35

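I have a list:

['Due to the storm this weekend, we have rescheduled the Blumenfield Bike Ride to Feb 26. Hope to see you there.\xe2\x80\xa6', 'Plenty of sunshine this weekend, take the Beach Bus from Woodland Hills to the beach for only $\xe2\x80\xa6', 'RT @LHansenLA: Yesterday at @LAPPL @EagleandBadge, saw the new platform at the End of Watch Memorial Wall. To fallen @LAPD w/\xe2\x80\xa6', 'Happy to join Art Sherman and Wings Over @Wendys to honor veterans & the 15th anniversary of the weekly meeting hosted by Ron and\xe2\x80\xa6', 'Join me for the 4th Blumenfield Bike Ride. Enjoy the West Valley on 2 wheels. Reply: ']

As you can see, the strings unfortunately contain literal UTF-8 escape sequences instead of the characters themselves. At one point in my code I encode the text to UTF-8: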

outtweets = [[str(tweet.text.encode("utf-8"))] for tweet in correct_date_tweet]
outtweets = [[stuff.replace("b\'", "")] for sublist in outtweets for stuff in sublist]
outtweets = [[stuff.replace('b\"', "")] for sublist in outtweets for stuff in sublist]

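All of the code above is needed to remove the b prefix. The prefixes cannot stay in my tweets because I am doing machine-learning analysis and the b's would skew it.

My question

How do I replace the UTF-8 escape sequences with the actual characters?

I need to do this in code because I am pulling tweets from (3 cities) x (50 officials) x (12 months of tweets each), so replacing them by hand would be impossibly inefficient.

Code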

import tweepy #https://github.com/tweepy/tweepy

#Twitter API credentials
consumer_key = "insert key here"
consumer_secret = "insert key here"
access_key = "insert key here"
access_secret = "insert key here"

#authorize twitter, initialize tweepy
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)



#!/usr/bin/env python
# encoding: utf-8

import tweepy #https://github.com/tweepy/tweepy
import json
import csv
import datetime
from datetime import datetime
import os.path
failed_accounts = []
save_location = "."  # assumption: output directory; not defined in the original post

def get_all_tweets(screen_name,mode):

    #try:
        #Twitter only allows access to a user's most recent 3200 tweets with this method

        #initialize a list to hold all the tweepy Tweets
        alltweets = []    

        #make initial request for most recent tweets (200 is the maximum allowed count)
        new_tweets = api.user_timeline(screen_name = screen_name,count=200)

        #save most recent tweets
        alltweets.extend(new_tweets)

        #save the id of the oldest tweet less one
        oldest = alltweets[-1].id - 1
        i = 0

        num_req = 0
        #keep grabbing tweets until there are no tweets left to grab
        while len(new_tweets) > 0:

            #all subsequent requests use the max_id param to prevent duplicates
            new_tweets = api.user_timeline(screen_name = screen_name,count=200,max_id=oldest)

            #save most recent tweets
            alltweets.extend(new_tweets)

            #update the id of the oldest tweet less one
            oldest = alltweets[-1].id - 1

            print ("...%s tweets downloaded so far" % (len(alltweets)))


            num_req = num_req + 1

            # makes further requests only if batch doesn't contain tweets beyond oldest limit
            oldest_limit = datetime(2016, 1, 20,0,0,0) 



            x = 0 


            for tweet in new_tweets: 
                raw_date = tweet.created_at
                if raw_date < oldest_limit:
                    x = 1
                else:
                    continue

            if x == 1:
                break

            #BSP this script is designed to just keep going. I want it to stop. 
            #i = i + 1 

            #if i == 10:
            #    break




        print("Number of Tweet Request Rounds: %s" %num_req)
        correct_date_tweet = []

        for tweet in alltweets:
            raw_date = tweet.created_at
            #date = datetime.strptime(raw_date, "%Y-%m-%d %H:%M:%S")

            newest_limit = datetime(2017, 1, 20,0,0,0)
            oldest_limit = datetime(2016, 1, 20,0,0,0) 

            if  raw_date > oldest_limit and raw_date < newest_limit: 
                correct_date_tweet.append(tweet)
            else:
                continue


        #transform the tweepy tweets into a 2D array that will populate the csv
        if mode in ("tweets only", "instance file"):  # the original `== ... or "instance file"` was always true
            outtweets = [[str(tweet.text.encode("utf-8"))] for tweet in correct_date_tweet]            
            outtweets = [[stuff.replace("b\'", "")] for sublist in outtweets for stuff in sublist]
            outtweets = [[stuff.replace('b\"', "")] for sublist in outtweets for stuff in sublist]
            outtweets = [["1   ",stuff.replace('"', "")] for sublist in outtweets for stuff in sublist]
            #outtweets = [["1   ",stuff] for sublist in outtweets for stuff in sublist]
        else: 
            outtweets = [[tweet.id_str, tweet.created_at, tweet.text.encode("utf-8"),tweet.retweet_count,tweet.favorite_count,len(tweet.entities.get("hashtags")),len(tweet.entities.get("urls")),len(tweet.entities.get("user_mentions"))] for tweet in correct_date_tweet]    

        #write the csv
        if mode == "instance file":
            with open(os.path.join(save_location,'%s.instance' % screen_name), mode ='w') as f:
                writer = csv.writer(f) 
                writer.writerows(outtweets)
        else:
            with open(os.path.join(save_location,'%s.csv' % screen_name), 'w',encoding='utf-8') as f:
                writer = csv.writer(f)
                if mode != "tweets only":
                    writer.writerow(["id","created_at","text","retweets","favorites","hashtags","urls"])    
                writer.writerows(outtweets)

        pass
        print("Done with %s" % screen_name)

get_all_tweets("BobBlumenfield","instance file")

Update

Based on an answer, I tried changing one of the lines to outtweets = [[tweet.text] for tweet in correct_date_tweet]

But that did not work, because it produced

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-12-a864b5efe8af> in <module>()
----> 1 get_all_tweets("BobBlumenfield","instance file")

<ipython-input-9-d0b9b37c7261> in get_all_tweets(screen_name, mode)
    104             with open(os.path.join(save_location,'%s.instance' % screen_name), mode ='w') as f:
    105                 writer = csv.writer(f)
--> 106                 writer.writerows(outtweets)
    107         else:
    108             with open(os.path.join(save_location,'%s.csv' % screen_name), 'w',encoding='utf-8') as f:

C:\Users\Stan Shunpike\Anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
     17 class IncrementalEncoder(codecs.IncrementalEncoder):
     18     def encode(self, input, final=False):
---> 19         return codecs.charmap_encode(input,self.errors,encoding_table)[0]
     20 
     21 class IncrementalDecoder(codecs.IncrementalDecoder):

UnicodeEncodeError: 'charmap' codec can't encode characters in position 64-65: character maps to <undefined>

1 Answer

Stack Overflow user

Accepted answer

Posted 2017-02-03 18:45:22

Remove the following line:

outtweets = [[str(tweet.text.encode("utf-8"))] for tweet in correct_date_tweet] 

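Here is why:

  1. You are encoding to a byte string. That is where the b comes from.
  2. You are calling str without specifying an encoding. In that mode you get the representation of the bytes object (including its type), hence the b and the UTF-8 escapes.
  3. There is no need to encode in the middle of your code. Encode only when writing to a file or the network (not when printing). If you use the encoder built into open(), you rarely need to call .encode() yourself.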
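
The first two points can be reproduced in isolation. This is a minimal sketch with a made-up tweet text (not one of the asker's tweets); it only illustrates what str() does to a bytes object in Python 3.

```python
# Illustrative text (an assumption, not from the question's data);
# the last character is U+2026, the horizontal ellipsis.
text = "Hope to see you there\u2026"

encoded = text.encode("utf-8")  # a bytes object ending in \xe2\x80\xa6
as_repr = str(encoded)          # str() without an encoding returns the repr of the bytes

print(as_repr)                  # the b prefix, quotes and \x escapes are now literal characters
print(encoded.decode("utf-8"))  # decoding, not str(), turns the bytes back into text
```

Stripping the b prefix afterwards only hides part of the damage; the escape sequences remain in the string as plain backslash characters.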

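When you use open() in text mode, always specify the encoding, because the default differs from platform to platform.

Remove all other uses of .encode() from the code.

You can then delete the other lines that were trying to patch up the mistake.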
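
The traceback in the question comes from the "instance file" branch, where open() is called without an encoding and falls back to cp1252 on that machine. The sketch below, using made-up data and a temporary file (both assumptions), shows a character that cp1252 cannot encode being written without error once the encoding is given explicitly.

```python
import csv
import os
import tempfile

# Illustrative row (an assumption): an ellipsis plus a bicycle emoji.
rows = [["1   ", "Hope to see you there\u2026 \U0001F6B2"]]

# cp1252 has no mapping for the emoji, which is the kind of failure
# reported in the question's traceback.
try:
    rows[0][1].encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Passing encoding="utf-8" lets open() do the encoding at write time.
# newline="" is what the csv module documentation recommends for csv files.
path = os.path.join(tempfile.mkdtemp(), "example.instance")
with open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows(rows)

# Reading back with the same encoding returns the original text.
with open(path, encoding="utf-8", newline="") as f:
    read_back = list(csv.reader(f))
```

In the asker's function this would probably amount to adding encoding='utf-8' to the open() call for the .instance file, as the .csv branch already does.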
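
Put together, the transformation step might shrink to a single comprehension. This is a sketch, not the asker's final code: the Tweet namedtuple is a hypothetical stand-in for tweepy's status objects (only .text is used), and the "1   " label column is carried over from the question's code.

```python
from collections import namedtuple

# Hypothetical stand-in for the tweepy objects in correct_date_tweet.
Tweet = namedtuple("Tweet", ["text"])
correct_date_tweet = [
    Tweet("Hope to see you there\u2026"),
    Tweet("Enjoy the West Valley on 2 wheels."),
]

# Keep the text as str: no .encode(), no str(), no b-prefix cleanup.
outtweets = [["1   ", tweet.text] for tweet in correct_date_tweet]
```

Whether the double-quote stripping from the question's code is still wanted is a separate decision; csv.writer quotes fields on its own when needed.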

Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/41971013
