Q: UnicodeDecodeError with the utf8 codec on Python 2.7

Stack Overflow user

Asked on 2017-05-24 17:30:41

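I have built a scraper that reads artist names from a csv file and collects artist data for those artists through the Songkick API. However, after my code has been running for a while, I get the following error: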

  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 64-65: invalid continuation byte

Sample data can be downloaded here.

I am fairly new to encodings and would like to know how to fix this error. You can find my code below.

import urllib2
import requests
import json
import csv

from tinydb import TinyDB, Query
db = TinyDB('spotify_artists.json')

#read csv
def wait_for_internet():
    while True:
        try:
            resp = urllib2.urlopen('http://google.com', timeout=1)
            return
        except:
            pass

def load_artists():
    f = open('artistnames.csv', 'r').readlines()
    for a in f:
        artist = a.strip()
        print(artist)
        url = 'http://api.songkick.com/api/3.0/search/artists.json?query='+artist+'&apikey='
        # wait_for_internet()
        r = requests.get(url)
        resp = r.json()
        # print(resp)
        try:
            if resp['resultsPage']['totalEntries']:
                # print(json.dumps(resp['resultsPage']['results']['artist'], indent=4, sort_keys=True))
                results = resp['resultsPage']['results']['artist']
                for x in results:
                    #   print('rxx')
                    #   print(json.dumps(x, indent=4, sort_keys=True))

                    if x['displayName'] == artist:
                        print(x)
                        db.insert(x)

        except:
            print('cannot fetch url',url)



load_artists()
db.close()

Traceback (most recent call last):
  File "C:\Users\rmlj\Dropbox\songkick\scrapers\Data\Scraper.py", line 45, in <module>
    load_artists()
  File "C:\Users\rmlj\Dropbox\songkick\scrapers\Data\Scraper.py", line 25, in load_artists
    r = requests.get(url)
  File "C:\Python27\lib\site-packages\requests\api.py", line 70, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Python27\lib\site-packages\requests\api.py", line 56, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 474, in request
    prep = self.prepare_request(req)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 407, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "C:\Python27\lib\site-packages\requests\models.py", line 302, in prepare
    self.prepare_url(url, params)
  File "C:\Python27\lib\site-packages\requests\models.py", line 358, in prepare_url
    url = url.decode('utf8')
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 64-65: invalid continuation byte

python

python-2.7

encoding

utf-8

2 Answers

Stack Overflow user

Answered on 2017-05-24 18:15:20

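The problem is in the URL you build. You pass the query string to the requests module as bytes (a regular str on Python 2.x) containing characters that are not UTF-8 encoded. requests in turn tries to decode the URL as UTF-8 into a Unicode string, and that fails.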
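The failure can be reproduced without any network access. The sketch below is only an illustration: the artist name is hypothetical (the asker's actual data is not known), and Latin-1 is an assumed file encoding. It builds the URL as bytes, the way the question's code does on Python 2, and then tries to decode it as UTF-8.

```python
PREFIX = b"http://api.songkick.com/api/3.0/search/artists.json?query="


def utf8_decode_error(raw):
    """Return (reason, start) of the UTF-8 decode failure, or None if raw decodes cleanly."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.reason, err.start
    return None


# Hypothetical artist name, stored the way a Latin-1 encoded CSV would store it.
latin1_name = u"Beyonc\u00e9".encode("latin-1")  # b'Beyonc\xe9'
# The same name as UTF-8 bytes.
utf8_name = u"Beyonc\u00e9".encode("utf-8")      # b'Beyonc\xc3\xa9'

bad_url = PREFIX + latin1_name + b"&apikey="
good_url = PREFIX + utf8_name + b"&apikey="

print(utf8_decode_error(bad_url))   # decoding fails at the Latin-1 byte
print(utf8_decode_error(good_url))  # decodes cleanly
```

The lone byte 0xE9 looks like the start of a multi-byte UTF-8 sequence, but the next byte is a plain `&`, hence "invalid continuation byte". With this particular hypothetical name the bad byte sits at offset 64 of the URL, which is consistent with the position in the traceback, although that does not prove which name caused the original error.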

First, you should let the requests module build your query string and take care of creating the final URL:

url = "http://api.songkick.com/api/3.0/search/artists.json"
r = requests.get(url, params={"query": artist, "apikey": ""})
# etc.
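To see why delegating the query string is safer, here is a rough standard-library approximation of what passing `params` achieves. It is not the code requests runs internally, and `build_search_url` and the `YOUR_API_KEY` placeholder are names made up for this sketch. Each value is percent-encoded from its UTF-8 bytes, so the resulting URL is pure ASCII.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

BASE_URL = "http://api.songkick.com/api/3.0/search/artists.json"


def build_search_url(artist, api_key):
    """Build the search URL from a unicode artist name (hypothetical helper)."""
    params = [
        ("query", artist.encode("utf-8")),  # encode explicitly, then percent-encode
        ("apikey", api_key),
    ]
    return BASE_URL + "?" + urlencode(params)


print(build_search_url(u"Beyonc\u00e9", "YOUR_API_KEY"))
print(build_search_url(u"Simon & Garfunkel", "YOUR_API_KEY"))
```

Besides non-ASCII characters, this also escapes characters such as `&` and `/` that would otherwise change the meaning of a hand-concatenated URL.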

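Second, you should not mix encodings, unless you want to end up in a world of hurt. Unfortunately, on Python 2 the built-in csv module does not work with Unicode, which is probably why you end up with invalid characters. To fix that, you can install unicodecsv and use it as a drop-in replacement (just replace your import csv with import unicodecsv as csv).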

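Update: Wait, on a second look, you are not even using csv. You are reading the file line by line and trying to pass each line as the query. Is that what you intended? If so, the same idea of sticking to one encoding applies: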

import codecs
import requests

URL = "http://api.songkick.com/api/3.0/search/artists.json" # no need to redefine this

with codecs.open("artistnames.csv", "r", "utf-8") as f:
    for a in f:
        artist = a.strip()
        r = requests.get(URL, params={"query": artist, "apikey": ""})
        # etc.
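A small self-contained sketch of the same idea, decoding at the file boundary so that everything downstream is unicode. It uses `io.open`, which behaves the same on Python 2 and 3. The helper name `read_artists` and the demo file contents are made up for illustration, and UTF-8 is an assumption about the file; a file exported from a spreadsheet on Windows may well use a different encoding such as cp1252.

```python
import io
import os
import tempfile


def read_artists(path, encoding="utf-8"):
    """Return the non-empty lines of a text file as stripped unicode strings."""
    with io.open(path, "r", encoding=encoding) as f:
        return [line.strip() for line in f if line.strip()]


# Demo: write a small UTF-8 file with hypothetical contents and read it back.
fd, demo_path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
try:
    with io.open(demo_path, "w", encoding="utf-8") as f:
        f.write(u"Mot\u00f6rhead\n\nBeyonc\u00e9\n")
    artists = read_artists(demo_path)
finally:
    os.remove(demo_path)

print(artists)
```

If the declared encoding does not match the file, the `UnicodeDecodeError` is raised while reading, which points at the real problem (the input file) rather than surfacing later inside requests.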

Votes: 0

Stack Overflow user

Answered on 2017-05-24 19:06:24

You should use unicode wherever possible. requests should convert any non-ASCII characters in the URL to the correct encoding.

>>> import requests
>>> requests.get(u'http://Motörhead.com/?q=Motörhead').url
u'http://xn--motrhead-p4a.com/?q=Mot%C3%B6rhead'

As you can see, the domain name is encoded as punycode, and the query string uses percent-encoding.
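Both transformations can be reproduced offline with the standard library, which may help make them less mysterious. This is a sketch of the two encodings themselves, not of how requests implements them internally.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote        # Python 2

host = u"Mot\u00f6rhead.com"
query_value = u"Mot\u00f6rhead"

# IDNA ("punycode") form of the host name, via the built-in "idna" codec.
ascii_host = host.encode("idna").decode("ascii")

# Percent-encoding of the UTF-8 bytes of the query value.
encoded_value = quote(query_value.encode("utf-8"))

print(ascii_host)
print(encoded_value)
```

The two printed values correspond to the host and query parts of the URL that requests returned in the session above.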

As long as artist is a valid unicode string, this should work:

url = u'http://api.songkick.com/api/3.0/search/artists.json?query='+artist

If artist is a byte string, you must decode it to unicode using the correct encoding, which depends on how the original input file was encoded.

artist = artist.decode('SHIFT-JIS')
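When the file's encoding is not known for certain, one pragmatic option is to try a short list of candidates in order. The following is a heuristic sketch only: `decode_artist` is a hypothetical helper, and the candidate list (`utf-8`, then `cp1252`) is an assumption that should be replaced with whatever encodings are plausible for your data.

```python
def decode_artist(raw, encodings=("utf-8", "cp1252")):
    """Decode raw bytes with the first encoding that succeeds; return (text, encoding)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of %r could decode %r" % (encodings, raw))


# The same hypothetical name, stored in two different encodings.
print(decode_artist(u"Beyonc\u00e9".encode("utf-8")))
print(decode_artist(u"Beyonc\u00e9".encode("cp1252")))
```

UTF-8 goes first because it is strict and rejects most byte sequences that were written in another encoding. Single-byte codecs accept almost any input, so a successful decode with one of them does not prove the guess was right; knowing the real encoding of the input file remains the reliable fix.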

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/44154432
