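I have built a scraper that reads artist names from a CSV file and collects data about those artists through the Songkick API. However, after running my code for a while, I get the following error:

      File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 64-65: invalid continuation byte

Sample data can be downloaded here.

I'm fairly new to coding, and I'd like to know how to resolve this error. My code is below.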

    import urllib2
    import requests
    import json
    import csv
    from tinydb import TinyDB, Query

    db = TinyDB('spotify_artists.json')

    #read csv
    def wait_for_internet():
        while True:
            try:
                resp = urllib2.urlopen('http://google.com', timeout=1)
                return
            except:
                pass

    def load_artists():
        f = open('artistnames.csv', 'r').readlines();
        for a in f:
            artist = a.strip()
            print(artist)
            url = 'http://api.songkick.com/api/3.0/search/artists.json?query='+artist+'&apikey='
            # wait_for_internet()
            r = requests.get(url)
            resp = r.json()
            # print(resp)
            try :
                if(resp['resultsPage']['totalEntries']):
                    # print(json.dumps(resp['resultsPage']['results']['artist'], indent=4, sort_keys=True))
                    results = resp['resultsPage']['results']['artist'];
                    for x in results:
                        # print('rxx')
                        # print(json.dumps(x, indent=4, sort_keys=True))
                        if(x['displayName'] == artist):
                            print(x)
                            db.insert(x)
            except:
                print('cannot fetch url',url);

    load_artists()
    db.close()

    Traceback (most recent call last):
      File "C:\Users\rmlj\Dropbox\songkick\scrapers\Data\Scraper.py", line 45, in <module>
        load_artists()
      File "C:\Users\rmlj\Dropbox\songkick\scrapers\Data\Scraper.py", line 25, in load_artists
        r = requests.get(url)
      File "C:\Python27\lib\site-packages\requests\api.py", line 70, in get
        return request('get', url, params=params, **kwargs)
      File "C:\Python27\lib\site-packages\requests\api.py", line 56, in request
        return session.request(method=method, url=url, **kwargs)
      File "C:\Python27\lib\site-packages\requests\sessions.py", line 474, in request
        prep = self.prepare_request(req)
      File "C:\Python27\lib\site-packages\requests\sessions.py", line 407, in prepare_request
        hooks=merge_hooks(request.hooks, self.hooks),
      File "C:\Python27\lib\site-packages\requests\models.py", line 302, in prepare
        self.prepare_url(url, params)
      File "C:\Python27\lib\site-packages\requests\models.py", line 358, in prepare_url
        url = url.decode('utf8')
      File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 64-65: invalid continuation byte

Answered 2017-05-24 18:15:20
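The problem is in the URL you build. You pass the query string to the requests module as bytes (a regular str on Python 2.x) containing characters that are not UTF-8 encoded, and requests in turn tries to decode it into a UTF-8 Unicode string and fails.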
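
The failure can be reproduced without the API at all. The following is a minimal sketch, assuming the CSV was saved in a single-byte encoding such as Latin-1; that encoding and the artist name are illustrative assumptions, not something confirmed by the question.

```python
# -*- coding: utf-8 -*-
# Hypothetical artist name, chosen only for illustration.
name = u"Céline Dion"

# Assumption: the file on disk was written as Latin-1, not UTF-8.
raw = name.encode("latin-1")

# Decoding those bytes as UTF-8 is what requests attempts, and it fails.
try:
    raw.decode("utf-8")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print("decode failed: %s" % exc)

# Decoding with the encoding the bytes were actually written in succeeds.
recovered = raw.decode("latin-1")
print(recovered == name)
```

The point is only that the bytes themselves are fine; they become a problem when something assumes they are UTF-8.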
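First, you should let the requests module build your query string and handle creation of the final URL:

    url = "http://api.songkick.com/api/3.0/search/artists.json"
    r = requests.get(url, params={"query": artist, "apikey": ""})
    # etc.

Second, you should not mix encodings unless you want to put yourself in a world of hurt. Unfortunately, the built-in csv module in Python 2 does not work with Unicode, which may be why you end up with invalid characters. To work around that, you can install unicodecsv and use it as a drop-in replacement (just replace your import csv with import unicodecsv as csv).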
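Update: wait, on a second look, you are not even using csv. You are reading the file line by line and passing each line as the query. Is that what you intended? If so, the idea of sticking to one encoding still applies: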

    import codecs

    URL = "http://api.songkick.com/api/3.0/search/artists.json"  # no need to redefine this

    with codecs.open("artistnames.csv", "r", "utf-8") as f:
        for a in f:
            artist = a.strip()
            r = requests.get(URL, params={"query": artist, "apikey": ""})
            # etc.

Answered 2017-05-24 19:06:24
You should use unicode whenever possible. requests should convert any non-ASCII characters in the URL to the proper encoding.

    >>> import requests
    >>> requests.get(u'http://Motörhead.com/?q=Motörhead').url
    u'http://xn--motrhead-p4a.com/?q=Mot%C3%B6rhead'

As you can see, the domain name is encoded as punycode, and the query string uses percent-encoding.
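
The two encodings can also be seen in isolation with the standard library. This is only a sketch of the same transformations, not a description of how requests implements them internally; it uses Python's built-in idna codec for the host name and urllib's quote for the query value.

```python
# -*- coding: utf-8 -*-
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2

# Punycode (IDNA) form of the host name, as ASCII text.
host = u"motörhead.com".encode("idna").decode("ascii")

# Percent-encoding of the UTF-8 bytes of the query value.
query = quote(u"Motörhead".encode("utf-8"))

print(host)
print(query)
```

Both values should match the corresponding parts of the URL in the session above.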
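As long as artist is a valid unicode string, this should work:

    url = u'http://api.songkick.com/api/3.0/search/artists.json?query=' + artist

If artist is a byte string, you must decode it to unicode using the correct encoding, which depends on how the original input file was encoded. For example:

    artist = artist.decode('SHIFT-JIS')

https://stackoverflow.com/questions/44154432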