编辑:我从原文开始主要编辑了这篇文章的内容,以指明我的问题:
我正在写一个下载网络漫画的程序,当我下载漫画的一个页面时,我得到了这个奇怪的错误。我正在运行的代码基本上可以归结为下面一行代码,后面跟着错误。我不知道是什么导致了这个错误,这让我非常困惑。
>>> urllib.request.urlopen("http://abominable.cc/post/47699281401")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.4/urllib/request.py", line 161, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.4/urllib/request.py", line 470, in open
response = meth(req, response)
File "/usr/lib/python3.4/urllib/request.py", line 580, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python3.4/urllib/request.py", line 502, in error
result = self._call_chain(*args)
File "/usr/lib/python3.4/urllib/request.py", line 442, in _call_chain
result = func(*args)
File "/usr/lib/python3.4/urllib/request.py", line 685, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "/usr/lib/python3.4/urllib/request.py", line 464, in open
response = self._open(req, data)
File "/usr/lib/python3.4/urllib/request.py", line 482, in _open
'_open', req)
File "/usr/lib/python3.4/urllib/request.py", line 442, in _call_chain
result = func(*args)
File "/usr/lib/python3.4/urllib/request.py", line 1211, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "/usr/lib/python3.4/urllib/request.py", line 1183, in do_open
h.request(req.get_method(), req.selector, req.data, headers)
File "/usr/lib/python3.4/http/client.py", line 1137, in request
self._send_request(method, url, body, headers)
File "/usr/lib/python3.4/http/client.py", line 1172, in _send_request
self.putrequest(method, url, **skips)
File "/usr/lib/python3.4/http/client.py", line 1014, in putrequest
self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 37-38: ordinal not in range(128)我的整个程序可以在这里找到:https://github.com/nstephenh/pycomic
发布于 2016-06-02 03:21:52
我也有同样的问题。根本原因是远程服务器没有遵守规则。HTTP报头应该是US-ASCII码,但显然主要的Headers (apache2,nginx)并不关心,直接发送UTF8编码的字符串。
然而,在http.client中,parse_header函数以iso-8859的形式获取报头,而urllib中的默认HTTPRedirectHandler并不关心引用位置或URI报头,从而导致上述错误。
我可以通过覆盖默认的HTTPRedirectHandler并添加三行代码来对抗latin1解码并添加路径引用,从而“解决”这两个问题:
import urllib.request
from urllib.error import HTTPError
from urllib.parse import (
urlparse, quote, urljoin, urlunparse)
class UniRedirectHandler(urllib.request.HTTPRedirectHandler):
# Implementation note: To avoid the server sending us into an
# infinite loop, the request object needs to track what URLs we
# have already seen. Do this by adding a handler-specific
# attribute to the Request object.
def http_error_302(self, req, fp, code, msg, headers):
# Some servers (incorrectly) return multiple Location headers
# (so probably same goes for URI). Use first header.
if "location" in headers:
newurl = headers["location"]
elif "uri" in headers:
newurl = headers["uri"]
else:
return
# fix a possible malformed URL
urlparts = urlparse(newurl)
# For security reasons we don't allow redirection to anything other
# than http, https or ftp.
if urlparts.scheme not in ('http', 'https', 'ftp', ''):
raise HTTPError(
newurl, code,
"%s - Redirection to url '%s' is not allowed" % (msg, newurl),
headers, fp)
if not urlparts.path:
urlparts = list(urlparts)
urlparts[2] = "/"
else:
urlparts = list(urlparts)
# Header should only contain US-ASCII chars, but some servers do send unicode data
# that should be quoted back before reused
# Need to re-encode the string as iso-8859-1 before use of ""quote"" to cancel the effet of parse_header() in http/client.py
urlparts[2] = quote(urlparts[2].encode('iso-8859-1'))
newurl = urlunparse(urlparts)
newurl = urljoin(req.full_url, newurl)
# XXX Probably want to forget about the state of the current
# request, although that might interact poorly with other
# handlers that also use handler-specific request attributes
new = self.redirect_request(req, fp, code, msg, headers, newurl)
if new is None:
return
# loop detection
# .redirect_dict has a key url if url was previously visited.
if hasattr(req, 'redirect_dict'):
visited = new.redirect_dict = req.redirect_dict
if (visited.get(newurl, 0) >= self.max_repeats or
len(visited) >= self.max_redirections):
raise HTTPError(req.full_url, code,
self.inf_msg + msg, headers, fp)
else:
visited = new.redirect_dict = req.redirect_dict = {}
visited[newurl] = visited.get(newurl, 0) + 1
# Don't close the fp until we are sure that we won't use it
# with HTTPError.
fp.read()
fp.close()
return self.parent.open(new, timeout=req.timeout)
http_error_301 = http_error_303 = http_error_307 = http_error_302
[...]
# Change default Redirect Handler in urllib, should be done once at the beginning of the program
opener = urllib.request.build_opener(UniRedirectHandler())
urllib.request.install_opener(opener)这是python3代码,但如果需要的话,应该可以很容易地适应python2。
https://stackoverflow.com/questions/33370509
复制相似问题