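The code below extracts the links from a web page and displays them in the browser. This works well for many UTF-8 encoded pages. However, the French Wikipedia page http://fr.wikipedia.org/wiki/États_unis, for example, produces an error.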
# -*- coding: utf-8 -*-
print 'Content-Type: text/html; charset=utf-8\n'
print '''<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<title>Show Links</title>
</head>
<body>'''
import urllib2, lxml.html as lh

def load_page(url):
    headers = {'User-Agent' : 'Mozilla/5.0 (compatible; testbot/0.1)'}
    try:
        req = urllib2.Request(url, None, headers)
        response = urllib2.urlopen(req)
        page = response.read()
        return page
    except:
        print '<b>Couldn\'t load:', url, '</b><br>'
        return None

def show_links(page):
    tree = lh.fromstring(page)
    for node in tree.xpath('//a'):
        if 'href' in node.attrib:
            url = node.attrib['href']
            if '#' in url:
                url=url.split('#')[0]
            if '@' not in url and 'javascript' not in url:
                if node.text:
                    linktext = node.text
                else:
                    linktext = '-'
                print '<a href="%s">%s</a><br>' % (url, linktext.encode('utf-8'))

page = load_page('http://fr.wikipedia.org/wiki/%C3%89tats_unis')
show_links(page)
print '''
</body>
</html>
'''
I get the following error:
Traceback (most recent call last):
  File "C:\***\question.py", line 42, in <module>
    show_links(page)
  File "C:\***\question.py", line 39, in show_links
    print '<a href="%s">%s</a><br>' % (url, linktext.encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
My system: Python 2.6 (Windows), lxml 2.3.3, Apache Server (to display the results)
What am I doing wrong?
Posted on 2012-03-23 05:23:24
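You need to encode url as well; it is probably a unicode object, while linktext.encode('utf-8') is a byte string.
The problem looks similar to this: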
>>> "%s%s" % (u"", "€ <-non-ascii char in a bytestring")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
But this works:
>>> "%s%s" % (u"".encode('utf-8'), "€ <-non-ascii char in a bytestring")
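'\xe2\x82\xac <-non-ascii char in a bytestring'
The empty Unicode string forces the whole expression to be converted to Unicode, so Python tries to decode the byte string with the ascii codec. That is why you see a UnicodeDecodeError.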
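Applied to the print line from the question, the same idea might look like the sketch below: every value is turned into a byte string before the line is assembled, so nothing gets promoted to Unicode. The names link_line and as_bytes are hypothetical helpers, not part of the original code, and the sketch is written so that it should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch only: link_line and as_bytes are made-up helper names.

def link_line(url, linktext, encoding='utf-8'):
    """Build one output line from byte strings only."""
    def as_bytes(value):
        # Leave byte strings alone, encode Unicode strings.
        if isinstance(value, bytes):
            return value
        return value.encode(encoding)
    return b'<a href="' + as_bytes(url) + b'">' + as_bytes(linktext) + b'</a><br>'

# A Unicode url and a Unicode link text, as lxml may return them.
line = link_line(u'/wiki/\xc9tats-Unis', u'\xc9tats-Unis')
```

Because both arguments pass through as_bytes, it no longer matters whether a given href comes back as a byte string or as a Unicode string.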
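In general, mixing Unicode and byte strings is a bad idea. It may appear to work, but sooner or later it will break. Convert text to Unicode as soon as you receive it, process it, and then convert it back to bytes for I/O.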
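A minimal sketch of that "decode early, encode late" pattern, assuming UTF-8 input and output. The helper names to_text, format_link and render_links are hypothetical and only illustrate the idea; the code is meant to run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch only: to_text, format_link and render_links are made-up names.

def to_text(value, encoding='utf-8'):
    """Return value as a Unicode string, decoding it if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

def format_link(url, linktext):
    """Build the anchor tag entirely in Unicode; no byte strings are mixed in."""
    return u'<a href="%s">%s</a><br>' % (to_text(url), to_text(linktext))

def render_links(links, encoding='utf-8'):
    """Join the Unicode lines and encode once, at the output boundary."""
    html = u'\n'.join(format_link(url, text) for url, text in links)
    return html.encode(encoding)

links = [
    (u'/wiki/\xc9tats-Unis', u'\xc9tats-Unis'),  # Unicode input
    (b'/wiki/France', b'France'),                # byte-string input
]
output = render_links(links)  # bytes, ready to be written to the response
```

All string handling in the middle happens on Unicode values, and the single encode call in render_links is the only place where bytes are produced.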
https://stackoverflow.com/questions/9828971