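I wrote some code that extracts URLs from a web page. The problem is that the extracted URLs do not look the way they do on the page: if a URL contains French characters, it does not come out in its original form. How can I fix this?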

    import requests
    from bs4 import BeautifulSoup

    for i in range(0,500):
        o=36*i
        r=requests.get('http://www.barneys.com/barneys-new-york/men?start='+str(o)+'&format=page-element&sz=36')
        soup=BeautifulSoup(r.text)
        links=soup.find_all("a",{"class":"thumb-link"})
        for link in links:
            print link.get('href')

Posted on 2015-07-09 19:33:56
The URLs are percent-encoded. For example, in the HTML the URL may be

    http://www.barneys.com/rick-owens-boucl%C3%A9-scarf-504025220.html

but your browser may display it as

    http://www.barneys.com/rick-owens-bouclé-scarf-504025220.html

To decode a percent-encoded URL, use urllib.unquote in Python 2 or urllib.parse.unquote in Python 3:
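
    >>> import urllib
    >>> print(urllib.unquote('http://www.barneys.com/rick-owens-boucl%C3%A9-scarf-504025220.html'))
    http://www.barneys.com/rick-owens-bouclé-scarf-504025220.html

Applied to the original script:

    import requests
    from bs4 import BeautifulSoup
    try:
        # Python 2
        from urllib import unquote
    except ImportError:
        # Python 3
        from urllib.parse import unquote

    for i in range(0, 500):
        # Each page holds 36 items, so step the start offset by 36.
        o = 36 * i
        r = requests.get('http://www.barneys.com/barneys-new-york/men?start=' + str(o) + '&format=page-element&sz=36')
        # Name the parser explicitly to avoid bs4's "no parser specified" warning.
        soup = BeautifulSoup(r.text, 'html.parser')
        links = soup.find_all("a", {"class": "thumb-link"})
        for link in links:
            # Decode %XX escapes so the URL prints as it appears in the browser.
            print(unquote(link.get('href')))

In Python 3, unquote returns a unicode str. In Python 2, unquote returns a byte str when it is given one, so encode the href to bytes first and then decode the result as UTF-8 to get unicode:

    print(unquote(link.get('href').encode('utf-8')).decode('utf-8'))

https://stackoverflow.com/questions/31326420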
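
To make the relationship between the two forms of the URL concrete, here is a minimal Python 3 sketch of the round trip. It uses only the standard library; the variable names are illustrative and do not come from the original script, and the sample URL is the one quoted in the answer.

```python
from urllib.parse import quote, unquote

# Sample percent-encoded URL taken from the answer above.
encoded_url = "http://www.barneys.com/rick-owens-boucl%C3%A9-scarf-504025220.html"

# unquote() replaces each %XX escape with the byte it names and
# decodes the resulting bytes as UTF-8 (its default encoding).
decoded_url = unquote(encoded_url)

# "é" is the two bytes C3 A9 in UTF-8, which is where %C3%A9 comes from.
e_acute_bytes = "é".encode("utf-8")

# quote() goes the other way; safe=":/" leaves the scheme and path
# separators unescaped so the whole URL can be passed in at once.
reencoded_url = quote(decoded_url, safe=":/")

print(decoded_url)
print(e_acute_bytes)
print(reencoded_url)
```

Because the round trip gets back to the original string, it should be safe to keep the percent-encoded form for making requests and use the decoded form only for display.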