I wrote the Python code below to scrape images from the website www.style.com:
import urllib2, urllib, random, threading
from bs4 import BeautifulSoup
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
class Images(threading.Thread):
    def __init__(self, lock, src):
        threading.Thread.__init__(self)
        self.src = src
        self.lock = lock

    def run(self):
        self.lock.acquire()
        urllib.urlretrieve(self.src, './img/' + str(random.choice(range(9999))))
        print self.src + 'get'
        self.lock.release()

def imgGreb():
    lock = threading.Lock()
    site_url = "http://www.style.com"
    html = urllib2.urlopen(site_url).read()
    soup = BeautifulSoup(html)
    img = soup.findAll(['img'])
    for i in img:
        print i.get('src')
        Images(lock, i.get('src')).start()

if __name__ == '__main__':
    imgGreb()

But I got this error:
IOError: [Errno 2] No such file or directory: '/images/homepage-2013-october/header/logo.png'
How can I fix this?
Also, can this recursively find all of the images on the site? I mean the images that don't appear on the homepage.
Thanks!
Posted on 2013-11-03 17:29:37
Some of the src attributes are javascript:void(0);, which will never fetch a page. To avoid that error, I added a try/except. Alternatively, you can be smarter about it and check whether the URL ends in jpg/gif/png. I have done both for you :) and it is worth learning what urllib and BeautifulSoup are doing here. If you really want to challenge yourself, maybe try learning Selenium, which is a more powerful tool. Just try the code below:
import urllib2
from bs4 import BeautifulSoup
import sys
from urllib import urlretrieve
reload(sys)

def imgGreb():
    site_url = "http://www.style.com"
    html = urllib2.urlopen(site_url).read()
    soup = BeautifulSoup(html)
    img = soup.findAll(['img'])
    for i in img:
        try:
            # build the complete URL using the domain and the relative url you scraped
            url = site_url + i.get('src')
            # get the file name
            name = "result_" + url.split('/')[-1]
            # detect whether this is a type of picture you want
            type = name.split('.')[-1]
            if type in ['jpg', 'png', 'gif']:
                # if so, retrieve the picture
                urlretrieve(url, name)
        except:
            pass

if __name__ == '__main__':
    imgGreb()

Source: https://stackoverflow.com/questions/19755512
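A note on the IOError itself: src values like /images/homepage-2013-october/header/logo.png are relative paths, so urlretrieve treats them as local filenames. The answer's string concatenation (site_url + src) handles that case but would break on an already-absolute src. The standard library's urljoin resolves both correctly; a minimal sketch (shown in Python 3, where it lives in urllib.parse; in Python 2 it is urlparse.urljoin, and the cdn.example.com URL is a made-up illustration):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

site_url = "http://www.style.com"

# A relative src (the one from the IOError) is resolved against the base URL.
print(urljoin(site_url, "/images/homepage-2013-october/header/logo.png"))
# -> http://www.style.com/images/homepage-2013-october/header/logo.png

# An already-absolute src (hypothetical CDN host) is left untouched,
# unlike naive string concatenation, which would mangle it.
print(urljoin(site_url, "http://cdn.example.com/banner.png"))
# -> http://cdn.example.com/banner.png
```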
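On the recursion part of the question (images that don't appear on the homepage): neither snippet above follows links beyond the start page. A hedged sketch of a breadth-first same-domain crawler, in Python 3 with only the standard library. The fetch callable is an assumption made here so the traversal can be tested without network access; in practice it would wrap something like urllib.request.urlopen(url).read().decode():

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class _ImgLinkCollector(HTMLParser):
    """Collects <img src> and <a href> values from one HTML page."""
    def __init__(self):
        super().__init__()
        self.imgs, self.links = [], []

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag == "img" and d.get("src"):
            self.imgs.append(d["src"])
        elif tag == "a" and d.get("href"):
            self.links.append(d["href"])

def crawl_images(start_url, fetch, max_pages=50):
    """Breadth-first crawl of same-domain pages; returns absolute image URLs.

    `fetch(url)` is assumed to return the page's HTML as a string.
    """
    domain = urlparse(start_url).netloc
    seen, queue, images = {start_url}, [start_url], set()
    while queue and len(seen) <= max_pages:
        page = queue.pop(0)
        try:
            html = fetch(page)
        except OSError:
            continue  # skip pages that fail to load
        parser = _ImgLinkCollector()
        parser.feed(html)
        # resolve every src against the page it appeared on
        images.update(urljoin(page, src) for src in parser.imgs)
        for href in parser.links:
            url = urljoin(page, href)
            # stay on the same domain and never revisit a page
            if urlparse(url).netloc == domain and url not in seen:
                seen.add(url)
                queue.append(url)
    return images

# Tiny offline demo with a hypothetical two-page site:
pages = {
    "http://example.com/": '<a href="/p2">next</a><img src="/a.png">',
    "http://example.com/p2": '<img src="b.png">',
}
print(sorted(crawl_images("http://example.com/", pages.__getitem__)))
# -> ['http://example.com/a.png', 'http://example.com/b.png']
```

Each returned URL could then be downloaded with urlretrieve as in the answer's code; the max_pages cap keeps the crawl from wandering over a large site.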