I wrote the Python code below to scrape images from the website www.style.com:
import urllib2, urllib, random, threading
from bs4 import BeautifulSoup
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
class Images(threading.Thread):
    def __init__(self, lock, src):
        threading.Thread.__init__(self)
        self.src = src
        self.lock = lock

    def run(self):
        self.lock.acquire()
        urllib.urlretrieve(self.src, './img/' + str(random.choice(range(9999))))
        print self.src + 'get'
        self.lock.release()

def imgGreb():
    lock = threading.Lock()
    site_url = "http://www.style.com"
    html = urllib2.urlopen(site_url).read()
    soup = BeautifulSoup(html)
    img = soup.findAll(['img'])
    for i in img:
        print i.get('src')
        Images(lock, i.get('src')).start()

if __name__ == '__main__':
    imgGreb()

But I got this error:
IOError: [Errno 2] No such file or directory: '/images/homepage-2013-october/header/logo.png'
How can I fix this?
Also, can this recursively find all of the images on the site? I mean the images that don't appear on the homepage.
Thanks!
Posted on 2013-11-03 17:29:37
Some of the src attributes are javascript:void(0);, which will never fetch a page. To avoid that error, I added a try/except. Alternatively, you can be smarter about it and check whether the URL ends in jpg/gif/png. I have done both for you :) and it is worth learning what urllib and BeautifulSoup are doing here. If you really want to challenge yourself, maybe try learning Selenium, which is a more powerful tool. Just try the code below:
import urllib2
from bs4 import BeautifulSoup
import sys
from urllib import urlretrieve
reload(sys)

def imgGreb():
    site_url = "http://www.style.com"
    html = urllib2.urlopen(site_url).read()
    soup = BeautifulSoup(html)
    img = soup.findAll(['img'])
    for i in img:
        try:
            # build the complete URL using the domain and the relative url you scraped
            url = site_url + i.get('src')
            # get the file name
            name = "result_" + url.split('/')[-1]
            # detect whether this is a type of picture you want
            type = name.split('.')[-1]
            if type in ['jpg', 'png', 'gif']:
                # if so, retrieve the picture
                urlretrieve(url, name)
        except:
            pass

if __name__ == '__main__':
    imgGreb()

Source: https://stackoverflow.com/questions/19755512
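A note on the IOError itself: src values like /images/homepage-2013-october/header/logo.png are relative paths, so urlretrieve treats them as local filenames. The answer's string concatenation (site_url + src) handles that case but would break on an already-absolute src. The standard library's urljoin resolves both correctly; a minimal sketch (shown in Python 3, where it lives in urllib.parse; in Python 2 it is urlparse.urljoin, and the cdn.example.com URL is a made-up illustration):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

site_url = "http://www.style.com"

# A relative src (the one from the IOError) is resolved against the base URL.
print(urljoin(site_url, "/images/homepage-2013-october/header/logo.png"))
# -> http://www.style.com/images/homepage-2013-october/header/logo.png

# An already-absolute src (hypothetical CDN host) is left untouched,
# unlike naive string concatenation, which would mangle it.
print(urljoin(site_url, "http://cdn.example.com/banner.png"))
# -> http://cdn.example.com/banner.png
```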
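On the recursion part of the question (images that don't appear on the homepage): neither snippet above follows links beyond the start page. A hedged sketch of a breadth-first same-domain crawler, in Python 3 with only the standard library. The fetch callable is an assumption made here so the traversal can be tested without network access; in practice it would wrap something like urllib.request.urlopen(url).read().decode():

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class _ImgLinkCollector(HTMLParser):
    """Collects <img src> and <a href> values from one HTML page."""
    def __init__(self):
        super().__init__()
        self.imgs, self.links = [], []

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag == "img" and d.get("src"):
            self.imgs.append(d["src"])
        elif tag == "a" and d.get("href"):
            self.links.append(d["href"])

def crawl_images(start_url, fetch, max_pages=50):
    """Breadth-first crawl of same-domain pages; returns absolute image URLs.

    `fetch(url)` is assumed to return the page's HTML as a string.
    """
    domain = urlparse(start_url).netloc
    seen, queue, images = {start_url}, [start_url], set()
    while queue and len(seen) <= max_pages:
        page = queue.pop(0)
        try:
            html = fetch(page)
        except OSError:
            continue  # skip pages that fail to load
        parser = _ImgLinkCollector()
        parser.feed(html)
        # resolve every src against the page it appeared on
        images.update(urljoin(page, src) for src in parser.imgs)
        for href in parser.links:
            url = urljoin(page, href)
            # stay on the same domain and never revisit a page
            if urlparse(url).netloc == domain and url not in seen:
                seen.add(url)
                queue.append(url)
    return images

# Tiny offline demo with a hypothetical two-page site:
pages = {
    "http://example.com/": '<a href="/p2">next</a><img src="/a.png">',
    "http://example.com/p2": '<img src="b.png">',
}
print(sorted(crawl_images("http://example.com/", pages.__getitem__)))
# -> ['http://example.com/a.png', 'http://example.com/b.png']
```

Each returned URL could then be downloaded with urlretrieve as in the answer's code; the max_pages cap keeps the crawl from wandering over a large site.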