I followed an example that uses Scrapy to scrape images, but no files are saved on my computer.
This is the code I used:
//Items.py//
import scrapy

class ImgurItem(scrapy.Item):
    title = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()

//settings.py//
BOT_NAME = 'imgur'
SPIDER_MODULES = ['imgur.spiders']
NEWSPIDER_MODULE = 'imgur.spiders'
ITEM_PIPELINES = {'scrapy.contrib.pipeline.images.ImagesPipeline': 1}
IMAGES_STORE = '/home/ubuntu/imgurFront/'

//imgur_spider.py//
import scrapy
from scrapy.contrib.spiders import Rule, CrawlSpider
from scrapy.contrib.linkextractors import LinkExtractor
from imgur.items import ImgurItem

class ImgurSpider(CrawlSpider):
    name = 'imgur'
    allowed_domains = ['imgur.com']
    start_urls = ['http://www.imgur.com']
    rules = [Rule(LinkExtractor(allow=['/gallery/.*']), 'parse_imgur')]

    def parse_imgur(self, response):
        image = ImgurItem()
        image['title'] = response.xpath(\
            "//h2[@id='image-title']/text()").extract()
        rel = response.xpath("//img/@src").extract()
        image['image_urls'] = ['http:'+rel[0]]
        return image

This is the kind of response I get:
{'image_urls': [u'http:data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7'],
'images': [],
'title': []}

Here are the errors I get:
[scrapy] ERROR: File (unknown-error): Error processing file from <GET http://i.imgur.com/BGVbmqM.jpg> referred in <None>
DEBUG: Retrying <GET http:howard-funk.jpg> (failed 1 times): Connection was refused by other side: 111: Connection refused
DEBUG: Scraped from <200

Posted on 2016-01-18 09:53:35
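Which version of Scrapy are you using? Make sure you have write permission on the folder.
As a last resort, you can create a custom pipeline (http://doc.scrapy.org/en/latest/topics/media-pipeline.html#custom-images-pipeline-example) and catch some of the errors there.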
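
To rule out the permission problem mentioned above, here is a minimal sketch of a check you could run before starting the crawl. The helper name is_writable_dir is made up for illustration; it only uses the Python standard library and is not part of Scrapy.

```python
import os
import tempfile


def is_writable_dir(path):
    """Return True if `path` is an existing directory this process can write to."""
    if not os.path.isdir(path):
        return False
    try:
        # Actually creating a temporary file (deleted on close) tests the
        # real permissions instead of relying on mode bits alone.
        with tempfile.TemporaryFile(dir=path):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # IMAGES_STORE value copied from the settings.py in the question.
    print(is_writable_dir('/home/ubuntu/imgurFront/'))
```

If this prints False, the images pipeline has nowhere to store the downloaded files, whatever the spider returns.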
Posted on 2016-01-18 10:08:07
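It looks like the website is blocking bot connections. Try imitating a real browser's HTTP User-Agent (search Google for RandomUserAgentMiddleware) and/or use Tor or a proxy with Scrapy (HTTP_PROXY in settings.py).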
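
A rough sketch of the User-Agent idea, in case it helps. This is not the published RandomUserAgentMiddleware package, just a hand-written downloader middleware with the same name, and the strings in USER_AGENTS are placeholders, not real browser identifiers. It relies only on Scrapy's process_request(request, spider) hook.

```python
import random

# Placeholder strings (assumption): replace with real, current browser
# User-Agent values before using this.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware sketch: pick a random User-Agent per request."""

    def __init__(self, user_agents=USER_AGENTS):
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        # Scrapy calls this hook for every outgoing request. Returning None
        # tells it to continue handling the request as usual.
        request.headers['User-Agent'] = random.choice(self.user_agents)
        return None
```

To try it, the class would have to be registered under DOWNLOADER_MIDDLEWARES in settings.py, using whatever module path you put it in (for example a hypothetical imgur.middlewares module).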
Posted on 2016-01-18 21:45:43
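There are two problems here:
1. Build the image URL with urljoin:
   image['image_urls'] = [response.urljoin(rel[0])]
2. Skip values with the data:image prefix, or handle them differently (that value is the image file content itself, so you don't need to download it).

https://stackoverflow.com/questions/34843580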
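
The two fixes from the last answer can be sketched together with the standard library alone. The function name downloadable_image_urls and its parameters are invented for this illustration; inside the spider, response.urljoin(src) would take the place of urljoin(page_url, src).

```python
from urllib.parse import urljoin


def downloadable_image_urls(page_url, srcs):
    """Turn raw <img src> values into absolute URLs, skipping inline data."""
    urls = []
    for src in srcs:
        src = src.strip()
        # data: URIs embed the image bytes directly, so there is
        # nothing for the images pipeline to download.
        if not src or src.lower().startswith('data:'):
            continue
        # urljoin resolves relative, protocol-relative (//host/x.jpg)
        # and already-absolute values against the page URL.
        urls.append(urljoin(page_url, src))
    return urls
```

In parse_imgur, the result of a function like this would be assigned to image['image_urls'] instead of the hand-built 'http:' + rel[0] string, which is what produced the http:data:image/... and http:howard-funk.jpg requests shown in the question.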