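I've read a lot about Scrapy here and on other sites, but I can't solve this problem, so I'm asking you :P I hope someone can help me.
I want to authenticate a login on the main client page, then parse all the categories, then all the products, and save each product's title, category, quantity and price.
My code: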
# -*- coding: utf-8 -*-
import scrapy
from scrapy.item import Item, Field
from scrapy.spiders import CrawlSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader.processors import Join
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
import logging

class article(Item):
    category = Field()
    title = Field()
    quantity = Field()
    price = Field()

class combatzone_spider(CrawlSpider):
    name = 'combatzone_spider'
    allowed_domains = ['www.combatzone.es']
    start_urls = ['http://www.combatzone.es/areadeclientes/']

    rules = (
        Rule(LinkExtractor(allow=r'/category.php?id=\d+'),follow=True),
        Rule(LinkExtractor(allow=r'&page=\d+'),follow=True),
        Rule(LinkExtractor(allow=r'goods.php?id=\d+'),follow=True,callback='parse_items'),
    )

    def init_request(self):
        logging.info("You are in initRequest")
        return Request(url=self,callback=self.login)

    def login(self,response):
        logging.info("You are in login")
        return scrapy.FormRequest.from_response(response,formname='ECS_LOGINFORM',formdata={'username':'XXXX','password':'YYYY'},callback=self.check_login_response)

    def check_login_response(self,response):
        logging.info("You are in checkLogin")
        if "Hola,XXXX" in response.body:
            self.log("Succesfully logged in.")
            return self.initialized()
        else:
            self.log("Something wrong in login.")

    def parse_items(self,response):
        logging.info("You are in item")
        item = scrapy.loader.ItemLoader(article(),response)
        item.add_xpath('category','/html/body/div[3]/div[2]/div[2]/a[2]/text()')
        item.add_xpath('title','/html/body/div[3]/div[2]/div[2]/div/div[2]/h1/text()')
        item.add_xpath('quantity','//*[@id="ECS_FORMBUY"]/div[1]/ul/li[2]/font/text()')
        item.add_xpath('price','//*[@id="ECS_RANKPRICE_2"]/text()')
        yield item.load_item()

When I run scrapy crawl on the spider in the terminal, I get this output:
pi@raspberry:~/SCRAPY/combatzone/combatzone/spiders $ scrapy crawl combatzone_spider
/home/pi/SCRAPY/combatzone/combatzone/spiders/combatzone_spider.py:9: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders` is deprecated, use `scrapy.spiders` instead
  from scrapy.contrib.spiders.init import InitSpider
/home/pi/SCRAPY/combatzone/combatzone/spiders/combatzone_spider.py:9: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders.init` is deprecated, use `scrapy.spiders.init` instead
  from scrapy.contrib.spiders.init import InitSpider
2018-07-24 22:14:53 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: combatzone)
2018-07-24 22:14:53 [scrapy.utils.log] INFO: Versions: lxml 4.2.3.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 2.7.13 (default, Nov 24 2017, 17:33:09) - [GCC 6.3.0 20170516], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h  27 Mar 2018), cryptography 2.3, Platform Linux-4.9.0-6-686-with-debian-9.5
2018-07-24 22:14:53 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'combatzone.spiders', 'SPIDER_MODULES': ['combatzone.spiders'], 'LOG_LEVEL': 'INFO', 'BOT_NAME': 'combatzone'}
2018-07-24 22:14:53 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-07-24 22:14:53 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-07-24 22:14:53 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-07-24 22:14:53 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-07-24 22:14:53 [scrapy.core.engine] INFO: Spider opened
2018-07-24 22:14:53 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-24 22:14:54 [scrapy.core.engine] INFO: Closing spider (finished)
2018-07-24 22:14:54 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 231,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 7152,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 7, 24, 21, 14, 54, 410938),
 'log_count/INFO': 7,
 'memusage/max': 36139008,
 'memusage/startup': 36139008,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 7, 24, 21, 14, 53, 998619)}
2018-07-24 22:14:54 [scrapy.core.engine] INFO: Spider closed (finished)
The spider doesn't seem to work. Any idea why this happens? Thanks a lot, everyone :D

Posted on 2018-07-25 02:55:50
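There are two problems:
1. The pattern /category.php?id=\d+ should be /category.php\?id=\d+ (note the "\?"). Unescaped, "?" is a regex quantifier that makes the preceding "p" optional, so the pattern never matches the literal "?" in the URL. The goods.php rule needs the same fix.
2. As for the login, I tried to make your code work but failed. I usually override start_requests to log in before crawling.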
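The first point can be checked in isolation with Python's re module, which uses the same regex syntax as the allow patterns. This is only a small sketch: the sample URL and its id value are made up for illustration.

```python
import re

# Sample URL shaped like the category links the spider should follow
# (the id value 12 is an assumption, not taken from the site).
url = "http://www.combatzone.es/areadeclientes/category.php?id=12"

unescaped = r"/category.php?id=\d+"  # "?" makes the preceding "p" optional
escaped = r"/category.php\?id=\d+"   # "\?" matches a literal question mark

# The unescaped pattern expects "id=" right after "ph" or "php",
# so it cannot match a URL that contains "?".
print(re.search(unescaped, url))

# The escaped pattern matches the query string as intended.
print(re.search(escaped, url).group(0))
```

With the unescaped pattern no link on the page matches the rule, which would explain why the crawl stops after the start URL.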
Here is the code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.item import Item, Field
from scrapy.spiders import CrawlSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader.processors import Join
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
import logging

class article(Item):
    category = Field()
    title = Field()
    quantity = Field()
    price = Field()

class CombatZoneSpider(CrawlSpider):
    name = 'CombatZoneSpider'
    allowed_domains = ['www.combatzone.es']
    start_urls = ['http://www.combatzone.es/areadeclientes/']

    rules = (
        # escape "?"
        Rule(LinkExtractor(allow=r'category.php\?id=\d+'),follow=False),
        Rule(LinkExtractor(allow=r'&page=\d+'),follow=False),
        Rule(LinkExtractor(allow=r'goods.php\?id=\d+'),follow=False,callback='parse_items'),
    )

    def parse_items(self,response):
        logging.info("You are in item")

        # This is used to print the results
        selector = scrapy.Selector(response=response)
        res = selector.xpath("/html/body/div[3]/div[2]/div[2]/div/div[2]/h1/text()").extract()
        self.logger.info(res)

        # item = scrapy.loader.ItemLoader(article(),response)
        # item.add_xpath('category','/html/body/div[3]/div[2]/div[2]/a[2]/text()')
        # item.add_xpath('title','/html/body/div[3]/div[2]/div[2]/div/div[2]/h1/text()')
        # item.add_xpath('quantity','//*[@id="ECS_FORMBUY"]/div[1]/ul/li[2]/font/text()')
        # item.add_xpath('price','//*[@id="ECS_RANKPRICE_2"]/text()')
        # yield item.load_item()

    # login part
    # I didn't test if it can login because I have no accounts, but they will print something in console.

    def start_requests(self):
        logging.info("You are in initRequest")
        return [scrapy.Request(url="http://www.combatzone.es/areadeclientes/user.php",callback=self.login)]

    def login(self,response):
        logging.info("You are in login")

        # generate the start_urls again:
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

        # yield scrapy.FormRequest.from_response(response,formname='ECS_LOGINFORM',formdata={'username':'XXXX','password':'YYYY'},callback=self.check_login_response)

    # def check_login_response(self,response):
    #     logging.info("You are in checkLogin")
    #     if "Hola,XXXX" in response.body:
    #         self.log("Succesfully logged in.")
    #         return self.initialized()
    #     else:
    #         self.log("Something wrong in login.")

https://stackoverflow.com/questions/51507882