
Scrapy CrawlSpider doesn't crawl

Stack Overflow user
Asked on 2018-07-24 21:36:46
1 answer · 326 views · 0 followers · 0 votes

I have read a lot about Scrapy here and on other sites, but I can't solve this problem, so I'm asking you :P I hope someone can help me.

I want to authenticate (log in) on the main client-area page, then parse all the categories, then parse all the products, and save each product's title, category, quantity and price.

My code:

Code language: python
# -*- coding: utf-8 -*-

import scrapy
from scrapy.item import Item, Field
from scrapy.spiders import CrawlSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader.processors import Join
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
import logging

class article(Item):
    category = Field()
    title = Field()
    quantity = Field()
    price = Field()

class combatzone_spider(CrawlSpider):
    name = 'combatzone_spider'
    allowed_domains = ['www.combatzone.es']
    start_urls = ['http://www.combatzone.es/areadeclientes/']

    rules = (
        Rule(LinkExtractor(allow=r'/category.php?id=\d+'),follow=True),
        Rule(LinkExtractor(allow=r'&page=\d+'),follow=True),
        Rule(LinkExtractor(allow=r'goods.php?id=\d+'),follow=True,callback='parse_items'),
    )

def init_request(self):
    logging.info("You are in initRequest")
    return Request(url=self,callback=self.login)

def login(self,response):
    logging.info("You are in login")
    return scrapy.FormRequest.from_response(response,formname='ECS_LOGINFORM',formdata={'username':'XXXX','password':'YYYY'},callback=self.check_login_response)

def check_login_response(self,response):
    logging.info("You are in checkLogin")
    if "Hola,XXXX" in response.body:
        self.log("Succesfully logged in.")
        return self.initialized()
    else:
        self.log("Something wrong in login.")

def parse_items(self,response):
    logging.info("You are in item")
    item = scrapy.loader.ItemLoader(article(),response)
    item.add_xpath('category','/html/body/div[3]/div[2]/div[2]/a[2]/text()')
    item.add_xpath('title','/html/body/div[3]/div[2]/div[2]/div/div[2]/h1/text()')
    item.add_xpath('quantity','//*[@id="ECS_FORMBUY"]/div[1]/ul/li[2]/font/text()')
    item.add_xpath('price','//*[@id="ECS_RANKPRICE_2"]/text()')
    yield item.load_item()

When I run the spider from the terminal, I get this:

pi@raspberry:~/SCRAPY/combatzone/combatzone/spiders $ scrapy crawl combatzone_spider
/home/pi/SCRAPY/combatzone/combatzone/spiders/combatzone_spider.py:9: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders` is deprecated, use `scrapy.spiders` instead
  from scrapy.contrib.spiders.init import InitSpider
/home/pi/SCRAPY/combatzone/combatzone/spiders/combatzone_spider.py:9: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders.init` is deprecated, use `scrapy.spiders.init` instead
  from scrapy.contrib.spiders.init import InitSpider
2018-07-24 22:14:53 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: combatzone)
2018-07-24 22:14:53 [scrapy.utils.log] INFO: Versions: lxml 4.2.3.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 2.7.13 (default, Nov 24 2017, 17:33:09) - [GCC 6.3.0 20170516], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h  27 Mar 2018), cryptography 2.3, Platform Linux-4.9.0-6-686-with-debian-9.5
2018-07-24 22:14:53 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'combatzone.spiders', 'SPIDER_MODULES': ['combatzone.spiders'], 'LOG_LEVEL': 'INFO', 'BOT_NAME': 'combatzone'}
2018-07-24 22:14:53 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.corestats.CoreStats']
2018-07-24 22:14:53 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-07-24 22:14:53 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-07-24 22:14:53 [scrapy.middleware] INFO: Enabled item pipelines: []
2018-07-24 22:14:53 [scrapy.core.engine] INFO: Spider opened
2018-07-24 22:14:53 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-24 22:14:54 [scrapy.core.engine] INFO: Closing spider (finished)
2018-07-24 22:14:54 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 231,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 7152,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 7, 24, 21, 14, 54, 410938),
 'log_count/INFO': 7,
 'memusage/max': 36139008,
 'memusage/startup': 36139008,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 7, 24, 21, 14, 53, 998619)}
2018-07-24 22:14:54 [scrapy.core.engine] INFO: Spider closed (finished)

The spider doesn't seem to do anything. Any idea why this happens? Thanks a lot, friends :D


1 Answer

Stack Overflow user

Answered on 2018-07-25 02:55:50

There are two problems:

  • First, the regular expressions: you should escape the "?". For example, /category.php?id=\d+ should be /category.php\?id=\d+ (note the "\?").
  • Second, you should indent all the methods; otherwise they are not inside the class combatzone_spider and will never be found.
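Both points can be verified with plain Python, independent of Scrapy. The snippet below is an illustrative sketch (the URL and the class are made up for the demonstration): an unescaped "?" is treated as a regex quantifier, so the pattern never matches the real URL, and a def that is not indented under the class never becomes a method of it.

```python
import re

# 1) In a regex, "?" means "zero or one of the preceding character".
#    So r'/category.php?id=\d+' makes the second "p" optional and
#    expects "id=" to follow immediately -- it cannot match a URL
#    that contains a literal "?".
unescaped = re.compile(r'/category.php?id=\d+')
escaped = re.compile(r'/category\.php\?id=\d+')

url = '/category.php?id=42'
print(bool(unescaped.search(url)))                  # False: "?" acted as a quantifier
print(bool(unescaped.search('/category.phid=42')))  # True: the optional "p" was dropped
print(bool(escaped.search(url)))                    # True: literal "?" matched

# 2) A def outside the class body is a module-level function, not a
#    method, so CrawlSpider could never find a callback defined there.
class Spider:
    def parse(self):
        return 'parsed'

def parse_items(self):   # note: NOT indented under Spider
    return 'items'

print(hasattr(Spider, 'parse'))        # True
print(hasattr(Spider, 'parse_items'))  # False
```

This is exactly why the log above shows "Crawled 0 pages": the allow patterns with an unescaped "?" matched none of the site's links.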

As for the login, I tried to make your code work but failed. What I usually do is override start_requests to log in before crawling.

Here is the code:

Code language: python
# -*- coding: utf-8 -*-

import scrapy
from scrapy.item import Item, Field
from scrapy.spiders import CrawlSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader.processors import Join
from scrapy.http import Request, FormRequest
import logging

class article(Item):
    category = Field()
    title = Field()
    quantity = Field()
    price = Field()

class CombatZoneSpider(CrawlSpider):
    name = 'CombatZoneSpider'
    allowed_domains = ['www.combatzone.es']
    start_urls = ['http://www.combatzone.es/areadeclientes/']

    rules = (
        # escape "?"
        Rule(LinkExtractor(allow=r'category.php\?id=\d+'),follow=False),
        Rule(LinkExtractor(allow=r'&page=\d+'),follow=False),
        Rule(LinkExtractor(allow=r'goods.php\?id=\d+'),follow=False,callback='parse_items'),
    )

    def parse_items(self,response):
        logging.info("You are in item")

        # This is used to print the results
        selector = scrapy.Selector(response=response)
        res = selector.xpath("/html/body/div[3]/div[2]/div[2]/div/div[2]/h1/text()").extract()
        self.logger.info(res)

        # item = scrapy.loader.ItemLoader(article(),response)
        # item.add_xpath('category','/html/body/div[3]/div[2]/div[2]/a[2]/text()')
        # item.add_xpath('title','/html/body/div[3]/div[2]/div[2]/div/div[2]/h1/text()')
        # item.add_xpath('quantity','//*[@id="ECS_FORMBUY"]/div[1]/ul/li[2]/font/text()')
        # item.add_xpath('price','//*[@id="ECS_RANKPRICE_2"]/text()')
        # yield item.load_item()

    # login part
    # I didn't test if it can login because I have no accounts, but they will print something in console.

    def start_requests(self):
        logging.info("You are in initRequest")
        return [scrapy.Request(url="http://www.combatzone.es/areadeclientes/user.php",callback=self.login)]

    def login(self,response):
        logging.info("You are in login")

        # generate the start_urls again:
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

        # yield scrapy.FormRequest.from_response(response,formname='ECS_LOGINFORM',formdata={'username':'XXXX','password':'YYYY'},callback=self.check_login_response)

    # def check_login_response(self,response):
    #     logging.info("You are in checkLogin")
    #     if "Hola,XXXX" in response.body:
    #         self.log("Succesfully logged in.")
    #         return self.initialized()
    #     else:
    #         self.log("Something wrong in login.")
Votes: 1
The original page content is from Stack Overflow (Chinese translation provided by Tencent Cloud's IT-domain engine).
Original link: https://stackoverflow.com/questions/51507882
