Scrapy not extracting data even though the CSS selectors are correct

Stack Overflow user
Asked on 2017-10-26 21:54:08

This is my first scraper and I'm having a bit of trouble. I built my CSS selectors first, and they work when I test them with Scrapy. But when I run my spider, it only returns the following:

2017-10-26 14:48:49 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: digikey)
2017-10-26 14:48:49 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'digikey', 'CONCURRENT_REQUESTS': 1, 'NEWSPIDER_MODULE': 'digikey.spiders', 'SPIDER_MODULES': ['digikey.spiders'], 'USER_AGENT': 'digikey ("Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.02")'}
2017-10-26 14:48:49 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2017-10-26 14:48:50 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-10-26 14:48:50 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-10-26 14:48:50 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-10-26 14:48:50 [scrapy.core.engine] INFO: Spider opened
2017-10-26 14:48:50 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-10-26 14:48:50 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-10-26 14:48:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.digikey.com/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3?stock=1> (referer: None)
2017-10-26 14:48:52 [scrapy.core.engine] INFO: Closing spider (finished)
2017-10-26 14:48:52 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 329,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 104631,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 10, 26, 21, 48, 52, 235020),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 10, 26, 21, 48, 50, 249076)}
2017-10-26 14:48:52 [scrapy.core.engine] INFO: Spider closed (finished)
PS C:\Users\dalla_000\digikey>

My spider looks like this:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from digikey.items import DigikeyItem
from scrapy.selector import Selector

class DigikeySpider(CrawlSpider):
    name = 'digikey'
    allowed_domains = ['digikey.com']
    start_urls = ['https://www.digikey.com/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3?stock=1']

rules = (
    # Extract links matching 'category.php' (but not matching 'subsection.php')
    # and follow links from them (since no callback means follow=True by default).
    Rule(LinkExtractor(allow=('/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3?stock=1', ), deny=('subsection\.php', ))),
)
def parse_item(self, response):
    for row in response.css('table#productTable tbody tr'):
        item = DigikeyItem()
        item['partnumber'] = row.css('.tr-mfgPartNumber [itemprop="name"]::text').extract_first()
        item['manufacturer'] =  row.css('[itemprop="manufacture"] [itemprop="name"]::text').extract_first()
        item['description'] = row.css('.tr-description::text').extract_first()
        item['quanity'] = row.css('.tr-qtyAvailable::text').extract_first()
        item['price'] = row.css('.tr-unitPrice::text').extract_first()
        item['minimumquanity'] = row.css('.tr-minQty::text').extract_first()
        yield item

        parse_start_url = parse_item

items.py looks like this:

import scrapy


class DigikeyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    partnumber = scrapy.Field()
    manufacturer = scrapy.Field()
    description = scrapy.Field()
    quanity= scrapy.Field()
    minimumquanity = scrapy.Field()
    price = scrapy.Field()
    pass

Settings:

BOT_NAME = 'digikey'

SPIDER_MODULES = ['digikey.spiders']
NEWSPIDER_MODULE = 'digikey.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'digikey ("Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.02")'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

I'm having a hard time understanding why no data is being extracted when the CSS selectors work. Furthermore, the crawler is finishing its job and closing. I've restricted the spider to crawl only one page; once it's working correctly, I'll open it up to the whole site.
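
For checking selectors in isolation from the crawl rules, the Scrapy shell is useful; a minimal session against the start URL might look like this (illustrative commands, not output from the original post):

scrapy shell "https://www.digikey.com/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3?stock=1"
>>> len(response.css('table#productTable tbody tr'))   # non-zero if the table is present in the raw HTML
>>> response.css('.tr-mfgPartNumber [itemprop="name"]::text').extract_first()

If the selectors return data here but the spider yields nothing, the problem is in how the spider's callbacks are wired up rather than in the selectors themselves.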


2 Answers

Stack Overflow user
Accepted answer
Posted on 2017-10-26 23:07:50

This works for me (do not change the indentation):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from scrapy.selector import Selector

class DigikeyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    partnumber = scrapy.Field()
    manufacturer = scrapy.Field()
    description = scrapy.Field()
    quanity= scrapy.Field()
    minimumquanity = scrapy.Field()
    price = scrapy.Field()

class DigikeySpider(CrawlSpider):
    name = 'digikey'
    allowed_domains = ['digikey.com']
    start_urls = ['https://www.digikey.com/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3?stock=1']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3', )),callback='parse_item'),
    )

    def parse_item(self, response):
        for row in response.css('table#productTable tbody tr'):
            item = DigikeyItem()
            item['partnumber'] = row.css('.tr-mfgPartNumber [itemprop="name"]::text').extract_first()
            item['manufacturer'] =  row.css('[itemprop="manufacture"] [itemprop="name"]::text').extract_first()
            item['description'] = row.css('.tr-description::text').extract_first()
            item['quanity'] = row.css('.tr-qtyAvailable::text').extract_first()
            item['price'] = row.css('.tr-unitPrice::text').extract_first()
            item['minimumquanity'] = row.css('.tr-minQty::text').extract_first()
            yield item

    parse_start_url = parse_item
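
Assuming the standard project layout from the question, the spider above can be run and its items written out with Scrapy's built-in feed export (a usage note, not part of the original answer):

scrapy crawl digikey -o items.csv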

However, if you want to test the rules, change start_urls to:

 start_urls = ['https://www.digikey.com/products/en/capacitors/aluminum-electrolytic-capacitors/58']

and remove parse_start_url = parse_item.
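
One detail worth noting: LinkExtractor's allow argument is interpreted as a regular expression, so the unescaped ? in the question's original pattern ('.../page/3?stock=1') acts as a regex metacharacter (making the preceding 3 optional) rather than a literal query-string separator, which is presumably why the pattern above drops ?stock=1. If the query string had to be matched literally, the ? would need escaping, along these lines (a sketch, not from the original answer):

Rule(LinkExtractor(allow=(r'/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3\?stock=1',)), callback='parse_item'),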


Stack Overflow user

Posted on 2017-10-26 22:09:13

I don't think you need to use CrawlSpider, because that kind of spider was created for navigating a site, like a forum or a blog, to collect links to the different categories or to the actual items (that's what rules is for).

Your rules try to find different URLs to follow, then visit the specific URLs on those pages that match your rule, and then call the method specified there with the responses from those URLs.

But in your case, you want to visit the specific URLs in start_urls, and that's not how CrawlSpider works. You should just use a Spider implementation to get the responses from the start_urls URLs:

class DigikeySpider(scrapy.Spider):
    name = 'digikey'
    allowed_domains = ['digikey.com']
    start_urls = ['https://www.digikey.com/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3?stock=1']

    def parse(self, response):
        ...
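
Filled in with the selectors and item class from the question, that plain Spider might look like this (a sketch combining the snippets above, not verbatim from the answer):

import scrapy
from digikey.items import DigikeyItem


class DigikeySpider(scrapy.Spider):
    name = 'digikey'
    allowed_domains = ['digikey.com']
    start_urls = ['https://www.digikey.com/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3?stock=1']

    def parse(self, response):
        # parse() is the default callback for start_urls responses,
        # so no rules or parse_start_url wiring is needed.
        for row in response.css('table#productTable tbody tr'):
            item = DigikeyItem()
            item['partnumber'] = row.css('.tr-mfgPartNumber [itemprop="name"]::text').extract_first()
            item['manufacturer'] = row.css('[itemprop="manufacture"] [itemprop="name"]::text').extract_first()
            item['description'] = row.css('.tr-description::text').extract_first()
            item['quanity'] = row.css('.tr-qtyAvailable::text').extract_first()
            item['price'] = row.css('.tr-unitPrice::text').extract_first()
            item['minimumquanity'] = row.css('.tr-minQty::text').extract_first()
            yield item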
Original question: https://stackoverflow.com/questions/46964434