
Q: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) when using Scrapy

Stack Overflow user

Asked 2017-01-03 09:05:24

1 answer · 2.8K views · 0 followers · 0 votes

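I have just started learning Python and Scrapy.

My first project is to scrape information from a website that publishes network security data. But when I run the spider from cmd, it says

Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

and nothing seems to come out. I would be grateful if someone could help me solve this.

Here is my spider file: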

from ssl_abuse.items import SslAbuseItem
import scrapy

class SslAbuseSpider(scrapy.Spider):
    name='ssl_abuse'
    start_urls=['https://sslbl.abuse.ch/']
    def parse(self, response):
        for sel in response.xpath('/table//tr'):
            item=SslAbuseItem()
            item['date']=sel.xpath('/td/text()')[0].extract()
            item['name']=sel.xpath('/td/text()')[2].extract()
            item['type']=sel.xpath('/td/text()')[3].extract()
            yield item

Here is the website I want to crawl:

https://sslbl.abuse.ch/

I want to get every column of that table except the SHA1 fingerprint.

After I changed the code as Will suggested, an error appeared:

`2017-01-04 09:31:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-01-04 09:31:40 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-01-04 09:31:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://sslbl.abuse.ch/robots.txt> (referer: None)
2017-01-04 09:31:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://sslbl.abuse.ch/> (referer: None)
2017-01-04 09:31:53 [scrapy.core.scraper] ERROR: Spider error processing <GET https://sslbl.abuse.ch/> (referer: None)
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "V:\work\ssl_abuse\ssl_abuse\spiders\ssl_abuse_spider.py", line 11, in parse
    item['date']=sel.xpath('/td/text()')[0].extract()
  File "c:\python27\lib\site-packages\parsel\selector.py", line 58, in __getitem__
    o = super(SelectorList, self).__getitem__(pos)
IndexError: list index out of range`

My updated code:

from ssl_abuse.items import SslAbuseItem
import scrapy
class SslAbuseSpider(scrapy.Spider):
    name='ssl_abuse'
    start_urls=['https://sslbl.abuse.ch/']
    def parse(self, response):
        for sel in response.xpath('//table//tr'):
            item=SslAbuseItem()
            item['date']=sel.xpath('/td/text()')[0].extract()
            item['name']=sel.xpath('/td/text()')[2].extract()
            item['type']=sel.xpath('/td/text()')[3].extract()
            yield item

python

scrapy

web-crawler

1 Answer

Stack Overflow user

Accepted answer

Answered 2017-01-03 10:01:38

I ran a quick test in the Scrapy shell. The XPath locators seem to be the problem. response.body looks like this:

...
<table class="sortable">
<tr><th>Listing date (UTC)</th><th>SHA1 fingerprint</th><th>Common Name</th><th>Listing reason</th></tr>
<tr bgcolor="#D8D8D8" onmouseover="this.style.backgroundColor='#3371A3';" onmouseout="this.style.backgroundColor='#D8D8D8';"><td>2016-12-30 07:54:19</td><td><a href="/intel/1d05c6fef14d2671d759a05b496464b831c650e8" target="_parent" title="Show more information about this SSL certificate">1d05c6fef14d2671d759a05b496464b831c650e8</a></td><td>host/emailAddress=web@host</td><td>Gootkit C&amp;C</td></tr>
<tr bgcolor="#ffffff" onmouseover="this.style.backgroundColor='#3371A3';" onmouseout="this.style.backgroundColor='#ffffff';"><td>2016-12-28 10:03:54</td><td><a href="/intel/a82dd258544acf0a109296493421262397741db7" target="_parent" title="Show more information about this SSL certificate">a82dd258544acf0a109296493421262397741db7</a></td><td>google.com/emailAddress=web@google.com</td><td>Gootkit C&amp;C</td></tr>
<tr bgcolor="#D8D8D8" onmouseover="this.style.backgroundColor='#3371A3';" onmouseout="this.style.backgroundColor='#D8D8D8';"><td>2016-12-27 19:19:35</td><td><a href="/intel/df6f665e91d2fe8a338f778ad53c1921fcab3d8f" target="_parent" title="Show more information about this SSL certificate">df6f665e91d2fe8a338f778ad53c1921fcab3d8f</a></td><td>CN=p.fmsacademy.it</td><td>Gozi MITM</td></tr>
...

The first row is the table header; the actual content starts from the second row. For example:

# scrapy shell 'https://sslbl.abuse.ch/'
>>> rows = response.xpath('//table//tr')
>>> head = rows[0]

>>> head.xpath('th/text()').extract()
[u'Listing date (UTC)', u'SHA1 fingerprint', u'Common Name', u'Listing reason']

>>> td1 = rows[1]
>>> td1.xpath('td')
[<Selector xpath='td' data=u'<td>2016-12-30 07:54:19</td>'>, <Selector xpath='td' data=u'<td><a href="/intel/1d05c6fef14d2671d759'>, <Selector xpath='td' data=u'<td>host/emailAddress=web@host</td>'>, <Selector xpath='td' data=u'<td>Gootkit C&amp;C</td>'>]

>>> td1.xpath('td/text()').extract()
[u'2016-12-30 07:54:19', u'host/emailAddress=web@host', u'Gootkit C&C']
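
Note that `td/text()` returned three strings for a row with four cells. The SHA1 cell's text sits inside a nested `<a>` element, so that `td` has no direct text node. The sketch below reproduces the effect with the standard library's `xml.etree.ElementTree` instead of Scrapy's selectors, on a trimmed copy of the rows above (the `SAMPLE` string and variable names are illustrative, not part of the original answer):

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the table from response.body (event-handler attributes dropped).
SAMPLE = """
<table class="sortable">
<tr><th>Listing date (UTC)</th><th>SHA1 fingerprint</th><th>Common Name</th><th>Listing reason</th></tr>
<tr><td>2016-12-30 07:54:19</td><td><a href="/intel/1d05c6fef14d2671d759a05b496464b831c650e8">1d05c6fef14d2671d759a05b496464b831c650e8</a></td><td>host/emailAddress=web@host</td><td>Gootkit C&amp;C</td></tr>
</table>
"""

table = ET.fromstring(SAMPLE.strip())
rows = table.findall("tr")

# The header row holds only <th> cells, so looking up <td> on it finds nothing.
header_cells = rows[0].findall("td")

# Direct text of each <td> in the first data row. The SHA1 cell yields None
# because its text belongs to the nested <a>, not to the <td> itself.
direct_text = [td.text for td in rows[1].findall("td")]

print(header_cells)
print(direct_text)
```

So even with a relative path, indexing the text nodes with `[0]` fails on the header row, and `[3]` fails on every data row, because only three text nodes exist.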

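So the XPath for locating the tr rows should start with `//`, because `/table` only matches a table that is the root element of the document: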

for sel in response.xpath('//table//tr'):

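and the XPath for the td text should be relative to the row, with no leading `/` (a leading `/` restarts the search at the document root):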

item['date']=sel.xpath('td/text()')[0].extract()
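
The line above covers only the `date` field. A fuller fix probably also needs to skip the header row and pick cells by position rather than by text-node index. In the spider that would presumably mean something like `sel.xpath('td[1]/text()')`, `td[3]` and `td[4]`, guarded by a check that the row has `td` cells at all. The sketch below shows the same logic with the standard library's `xml.etree.ElementTree` so it can run without Scrapy; `extract_items` and `SAMPLE` are hypothetical names and the sample is a trimmed copy of the table shown earlier:

```python
import xml.etree.ElementTree as ET


def extract_items(table_xml):
    """Yield one dict per data row, skipping rows without four <td> cells."""
    table = ET.fromstring(table_xml.strip())
    for row in table.iter("tr"):
        cells = row.findall("td")
        if len(cells) < 4:
            # The header row contains <th> cells only, so there is nothing to index.
            continue
        yield {
            "date": cells[0].text,
            # cells[1] is the SHA1 fingerprint, deliberately left out.
            "name": cells[2].text,
            "type": cells[3].text,
        }


# Trimmed copy of the table (event-handler attributes dropped).
SAMPLE = """
<table class="sortable">
<tr><th>Listing date (UTC)</th><th>SHA1 fingerprint</th><th>Common Name</th><th>Listing reason</th></tr>
<tr><td>2016-12-30 07:54:19</td><td><a href="/intel/1d05c6fef14d2671d759a05b496464b831c650e8">1d05c6fef14d2671d759a05b496464b831c650e8</a></td><td>host/emailAddress=web@host</td><td>Gootkit C&amp;C</td></tr>
<tr><td>2016-12-27 19:19:35</td><td><a href="/intel/df6f665e91d2fe8a338f778ad53c1921fcab3d8f">df6f665e91d2fe8a338f778ad53c1921fcab3d8f</a></td><td>CN=p.fmsacademy.it</td><td>Gozi MITM</td></tr>
</table>
"""

items = list(extract_items(SAMPLE))
for item in items:
    print(item)
```

Indexing the cells instead of the text nodes keeps the column positions stable even when a cell, such as the SHA1 one, has no direct text.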

Votes: 0

Original page content provided by Stack Overflow; translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/41439871
