Q: Scrapy can't crawl Craigslist

Stack Overflow user

Asked 2013-02-07 15:46:07

1 answer · 1.2K views · 0 followers · score 1

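This same code crawls the Yellow Pages without any problems, as expected. After changing the rule to target Craigslist, it hits the first URL and then exits with no relevant output.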

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from craigs.items import CraigsItem

class MySpider(CrawlSpider):
        name = "craigs"
        allowed_domains = ["craiglist.org"]

        start_urls = ["http://newyork.craigslist.org/cpg/"]

        rules = [Rule(SgmlLinkExtractor(restrict_xpaths=('/html/body/blockquote[3]/p/a',)), follow=True, callback='parse_profile')]

        def parse_profile(self, response):
                found = []
                img = CraigsItem()
                hxs = HtmlXPathSelector(response)
                img['title'] = hxs.select('//h2[contains(@class, "postingtitle")]/text()').extract()
                img['text'] = hxs.select('//section[contains(@id, "postingbody")]/text()').extract()
                img['tags'] =  hxs.select('//html/body/article/section/section[2]/section[2]/ul/li[1]').extract()

                print found[0]
                return found[0]

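Here is the output: http://pastie.org/6087878. As you can see, it has no problem picking up the first URL to crawl, <http://newyork.craigslist.org/mnh/cpg/3600242403.html>, but then it dies.

From the command-line shell I can dump all the links, either by XPath with SgmlLinkExtractor(restrict_xpaths=('/html/body/blockquote[3]/p/a',)).extract_links(response) or by pattern with SgmlLinkExtractor(allow=r'/cpg/.+').extract_links(response)

Output -> http://pastie.org/6085322

But in the crawl, the same queries fail. WTF?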

scrapy

rules

web-crawler

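1 Answer

Stack Overflow user

Answered 2013-02-08 07:12:51

If you look at the documentation, you will see:

allowed_domains: An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list won't be followed if OffsiteMiddleware is enabled.

Your allowed domain is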

 allowed_domains = ["craiglist.org"]

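But you are trying to fetch a URL on newyork.craigslist.org, which does not fall under that entry (note the spelling: "craiglist.org" versus "craigslist.org").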

02-07 15:39:03+0000 [craigs] DEBUG: Filtered offsite request to 'newyork.craigslist.org': <GET http://newyork.craigslist.org/mnh/cpg/3600242403.html>

That is why it is being filtered.

Either remove allowed_domains from your spider, or add the proper domain to it, to avoid the filtered offsite requests.
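To see why the request in the log line is dropped, here is a minimal standard-library sketch of the kind of hostname check the offsite filter performs. It is an approximation for illustration, not Scrapy's actual code: the helper names build_host_regex and is_offsite are hypothetical, and the assumption is that a host passes when it equals an allowed domain or is a subdomain of one.

```python
import re
from urllib.parse import urlparse


def build_host_regex(allowed_domains):
    """Build a regex matching each allowed domain and any of its subdomains.

    Hypothetical helper: an approximation of the offsite check, not Scrapy API.
    """
    if not allowed_domains:
        # An empty list means nothing is filtered: the empty pattern matches any host.
        return re.compile("")
    domains = "|".join(re.escape(d) for d in allowed_domains)
    return re.compile(r"^(.*\.)?(%s)$" % domains)


def is_offsite(url, allowed_domains):
    """Return True if the URL's host is not covered by allowed_domains."""
    host = urlparse(url).hostname or ""
    return not build_host_regex(allowed_domains).search(host)


url = "http://newyork.craigslist.org/mnh/cpg/3600242403.html"

# The value from the question, "craiglist.org", does not cover the host.
print(is_offsite(url, ["craiglist.org"]))   # True

# With "craigslist.org" the newyork subdomain is covered.
print(is_offsite(url, ["craigslist.org"]))  # False
```

Under that assumption, setting allowed_domains = ["craigslist.org"] in the spider should be enough to let the newyork.craigslist.org links through.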

Score 3

Original content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/14755173
