I am trying to get a Scrapy spider working, but there seems to be a problem with SgmlLinkExtractor.
Here is the signature:

SgmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(), tags=('a', 'area'), attrs=('href',), canonicalize=True, unique=True, process_value=None)
I am using the allow option, and here is my code:
start_urls = ['http://bigbangtrans.wordpress.com']
rules = [Rule(SgmlLinkExtractor(allow=[r'series-\d{1}-episode-\d{2}.']), callback='parse_item')]

An example URL looks like http://bigbangtrans.wordpress.com/series-1-episode-11-the-pancake-batter-anomaly/
The output of scrapy crawl tbbt contains lines such as:

DEBUG: Crawled (200) <GET http://bigbangtrans.wordpress.com/series-3-episode-17-the-precious-fragmentation/> (referer: http://bigbangtrans.wordpress.com)
However, the parse_item callback is never called, and I cannot work out why.
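As a first sanity check, independent of Scrapy, the allow pattern can be tested directly against one of the crawled URLs with the standard re module; if the regex itself never matched, the rule could never fire. This is only a sketch, using a URL taken from the crawl log above; the link extractor matches allow patterns by searching within the URL, which re.search reproduces:

```python
import re

# The allow pattern from the rule above.
pattern = r'series-\d{1}-episode-\d{2}.'

# One of the URLs the crawl log shows being fetched.
url = 'http://bigbangtrans.wordpress.com/series-3-episode-17-the-precious-fragmentation/'

# A match anywhere in the URL is enough for the extractor to keep the link.
match = re.search(pattern, url)
print(match.group(0) if match else 'no match')  # → series-3-episode-17-
```

Since the pattern does match, the problem is unlikely to be the regex itself.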
Here is the full code of the spider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
class TbbtSpider(CrawlSpider):
    #print '\n TbbtSpider \n'
    name = 'tbbt'
    start_urls = ['http://bigbangtrans.wordpress.com'] # urls from which the spider will start crawling
    rules = [Rule(SgmlLinkExtractor(allow=[r'series-\d{1}-episode-\d{2}.']), callback='parse_item')]

    def parse_item(self, response):
        print '\n parse_blogpost \n'
        hxs = HtmlXPathSelector(response)
        item = TbbtItem()  # TbbtItem is assumed to be defined in the project's items.py
        # Extract title
        item['title'] = hxs.select('//div[@id="post-5"]/div/p/span/text()').extract() # XPath selector for title
        return item

Posted on 2013-01-29 01:23:35
OK, so this code does not work because the syntax of your rules is incorrect. I fixed the syntax without making any other changes, and I was able to hit the parse_item callback.
rules = (
    Rule(SgmlLinkExtractor(allow=(r'series-\d{1}-episode-\d{2}.',)),
         callback='parse_item'),
)

However, the titles are all blank, which suggests that the hxs.select statement in parse_item is incorrect. The following XPath may be a better fit (I have taken a guess at the title you want, but I may be completely wrong):
item['title'] = hxs.select('//h2[@class="title"]/text()').extract()

Source: https://stackoverflow.com/questions/14573886
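The suggested selector can be smoke-tested offline against a hand-written fragment of the page markup, without running a crawl. The snippet below is only a sketch: the sample HTML is invented here, and it uses the standard-library ElementTree rather than Scrapy's HtmlXPathSelector, whose XPath dialect is richer:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for one post on the page.
sample = (
    '<div class="post">'
    '<h2 class="title">Series 1 Episode 11 - The Pancake Batter Anomaly</h2>'
    '<p>Transcript text...</p>'
    '</div>'
)

root = ET.fromstring(sample)
# ElementTree has no /text() step, so select the element and read .text.
titles = [h2.text for h2 in root.findall('.//h2[@class="title"]')]
print(titles)  # → ['Series 1 Episode 11 - The Pancake Batter Anomaly']
```

If the selector finds nothing against the real page, inspecting the actual markup in the browser will show which class the theme puts on the title element.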