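I want to create a Scrapy script that scrapes all results for computer gigs in any craigslist subdomain, for example: http://losangeles.craigslist.org/search/cpg/. This query returns a list of many postings. I tried to scrape the title and href of every result (not just those on the first page) using CrawlSpider and LinkExtractor, but the script returns nothing. I'll paste my script here. Thanks.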

import scrapy
from scrapy.spiders import Rule,CrawlSpider
from scrapy.linkextractors import LinkExtractor

class CraigspiderSpider(CrawlSpider):
    name = "CraigSpider"
    allowed_domains = ["http://losangeles.craigslist.org"]
    start_urls = (
        'http://losangeles.craigslist.org/search/cpg/',
    )
    rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="button next"]',)), callback="parse_page", follow= True),)

    def parse_page(self, response):
        items = response.selector.xpath("//p[@class='row']")
    for i in items:
        link = i.xpath("./span[@class='txt']/span[@class='pl']/a/@href").extract()
        title = i.xpath("./span[@class='txt']/span[@class='pl']/a/span[@id='titletextonly']/text()").extract()
        print link,title

Posted on 2016-03-13 00:47:30

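Based on the code you pasted, parse_page:
1. does not return or yield anything, and
2. contains only one line, the assignment to items.
The reason for #2 above is that the for loop is not indented correctly, so it is not part of the method.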
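
Both points can be reproduced without Scrapy. The following is only a minimal sketch: `rows`, `broken_parse`, and `fixed_parse` are hypothetical names standing in for the selector results and for parse_page, and are not part of the original spider.

```python
# Hypothetical stand-in for the rows an XPath query would return.
rows = [
    {"href": "/cpg/1.html", "title": "Fix my laptop"},
    {"href": "/cpg/2.html", "title": "Build a website"},
]


def broken_parse(rows):
    # The body ends after this line, so the call evaluates to None.
    items = rows


def fixed_parse(rows):
    items = rows
    # The loop is part of the function, and every result is yielded.
    for i in items:
        yield dict(link=i["href"], title=i["title"])


broken_result = broken_parse(rows)
fixed_result = list(fixed_parse(rows))
print(broken_result)
print(fixed_result)
```

A callback that returns None gives Scrapy nothing to collect, while a generator callback hands over one item per row.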
Try indenting the for loop:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class CraigspiderSpider(CrawlSpider):
    name = "CraigSpider"
    # allowed_domains takes bare host names, not URLs with a scheme
    allowed_domains = ["losangeles.craigslist.org"]
    start_urls = ('http://losangeles.craigslist.org/search/cpg/',)
    # The trailing comma makes rules a one-element tuple
    rules = (Rule(
        LinkExtractor(allow=(), restrict_xpaths=('//a[@class="button next"]',)),
        callback="parse_page", follow=True),)

    def parse_page(self, response):
        items = response.selector.xpath("//p[@class='row']")
        # The loop is now part of parse_page
        for i in items:
            link = i.xpath("./span[@class='txt']/span[@class='pl']/a/@href").extract()
            title = i.xpath("./span[@class='txt']/span[@class='pl']/a/span[@id='titletextonly']/text()").extract()
            print(link, title)
            # Yield each result so Scrapy can collect it
            yield dict(link=link, title=title)

https://stackoverflow.com/questions/35960330