I'm trying to write a crawler that keeps clicking the next button on a web page until it can't any more (or until I add some logic to make it stop). The code below correctly grabs the next-page link, but only prints it once. My question is: why doesn't it "follow" the link that each "next" button points to?
class MyprojectSpider(scrapy.Spider):
    name = 'redditbot'
    allowed_domains = ['https://www.reddit.com/r/nfl/?count=25&after=t3_7ax8lb']
    start_urls = ['https://www.reddit.com/r/nfl/?count=25&after=t3_7ax8lb']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        next_page = hxs.select('//div[@class="nav-buttons"]//a/@href').extract()
        if next_page:
            yield Request(next_page[1], self.parse)
            print(next_page[1])

Posted on 2017-11-06 08:45:23
To go to the next page, instead of just printing the link you need to yield a scrapy.Request object, like in the following code:
import scrapy

class MyprojectSpider(scrapy.Spider):
    name = 'myproject'
    allowed_domains = ['reddit.com']
    start_urls = ['https://www.reddit.com/r/nfl/']

    def parse(self, response):
        posts = response.xpath('//div[@class="top-matter"]')
        for post in posts:
            # Get your data here
            title = post.xpath('p[@class="title"]/a/text()').extract()
            print(title)
        # Go to next page
        next_page = response.xpath('//span[@class="next-button"]/a/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

Update: the previous code was wrong; it needs to use absolute URLs, and some of the XPath expressions were wrong too. This new version should work.
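The response.urljoin call is what turns the relative href scraped from the page into an absolute URL that scrapy.Request accepts. A minimal sketch of that resolution using Python's standard urllib.parse.urljoin, which response.urljoin wraps with the response's own URL as the base:

```python
from urllib.parse import urljoin

# The page URL acts as the base for resolving relative links.
base = "https://www.reddit.com/r/nfl/"

# A root-relative href like the one in the "next" button resolves
# against the scheme and host of the base URL.
print(urljoin(base, "/r/nfl/?count=25&after=t3_7ax8lb"))
# → https://www.reddit.com/r/nfl/?count=25&after=t3_7ax8lb

# An href that is already absolute is returned unchanged.
print(urljoin(base, "https://www.reddit.com/r/nfl/?count=50"))
# → https://www.reddit.com/r/nfl/?count=50
```

Yielding a Request built from the bare relative href instead of the joined one is what fails, which is why the update above switched to absolute URLs.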
Hope it helps!
https://stackoverflow.com/questions/47128116