Q: Best way to follow links in a Scrapy web crawler

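Stack Overflow user

Asked 2017-11-06 08:29:10

Answers: 1 | Views: 181 | Followers: 0 | Votes: 0

I am trying to write a spider that keeps clicking the "next" button on a web page until it can't anymore (or until I add some logic to make it stop). The code below correctly gets the link to the next page, but prints it only once. My question is: why doesn't it "follow" the link that each "next" button points to?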

class MyprojectSpider(scrapy.Spider):
    name = 'redditbot'
    allowed_domains = ['https://www.reddit.com/r/nfl/?count=25&after=t3_7ax8lb']
    start_urls = ['https://www.reddit.com/r/nfl/?count=25&after=t3_7ax8lb']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        next_page = hxs.select('//div[@class="nav-buttons"]//a/@href').extract()
        if next_page:
            yield Request(next_page[1], self.parse)
            print(next_page[1])

python

scrapy

web-crawler

1 Answer

Stack Overflow user

Answered 2017-11-06 08:45:23

To go to the next page, you just need to yield a scrapy.Request object instead of printing the link, as in the following code:

import scrapy

class MyprojectSpider(scrapy.Spider):
    name = 'myproject'
    allowed_domains = ['reddit.com']
    start_urls = ['https://www.reddit.com/r/nfl/']

    def parse(self, response):
        # Each post on the listing page sits in a "top-matter" div
        posts = response.xpath('//div[@class="top-matter"]')
        for post in posts:
            # Get your data here
            title = post.xpath('p[@class="title"]/a/text()').extract_first()
            print(title)

        # Go to the next page once per response, after all posts are handled
        next_page = response.xpath('//span[@class="next-button"]/a/@href').extract_first()
        if next_page:
            # urljoin() makes the href absolute before it is requested
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
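
Note that allowed_domains here holds a bare domain name ('reddit.com') rather than a full URL as in the question. Scrapy expects domain names in that list, and requests to hosts that do not match are dropped as offsite, which is a likely reason the original spider printed one link and then stopped. The snippet below is only a simplified sketch of a domain-based check, not Scrapy's actual implementation; the helper name is_allowed and the example "after" token are made up for illustration.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Hypothetical, simplified stand-in for a domain-based offsite check."""
    host = urlparse(url).hostname or ""
    # A host matches if it equals an allowed domain or is a subdomain of it
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


# Illustrative next-page URL (the "after" token is invented)
next_url = "https://www.reddit.com/r/nfl/?count=50&after=t3_example"

# A full URL in the list never equals a hostname, so nothing matches
url_entry_result = is_allowed(
    next_url, ["https://www.reddit.com/r/nfl/?count=25&after=t3_7ax8lb"]
)
# A bare domain matches www.reddit.com as a subdomain
domain_entry_result = is_allowed(next_url, ["reddit.com"])

print(url_entry_result)
print(domain_entry_result)
```

With a URL in the list the check fails for every follow-up request, while the bare domain lets them through.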

Update: the previous code was wrong. It needs to use absolute URLs, and some of the XPaths were wrong as well. This new version should work.
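
To illustrate the absolute-URL point: response.urljoin() resolves an href against the URL of the page it came from, roughly like the standard library's urljoin. The sketch below uses only urllib.parse, with the start URL from the spider as the base; the relative href is an assumed example, not something taken from the real page.

```python
from urllib.parse import urljoin

# Base: the page the link was found on
page_url = "https://www.reddit.com/r/nfl/"

# Assumed example hrefs, one relative and one already absolute
relative_href = "/r/nfl/?count=25&after=t3_7ax8lb"
absolute_href = "https://www.reddit.com/r/nfl/?count=25&after=t3_7ax8lb"

# A relative href is resolved against the page URL
resolved_relative = urljoin(page_url, relative_href)
# An already absolute href passes through unchanged
resolved_absolute = urljoin(page_url, absolute_href)

print(resolved_relative)
print(resolved_absolute)
```

Both calls produce the same absolute URL, so joining is safe whether the site emits relative or absolute links.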

Hope it helps!

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/47128116
