I want to scrape the subsequent pages of https://www.thetoptens.com/animals/ by clicking the Next button with scrapy-selenium, but it only ever scrapes the first page. I also tried plain Selenium webdriver, with the same result.
Code using scrapy-selenium:
import scrapy
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
class MovieSpider(scrapy.Spider):
    name = "movies"

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.thetoptens.com/animals/',
            callback=self.parse,
            wait_time=3,
            wait_until=EC.element_to_be_clickable((By.XPATH, "//div[@class='pages']/a[@class='g' and text()=9]"))
        )

    def parse(self, response):
        main_category = response.xpath("//div[@class='listgrid']/a/text()").getall()
        yield {
            "main": main_category
        }

Code using webdriver (Selenium):
import scrapy
from selenium import webdriver
class MovieSpider(scrapy.Spider):
    name = "movies"
    start_urls = ["https://www.thetoptens.com/animals/"]

    def __init__(self):
        self.driver = webdriver.Chrome(r"D:\python\chromedriver\chromedriver.exe")

    def parse(self, response):
        self.driver.get(self.start_urls[0])
        #next_page = self.driver.find_element_by_xpath("//div[@class='pages']/a[@class='g' and text()=9]")
        next_page = self.driver.find_element_by_xpath("//div[@class='pages']/a[text()=2]")
        if next_page:
            next_page.click()
        a = response.xpath("//div[@class='listgrid']/a/text()").getall()
        yield {
            "aaa": a
        }
        self.driver.close()

Posted on 2020-12-17 22:59:41
Your code runs sequentially and never iterates over the pages that may be available. Add a while loop that keeps clicking until it can no longer find the 'next' a tag.
import scrapy
from selenium import webdriver
class MovieSpider(scrapy.Spider):
    name = "movies"
    start_urls = ["https://www.thetoptens.com/animals/"]

    def __init__(self):
        self.driver = webdriver.Chrome(r"D:\python\chromedriver\chromedriver.exe")

    def parse(self, response):
        nextFlag = True
        self.driver.get(self.start_urls[0])
        while nextFlag:
            next_page = self.driver.find_element_by_xpath("//div[@class='pages']/a[@class='g' and text()=9]")
            if next_page:
                a = response.xpath("//div[@class='listgrid']/a/text()").getall()
                yield {
                    "aaa": a
                }
                next_page.click()
            else:
                nextFlag = False
        self.driver.close()

Posted on 2021-01-12 20:19:11
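Two caveats with the loop as written: it yields from the original Scrapy `response` every iteration (so it re-emits the first page), and Selenium's `find_element_by_xpath` raises `NoSuchElementException` rather than returning a falsy value, so the `else` branch is never reached. A self-contained sketch of the corrected loop shape is below; `FakeDriver` and its canned pages are hypothetical stand-ins for `webdriver.Chrome` so the logic can run without a browser, and with real Selenium you would re-parse `driver.page_source` each pass and catch `selenium.common.exceptions.NoSuchElementException`:

```python
class NoSuchElementException(Exception):
    """Stand-in for Selenium's exception of the same name."""

class FakeDriver:
    """Hypothetical stub: serves three canned 'pages'; find_element raises on the last."""
    def __init__(self):
        self.pages = [["cat", "dog"], ["fox", "owl"], ["eel"]]
        self.index = 0

    @property
    def page_source(self):
        return self.pages[self.index]

    def find_element_by_xpath(self, xpath):
        if self.index + 1 >= len(self.pages):
            raise NoSuchElementException(xpath)
        return self  # stands in for the 'next' link element

    def click(self):
        self.index += 1

def scrape_all(driver):
    """Yield the items of every page, clicking 'next' until it disappears."""
    while True:
        # Parse the driver's CURRENT page each pass, not the original response.
        yield list(driver.page_source)
        try:
            next_page = driver.find_element_by_xpath("//div[@class='pages']/a")
        except NoSuchElementException:
            break  # no more pages
        next_page.click()

items = list(scrape_all(FakeDriver()))
```

In a real spider, the `yield list(driver.page_source)` line would instead build a `scrapy.Selector(text=driver.page_source)` and run the `//div[@class='listgrid']/a/text()` XPath on it.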
Another solution is to use the AJAX request URLs directly. To find them, open your browser's inspector and watch the Network tab:
https://www.thetoptens.com/ajaxcategory.asp?p=1&c=16&sc=&ssc=&n=1367&sort=&type=
https://www.thetoptens.com/ajaxcategory.asp?p=2&c=16&sc=&ssc=&n=1367&sort=&type=
https://www.thetoptens.com/ajaxcategory.asp?p=3&c=16&sc=&ssc=&n=1367&sort=&type=
...
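With this approach, pagination reduces to incrementing the `p` query parameter while keeping the category parameters fixed (`c=16` and `n=1367` for this listing, per the captured URLs). A minimal sketch of building those URLs; the function name is hypothetical, and in a Scrapy spider each URL would be yielded as a `scrapy.Request`:

```python
from urllib.parse import urlencode

BASE = "https://www.thetoptens.com/ajaxcategory.asp"

def ajax_page_url(page, category=16, total=1367):
    """Build the AJAX URL for one page of a category listing.

    Parameter names mirror the query strings captured in the Network
    tab: p = page number, c = category id, n = item count.
    """
    params = {"p": page, "c": category, "sc": "", "ssc": "", "n": total,
              "sort": "", "type": ""}
    return f"{BASE}?{urlencode(params)}"

# First three pages, matching the captured requests above:
urls = [ajax_page_url(p) for p in range(1, 4)]
```

Since these endpoints return the list fragments directly, no Selenium or button-clicking is needed at all; plain Scrapy requests suffice.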
Documentation:
https://stackoverflow.com/questions/65342988