I want to scrape the subsequent pages of https://www.thetoptens.com/animals/ by clicking the Next button with scrapy-selenium, but it only ever scrapes the first page. I also tried plain Selenium webdriver, with the same result.
Code using scrapy-selenium:
import scrapy
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
class MovieSpider(scrapy.Spider):
    name = "movies"

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.thetoptens.com/animals/',
            callback=self.parse,
            wait_time=3,
            wait_until=EC.element_to_be_clickable((By.XPATH, "//div[@class='pages']/a[@class='g' and text()=9]"))
        )

    def parse(self, response):
        main_category = response.xpath("//div[@class='listgrid']/a/text()").getall()
        yield {
            "main": main_category
        }

Code using webdriver (Selenium):
import scrapy
from selenium import webdriver
class MovieSpider(scrapy.Spider):
    name = "movies"
    start_urls = ["https://www.thetoptens.com/animals/"]

    def __init__(self):
        self.driver = webdriver.Chrome(r"D:\python\chromedriver\chromedriver.exe")

    def parse(self, response):
        self.driver.get(self.start_urls[0])
        #next_page = self.driver.find_element_by_xpath("//div[@class='pages']/a[@class='g' and text()=9]")
        next_page = self.driver.find_element_by_xpath("//div[@class='pages']/a[text()=2]")
        if next_page:
            next_page.click()
        a = response.xpath("//div[@class='listgrid']/a/text()").getall()
        yield {
            "aaa": a
        }
        self.driver.close()

Posted on 2020-12-17 22:59:41
Your code runs sequentially and never iterates over the pages that may be available. Add a while loop that keeps clicking until it can no longer find the 'next' a tag.
import scrapy
from selenium import webdriver
class MovieSpider(scrapy.Spider):
    name = "movies"
    start_urls = ["https://www.thetoptens.com/animals/"]

    def __init__(self):
        self.driver = webdriver.Chrome(r"D:\python\chromedriver\chromedriver.exe")

    def parse(self, response):
        nextFlag = True
        self.driver.get(self.start_urls[0])
        while nextFlag:
            next_page = self.driver.find_element_by_xpath("//div[@class='pages']/a[@class='g' and text()=9]")
            if next_page:
                a = response.xpath("//div[@class='listgrid']/a/text()").getall()
                yield {
                    "aaa": a
                }
                next_page.click()
            else:
                nextFlag = False
        self.driver.close()

Posted on 2021-01-12 20:19:11
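Two caveats with the loop as written: it yields from the original Scrapy `response` every iteration (so it re-emits the first page), and Selenium's `find_element_by_xpath` raises `NoSuchElementException` rather than returning a falsy value, so the `else` branch is never reached. A self-contained sketch of the corrected loop shape is below; `FakeDriver` and its canned pages are hypothetical stand-ins for `webdriver.Chrome` so the logic can run without a browser, and with real Selenium you would re-parse `driver.page_source` each pass and catch `selenium.common.exceptions.NoSuchElementException`:

```python
class NoSuchElementException(Exception):
    """Stand-in for Selenium's exception of the same name."""

class FakeDriver:
    """Hypothetical stub: serves three canned 'pages'; find_element raises on the last."""
    def __init__(self):
        self.pages = [["cat", "dog"], ["fox", "owl"], ["eel"]]
        self.index = 0

    @property
    def page_source(self):
        return self.pages[self.index]

    def find_element_by_xpath(self, xpath):
        if self.index + 1 >= len(self.pages):
            raise NoSuchElementException(xpath)
        return self  # stands in for the 'next' link element

    def click(self):
        self.index += 1

def scrape_all(driver):
    """Yield the items of every page, clicking 'next' until it disappears."""
    while True:
        # Parse the driver's CURRENT page each pass, not the original response.
        yield list(driver.page_source)
        try:
            next_page = driver.find_element_by_xpath("//div[@class='pages']/a")
        except NoSuchElementException:
            break  # no more pages
        next_page.click()

items = list(scrape_all(FakeDriver()))
```

In a real spider, the `yield list(driver.page_source)` line would instead build a `scrapy.Selector(text=driver.page_source)` and run the `//div[@class='listgrid']/a/text()` XPath on it.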
Another solution is to use the AJAX request URLs directly. To find them, open your browser's inspector and watch the Network tab:
https://www.thetoptens.com/ajaxcategory.asp?p=1&c=16&sc=&ssc=&n=1367&sort=&type=
https://www.thetoptens.com/ajaxcategory.asp?p=2&c=16&sc=&ssc=&n=1367&sort=&type=
https://www.thetoptens.com/ajaxcategory.asp?p=3&c=16&sc=&ssc=&n=1367&sort=&type=
...
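With this approach, pagination reduces to incrementing the `p` query parameter while keeping the category parameters fixed (`c=16` and `n=1367` for this listing, per the captured URLs). A minimal sketch of building those URLs; the function name is hypothetical, and in a Scrapy spider each URL would be yielded as a `scrapy.Request`:

```python
from urllib.parse import urlencode

BASE = "https://www.thetoptens.com/ajaxcategory.asp"

def ajax_page_url(page, category=16, total=1367):
    """Build the AJAX URL for one page of a category listing.

    Parameter names mirror the query strings captured in the Network
    tab: p = page number, c = category id, n = item count.
    """
    params = {"p": page, "c": category, "sc": "", "ssc": "", "n": total,
              "sort": "", "type": ""}
    return f"{BASE}?{urlencode(params)}"

# First three pages, matching the captured requests above:
urls = [ajax_page_url(p) for p in range(1, 4)]
```

Since these endpoints return the list fragments directly, no Selenium or button-clicking is needed at all; plain Scrapy requests suffice.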
Documentation:
https://stackoverflow.com/questions/65342988