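I'm trying to scrape a website that has a "Show More" button, and I'm having trouble getting the information that appears after you click it.
Currently, I'm trying to scrape the links to all of the articles on this page: "https://www.nytimes.com/section/world"
I've managed to click the "Show More" button with selenium, but I still can't get the additional links. Here is what I have so far: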
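
    from selenium import webdriver

    driver = webdriver.Chrome(executable_path="/Users/cherlin/Documents/北大/大一/文计/期末大作业/程序/chromedriver")
    driver.get("https://www.nytimes.com/section/world")
    # Click the "Show More" button in the latest-articles panel
    element = driver.find_element_by_xpath('//*[@id="latest-panel"]/div[1]/div/div/button').click()
    # Collect the article link elements currently on the page
    links = driver.find_elements_by_css_selector('a.story-link')

`links` comes back as a list of 40 articles. I'm still working out how to get the actual links, but first I need to figure out how to get the hidden ones.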
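For the "actual links" part, one possible approach is to read the `href` attribute of each matched element. The helper below is only a sketch: `extract_hrefs` is a hypothetical name, and it assumes each element exposes `get_attribute("href")` the way a Selenium `WebElement` does. It does not solve the hidden-links problem by itself.

```python
def extract_hrefs(elements):
    """Return the unique, non-empty href values of the elements, in page order."""
    seen = set()
    hrefs = []
    for element in elements:
        # WebElement.get_attribute returns None when the attribute is missing
        href = element.get_attribute("href")
        if href and href not in seen:
            seen.add(href)
            hrefs.append(href)
    return hrefs
```

With the code above this would be called as `extract_hrefs(links)`.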
Answered 2019-01-21 07:54:52
You can use the requests library to get the JSON data:

    import requests

    page = 0
    items = [' ']    # start non-empty so the loop runs at least once

    while items:
        # Query parameters for one page of results
        data = {"q" : "", "sort" : "newest", "page" : page, "dom" : "www.nytimes.com", "dedupe_hl" : "y"}
        r = requests.get("https://www.nytimes.com/svc/collections/v1/publish/www.nytimes.com/section/world", params=data)
        json_data = r.json()
        items = json_data['members']['items']

        # Print the headline (truncated and padded to 50 characters) and the URL
        for item in items:
            print(f"{item['headline'][:50]:50} {item['url']}")

        page += 1

This would give you output starting:

    Lunar Eclipse and Supermoon: Photos From Around th https://www.nytimes.com/2019/01/21/science/lunar-eclipse-supermoon.html
    By the Numbers, China’s Economy Is Worse Than It L https://www.nytimes.com/2019/01/20/business/china-economy-gdp-fourth-quarter.html
    Henry Sy, the Philippines’ Richest Man and a Shopp https://www.nytimes.com/2019/01/20/world/asia/henry-sy-dead.html
    Carlos Ghosn Offers Higher Bail and Security Guard https://www.nytimes.com/2019/01/20/business/carlos-ghosn-bail-japan.html
    American Airstrike in Somalia Kills 52 Shabab Extr https://www.nytimes.com/2019/01/20/world/africa/airstrike-shabab-somalia.html

This approach will be much faster than using selenium. The loop keeps requesting more pages until 0 items are returned.
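If the goal is a list of links rather than printed output, the same "request pages until one comes back empty" loop can collect the URLs. This is a minimal sketch, not a tested scraper: `fetch_page`, `collect_links`, the `max_pages` cap, and the 10-second timeout are illustrative additions, and it assumes the endpoint still returns the `members` / `items` structure used above.

```python
BASE_URL = "https://www.nytimes.com/svc/collections/v1/publish/www.nytimes.com/section/world"


def fetch_page(page):
    """Fetch one page of results and return its list of items."""
    # Imported here so the paging logic below can be run without requests installed
    import requests

    params = {"q": "", "sort": "newest", "page": page,
              "dom": "www.nytimes.com", "dedupe_hl": "y"}
    r = requests.get(BASE_URL, params=params, timeout=10)
    r.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
    return r.json()["members"]["items"]


def collect_links(fetch=fetch_page, max_pages=50):
    """Collect article URLs page by page until a page returns no items."""
    links = []
    for page in range(max_pages):  # hypothetical safety cap on the number of requests
        items = fetch(page)
        if not items:
            break
        links.extend(item["url"] for item in items)
    return links
```

Calling `collect_links()` with no arguments would query the live endpoint. Passing a different `fetch` function makes it possible to try the paging logic against canned data first.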
https://stackoverflow.com/questions/54282406