I'm having trouble scraping the ESPN Gamecast links from the ESPN scoreboard page. I tried:
import requests
from bs4 import BeautifulSoup

site = "https://www.espn.com/mlb/scoreboard"
html = requests.get(site).text
soup = BeautifulSoup(html, 'html.parser').find_all('a')
links = [link.get('href') for link in soup]
But the Gamecast links are not being picked up.
Posted on 2021-06-30 08:42:33
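The page is loaded dynamically, so you need to either (a) use something like Selenium, which lets the page render before you parse it with bs4, or (b) go straight to the data source/API. The API is usually the best option: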
import requests

# ESPN scoreboard API endpoint (returns JSON)
api = 'http://site.api.espn.com/apis/site/v2/sports/baseball/mlb/scoreboard'
jsonData = requests.get(api).json()
events = jsonData['events']

links = []
for event in events:
    event_links = event['links']
    for each in event_links:
        # Keep only the link labelled 'Gamecast' for each game
        if each['text'] == 'Gamecast':
            links.append(each['href'])
Output:
print(links)
['http://www.espn.com/mlb/game/_/gameId/401228229', 'http://www.espn.com/mlb/game/_/gameId/401228235', 'http://www.espn.com/mlb/game/_/gameId/401228242', 'http://www.espn.com/mlb/game/_/gameId/401228240', 'http://www.espn.com/mlb/game/_/gameId/401228233', 'http://www.espn.com/mlb/game/_/gameId/401228234', 'http://www.espn.com/mlb/game/_/gameId/401228239', 'http://www.espn.com/mlb/game/_/gameId/401228237', 'http://www.espn.com/mlb/game/_/gameId/401228231', 'http://www.espn.com/mlb/game/_/gameId/401228232', 'http://www.espn.com/mlb/game/_/gameId/401228236', 'http://www.espn.com/mlb/game/_/gameId/401228230', 'http://www.espn.com/mlb/game/_/gameId/401228238', 'http://www.espn.com/mlb/game/_/gameId/401228243', 'http://www.espn.com/mlb/game/_/gameId/401228241']
Posted on 2021-06-24 13:24:23
Could it be that you left out the quotes? I tried the following and it does produce output.
import requests
from bs4 import BeautifulSoup

site = 'https://www.espn.com/mlb/scoreboard/_/date/20210624'
html = requests.get(site).text
soup = BeautifulSoup(html, 'html.parser').find_all('a')
links = [link.get('href') for link in soup]
print(links)
https://stackoverflow.com/questions/68116431
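Getting some output from the static request does not by itself show that Gamecast links are among the anchors. A minimal sketch for checking this, assuming Gamecast URLs keep the `/mlb/game/_/gameId/<digits>` shape seen in the API output quoted earlier (ESPN may change it). The helper `filter_gamecast_links`, the `base_url` default, and the sample data are hypothetical and only for illustration:

```python
import re
from urllib.parse import urljoin

# Assumption: Gamecast URLs look like /mlb/game/_/gameId/<digits>, as in the
# API output quoted in the first answer. This is not a documented guarantee.
GAMECAST_PATTERN = re.compile(r"/mlb/game/_/gameId/\d+")


def filter_gamecast_links(hrefs, base_url="https://www.espn.com"):
    """Return unique absolute Gamecast URLs from a list of raw href values."""
    seen = set()
    result = []
    for href in hrefs:
        # link.get('href') returns None for anchors without an href attribute
        if not href:
            continue
        if GAMECAST_PATTERN.search(href):
            # Resolve relative paths against the site root; absolute URLs pass through
            absolute = urljoin(base_url, href)
            if absolute not in seen:
                seen.add(absolute)
                result.append(absolute)
    return result


# Hypothetical sample standing in for the `links` list built from the page
sample_hrefs = [
    None,
    "/mlb/scoreboard",
    "/mlb/game/_/gameId/401228229",
    "http://www.espn.com/mlb/game/_/gameId/401228235",
    "/mlb/game/_/gameId/401228229",
]
print(filter_gamecast_links(sample_hrefs))
```

Passing the real `links` list to `filter_gamecast_links` and getting an empty result would be consistent with the Gamecast anchors being injected after page load, in which case the API approach (or a rendered page via Selenium) is the way to go.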