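I want to create a table exactly like the one shown on the following web page: https://247sports.com/college/penn-state/Season/2022-Football/Commits/
I'm currently using Selenium in a Google notebook to do this, because I get a Forbidden error when I run the `read_html` command. I'm just starting to get some output, but I only want the text, not the markup surrounding it.
Here is my code so far: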
from kora.selenium import wd
from bs4 import BeautifulSoup
import pandas as pd
import time
import datetime as dt
import os
import re
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
url = 'https://247sports.com/college/penn-state/Season/2022-Football/Commits/'
wd.get(url)
time.sleep(5)
soup = BeautifulSoup(wd.page_source)
school=soup.find_all('span', class_='meta')
name=soup.find_all('div', class_='recruit')
position = soup.find_all('div', class_="position")
height_weight = soup.find_all('div', class_="metrics")
rating = soup.find_all('span', class_='score')
nat_rank = soup.find_all('a', class_='natrank')
state_rank = soup.find_all('a', class_='sttrank')
pos_rank = soup.find_all('a', class_='posrank')
status = soup.find_all('p', class_='commit-date withDate')
status
...and this is my output:
[<p class="commit-date withDate"> Commit 7/25/2020 </p>,
<p class="commit-date withDate"> Commit 9/4/2020 </p>,
<p class="commit-date withDate"> Commit 1/1/2021 </p>,
<p class="commit-date withDate"> Commit 3/8/2021 </p>,
<p class="commit-date withDate"> Commit 10/29/2020 </p>,
<p class="commit-date withDate"> Commit 7/28/2020 </p>,
<p class="commit-date withDate"> Commit 9/8/2020 </p>,
<p class="commit-date withDate"> Commit 8/3/2020 </p>,
<p class="commit-date withDate"> Commit 5/1/2021 </p>]
Any help with this would be greatly appreciated.
Posted on 2021-05-04 16:41:23
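There is no need to use Selenium. To get a response from the website, you need to specify the HTTP User-Agent header; otherwise the website thinks you are a bot and will block you.
To create a DataFrame, see the example below: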
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "https://247sports.com/college/penn-state/Season/2022-Football/Commits/"
# Add the `user-agent` otherwise we will get blocked when sending the request
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
}
response = requests.get(url, headers=headers).content
soup = BeautifulSoup(response, "html.parser")
data = []
for tag in soup.find_all("li", class_="ri-page__list-item")[1:]:  # `[1:]` Since the first result is a table header
    school = tag.find_next("span", class_="meta").text
    name = tag.find_next("a", class_="ri-page__name-link").text
    position = tag.find_next("div", class_="position").text
    height_weight = tag.find_next("div", class_="metrics").text
    rating = tag.find_next("span", class_="score").text
    nat_rank = tag.find_next("a", class_="natrank").text
    state_rank = tag.find_next("a", class_="sttrank").text
    pos_rank = tag.find_next("a", class_="posrank").text
    status = tag.find_next("p", class_="commit-date withDate").text
    data.append(
        {
            "school": school,
            "name": name,
            "position": position,
            "height_weight": height_weight,
            "rating": rating,
            "nat_rank": nat_rank,
            "state_rank": state_rank,
            "pos_rank": pos_rank,
            "status": status,
        }
    )
df = pd.DataFrame(data)
print(df.to_string())
Output:
   school                                              name            position  height_weight  rating  nat_rank  state_rank  pos_rank  status
0  Westerville South (Westerville, OH)                 Kaden Saunders  WR        5-10 / 172     0.9509  116       5           16        Commit 7/25/2020
1  IMG Academy (Bradenton, FL)                         Drew Shelton    OT        6-5 / 290      0.9468  130       17          14        Commit 9/4/2020
2  Central Dauphin East (Harrisburg, PA)               Mehki Flowers   WR        6-1 / 190      0.9461  131       4           18        Commit 1/1/2021
3  Medina (Medina, OH)                                 Drew Allar      PRO       6-5 / 220      0.9435  138       6           8         Commit 3/8/2021
4  Manheim Township (Lancaster, PA)                    Anthony Ivey    WR        6-0 / 190      0.9249  190       6           26        Commit 10/29/2020
5  King (Milwaukee, WI)                                Jerry Cross     TE        6-6 / 218      0.9153  218       4           8         Commit 7/28/2020
6  Northeast (Philadelphia, PA)                        Ken Talley      WDE       6-3 / 230      0.9069  253       9           13        Commit 9/8/2020
7  Central York (York, PA)                             Beau Pribula    DUAL      6-2 / 215      0.8891  370       12          9         Commit 8/3/2020
8  The Williston Northampton School (Easthampton, MA)  Maleek McNeil   OT        6-8 / 340      0.8593  705       8           64        Commit 5/1/2021
https://stackoverflow.com/questions/67387923
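Every value in the table above is still a string, and the raw tag text in the question's output carries surrounding spaces (for example " Commit 7/25/2020 "). If typed columns are wanted, a minimal sketch of post-processing could look like the following. The helper names `parse_height_weight` and `parse_commit_date` are hypothetical, and the sketch assumes every row follows the "feet-inches / pounds" and "Commit M/D/YYYY" patterns seen in the output; rows that deviate would raise an error.

```python
from datetime import datetime


def parse_height_weight(text):
    """Split a string like '5-10 / 172' into (height_in_inches, weight_lb).

    Assumes the 'feet-inches / pounds' layout seen in the scraped output.
    """
    height, weight = (part.strip() for part in text.split("/"))
    feet, inches = height.split("-")
    return int(feet) * 12 + int(inches), int(weight)


def parse_commit_date(text):
    """Turn a string like ' Commit 7/25/2020 ' into a datetime.date.

    Assumes the last whitespace-separated token is an M/D/YYYY date.
    """
    return datetime.strptime(text.split()[-1], "%m/%d/%Y").date()


print(parse_height_weight("5-10 / 172"))        # (70, 172)
print(parse_commit_date(" Commit 7/25/2020 "))  # 2020-07-25
```

With the DataFrame from the answer, these could be applied per column, for example `df["commit_date"] = df["status"].map(parse_commit_date)`.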