
Scraping a table from a football recruiting website

Stack Overflow user
Asked 2021-05-04 15:41:50
1 answer, 47 views, 0 followers, 1 vote

I want to create a table exactly like the one shown on the following page: https://247sports.com/college/penn-state/Season/2022-Football/Commits/

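I'm currently using Selenium in a Google Colab notebook to start on this, because I get a Forbidden error when I run a "read_html" command. I've just started getting some output, but I only want the text, not the markup surrounding it.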

My code so far is:

from kora.selenium import wd
from bs4 import BeautifulSoup
import pandas as pd
import time
import datetime as dt
import os
import re
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

url = 'https://247sports.com/college/penn-state/Season/2022-Football/Commits/'
wd.get(url)
time.sleep(5)

soup  = BeautifulSoup(wd.page_source)

school=soup.find_all('span', class_='meta')    
name=soup.find_all('div', class_='recruit')
position = soup.find_all('div', class_="position")
height_weight = soup.find_all('div', class_="metrics")
rating = soup.find_all('span', class_='score')
nat_rank = soup.find_all('a', class_='natrank')
state_rank = soup.find_all('a', class_='sttrank')
pos_rank = soup.find_all('a', class_='posrank')
status = soup.find_all('p', class_='commit-date withDate')

status

...and this is my output:

[<p class="commit-date withDate"> Commit 7/25/2020  </p>,
 <p class="commit-date withDate"> Commit 9/4/2020  </p>,
 <p class="commit-date withDate"> Commit 1/1/2021  </p>,
 <p class="commit-date withDate"> Commit 3/8/2021  </p>,
 <p class="commit-date withDate"> Commit 10/29/2020  </p>,
 <p class="commit-date withDate"> Commit 7/28/2020  </p>,
 <p class="commit-date withDate"> Commit 9/8/2020  </p>,
 <p class="commit-date withDate"> Commit 8/3/2020  </p>,
 <p class="commit-date withDate"> Commit 5/1/2021  </p>]

Any help with this would be greatly appreciated.


1 Answer

Stack Overflow user

Accepted answer

Answered 2021-05-04 16:41:23

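There is no need to use Selenium. To get a response from the website, you need to specify the HTTP User-Agent header; otherwise the website thinks you are a bot and blocks you.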
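The same idea can be sketched with only the standard library, which may help when `requests` is not available. The helper names `build_request` and `fetch_html` and the constant `BROWSER_UA` are hypothetical, introduced here for illustration; the User-Agent string is the one used in this answer, and whether the site keeps accepting it is an assumption.

```python
import urllib.request

# Browser-like User-Agent string, copied from this answer's example.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download a page and return its body as text (performs a network call)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Calling `fetch_html(url)` with the page URL should return the HTML as a string, which can then be passed to `BeautifulSoup(html, "html.parser")`. Without the header, `urllib` sends its default `Python-urllib/x.y` User-Agent, which sites commonly reject.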

To create a DataFrame, see the example below:

import pandas as pd
import requests
from bs4 import BeautifulSoup


url = "https://247sports.com/college/penn-state/Season/2022-Football/Commits/"
# Add the `user-agent` otherwise we will get blocked when sending the request
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
}


response = requests.get(url, headers=headers).content
soup = BeautifulSoup(response, "html.parser")
data = []

for tag in soup.find_all("li", class_="ri-page__list-item")[1:]:  # `[1:]` Since the first result is a table header
    school = tag.find_next("span", class_="meta").text
    name = tag.find_next("a", class_="ri-page__name-link").text
    position = tag.find_next("div", class_="position").text
    height_weight = tag.find_next("div", class_="metrics").text
    rating = tag.find_next("span", class_="score").text
    nat_rank = tag.find_next("a", class_="natrank").text
    state_rank = tag.find_next("a", class_="sttrank").text
    pos_rank = tag.find_next("a", class_="posrank").text
    status = tag.find_next("p", class_="commit-date withDate").text

    data.append(
        {
            "school": school,
            "name": name,
            "position": position,
            "height_weight": height_weight,
            "rating": rating,
            "nat_rank": nat_rank,
            "state_rank": state_rank,
            "pos_rank": pos_rank,
            "status": status,
        }
    )

df = pd.DataFrame(data)

print(df.to_string())

Output:

                                                    school            name position height_weight  rating nat_rank state_rank pos_rank                status
0                  Westerville South (Westerville, OH)      Kaden Saunders      WR    5-10 / 172   0.9509      116          5       16    Commit 7/25/2020  
1                          IMG Academy (Bradenton, FL)        Drew Shelton      OT     6-5 / 290   0.9468      130         17       14     Commit 9/4/2020  
2                Central Dauphin East (Harrisburg, PA)       Mehki Flowers      WR     6-1 / 190   0.9461      131          4       18     Commit 1/1/2021  
3                                  Medina (Medina, OH)          Drew Allar     PRO     6-5 / 220   0.9435      138          6        8     Commit 3/8/2021  
4                     Manheim Township (Lancaster, PA)        Anthony Ivey      WR     6-0 / 190   0.9249      190          6       26   Commit 10/29/2020  
5                                 King (Milwaukee, WI)         Jerry Cross      TE     6-6 / 218   0.9153      218          4        8    Commit 7/28/2020  
6                         Northeast (Philadelphia, PA)          Ken Talley     WDE     6-3 / 230   0.9069      253          9       13     Commit 9/8/2020  
7                              Central York (York, PA)        Beau Pribula    DUAL     6-2 / 215   0.8891      370         12        9     Commit 8/3/2020  
8   The Williston Northampton School (Easthampton, MA)       Maleek McNeil      OT     6-8 / 340   0.8593      705          8       64     Commit 5/1/2021  
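The `status` and `height_weight` values above are still raw strings, and `status` keeps the whitespace that surrounds the text in the page. A minimal sketch of one way to tidy them follows, assuming every value has the same shape as in the output shown (`feet-inches / pounds` and `label M/D/YYYY`). The helper names `parse_height_weight` and `parse_status` are hypothetical and not part of the original answer.

```python
from datetime import datetime


def parse_height_weight(text):
    """Split a string such as '5-10 / 172' into (height_in_inches, weight_lb)."""
    height, weight = (part.strip() for part in text.split("/"))
    feet, inches = height.split("-")
    # float() tolerates fractional inches; this is an assumption about the data.
    return int(feet) * 12 + float(inches), int(weight)


def parse_status(text):
    """Split a string such as ' Commit 7/25/2020  ' into (label, date)."""
    # str.split() with no arguments also discards leading/trailing whitespace.
    label, raw_date = text.split()
    return label, datetime.strptime(raw_date, "%m/%d/%Y").date()
```

For example, `parse_height_weight("5-10 / 172")` gives `(70.0, 172)`, and `parse_status(" Commit 7/25/2020  ")` gives the label `"Commit"` together with the date 2020-07-25. If only the whitespace is a problem, calling `.strip()` on each `.text` value when building the rows is enough.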
Votes: 2
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/67387923