
Problem concatenating a URL and scraping data

Stack Overflow user
Asked 2020-08-14 04:51:00
Answers: 1 · Views: 154 · Followers: 0 · Votes: 0

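I am trying to append to a URL in Python so that I can scrape details from the resulting target URL. I have the code below, but it appears to scrape data from url1 rather than from URL.

I have already scraped the team names from the NFL website without any problems. The problem is with the spotrac URL, to which I append the team names scraped from the NFL website.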

import requests
from bs4 import BeautifulSoup   

URL ='https://www.nfl.com/teams/'

page = requests.get(URL)
soup = BeautifulSoup(page.text, 'html.parser')

team_name = []

team_name_list = soup.find_all('h4',class_='d3-o-media-object__roofline nfl-c-custom-promo__headline')
for team in team_name_list:
  if team.find('p'):
      team_name.append(team.text)

for team in team_name: 
        
    team = team.replace(" ", "-").lower()

    url1 = 'https://www.spotrac.com/nfl/rankings/'
    URL = url1 +str(team)
    print(URL)
    data = {
        'ajax': 'true',
        'mobile': 'false'
    }
    
    bs_soup = BeautifulSoup(requests.post(URL, data=data).content, 'html.parser')
    spotrac_df = pd.DataFrame(columns = ['Name', 'Salary']) 
    
    for h3 in bs_soup.select('h3'):
        spotrac_df = spotrac_df.append(pd.DataFrame({'Name': str(h3.text), 'Salary' : str(h3.find_next(class_="rank-value").text)}, index=[0]), ignore_index=False)

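I am almost certain the problem is that the URL is not being appended correctly. The salaries and other details are scraped from url1 rather than from URL.

My console output for print(URL) (using the Spyder IDE) is shown below.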


1 Answer

Stack Overflow user

Answered 2020-08-14 16:23:47

The URL concatenation is correct, but your team names have a leading space. I also made a few other changes and documented them in the code.
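The effect of the stray whitespace is easy to reproduce in isolation. The following is a minimal sketch, separate from the full script further down; `team_slug` and `BASE_URL` are hypothetical names, and it assumes the scraped text looks like `' Arizona Cardinals'`. Printing with `repr` makes the problem visible in the final URL.

```python
BASE_URL = "https://www.spotrac.com/nfl/rankings/"


def team_slug(raw_name):
    """Turn scraped team text into a lowercase, hyphen-separated slug."""
    # split() with no arguments drops leading/trailing whitespace and
    # collapses runs of spaces, so no separate strip() is needed.
    return "-".join(raw_name.split()).lower()


raw = " Arizona Cardinals"  # assumed shape of the scraped text

# The question's approach: the leading space turns into a leading hyphen,
# giving the slug "-arizona-cardinals".
naive = raw.replace(" ", "-").lower()

print(repr(BASE_URL + naive))
print(repr(BASE_URL + team_slug(raw)))
```

The first URL ends in `-arizona-cardinals`, which is not the slug the question intended; the second ends in `arizona-cardinals`.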

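Finally (and I used to do this too), creating an empty DataFrame and then appending to it on every iteration is apparently not the best approach. I have been told it is better to build the rows in a list of dictionaries and then call pandas once at the end to construct the DataFrame, so I changed that as well.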
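A minimal sketch of that pattern, using a few hard-coded values instead of scraped data: each iteration appends a plain dict to a list, and `pd.DataFrame` is called once after the loop. Appending to a DataFrame inside the loop copies the whole frame on every iteration, and `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the loop in the question fails on current versions. The variable names here are illustrative only.

```python
import pandas as pd

# Stand-ins for scraped (name, salary) text, including stray whitespace.
scraped = [
    ("Chandler Jones", " $21,333,333 "),
    ("Patrick Peterson", " $13,184,588 "),
    ("D.J. Humphries", " $12,800,000 "),
]

rows = []
for name, salary in scraped:
    # One dict per row; keys become the DataFrame columns.
    rows.append({"Name": name.strip(), "Salary": salary.strip()})

# Build the DataFrame once, after all rows have been collected.
df = pd.DataFrame(rows)
print(df)
```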

import requests
from bs4 import BeautifulSoup   
import pandas as pd

url ='https://www.nfl.com/teams/'

page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

team_name = []

team_name_list = soup.find_all('h4',class_='d3-o-media-object__roofline nfl-c-custom-promo__headline')
for team in team_name_list:
    if team.find('p'):
        team_name.append(team.text.strip()) #<- remove leading/trailing white space

url1 = 'https://www.spotrac.com/nfl/rankings/' #<- since this is fixed, put it before the loop
spotrac_rows = []
for team in team_name: 
        
    team = '-'.join(team.split()).lower() #<- split() also handles two spaces between city and team name

    url = url1 + team #<- url1 is already defined before the loop
    print(url)
    data = {
        'ajax': 'true',
        'mobile': 'false'
    }
    
    bs_soup = BeautifulSoup(requests.post(url, data=data).content, 'html.parser')
    
    for h3 in bs_soup.select('h3'):
        spotrac_rows.append({'Name': str(h3.text), 'Salary' : str(h3.find_next(class_="rank-value").text.strip())})  #<- remove white space from the salary
        
spotrac_df = pd.DataFrame(spotrac_rows)
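
Because the rows from every team end up in one DataFrame, it can help to record which team each row came from and to skip headings that have no salary after them. The sketch below is one possible way to do that and is not part of the original answer; `parse_team_rows`, the `Team` key, and the sample HTML are assumptions for illustration, and the real spotrac markup may differ.

```python
from bs4 import BeautifulSoup


def parse_team_rows(html, team):
    """Return one dict per <h3> that is followed by a 'rank-value' element."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for h3 in soup.select("h3"):
        # find_next searches the rest of the document after this <h3>.
        value = h3.find_next(class_="rank-value")
        if value is None:
            continue  # no salary anywhere after this heading
        rows.append({
            "Team": team,
            "Name": h3.get_text(strip=True),
            "Salary": value.get_text(strip=True),
        })
    return rows


# Hypothetical markup, only loosely modelled on a rankings page.
SAMPLE_HTML = """
<div><h3> Chandler Jones </h3><span class="rank-value"> $21,333,333 </span></div>
<div><h3>Patrick Peterson</h3><span class="rank-value">$13,184,588</span></div>
<h3>Footer heading</h3>
"""

rows = parse_team_rows(SAMPLE_HTML, "arizona-cardinals")
for row in rows:
    print(row)
```

In the loop above, the same idea would amount to calling `spotrac_rows.extend(parse_team_rows(html, team))` for each team's response.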

Output:

print(spotrac_df)
                       Name       Salary
0            Chandler Jones  $21,333,333
1          Patrick Peterson  $13,184,588
2            D.J. Humphries  $12,800,000
3           DeAndre Hopkins  $12,500,000
4          Larry Fitzgerald  $11,750,000
5              Jordan Hicks  $10,500,000
6               Justin Pugh  $10,500,000
7              Kenyan Drake   $8,483,000
8              Kyler Murray   $8,080,601
9             Robert Alford   $7,500,000
10              J.R. Sweezy   $6,500,000
11             Corey Peters   $4,437,500
12           Haason Reddick   $4,288,444
13          Jordan Phillips   $4,000,000
14           Isaiah Simmons   $3,757,101
15            Maxx Williams   $3,400,000
16            Zane Gonzalez   $3,259,000
17            Devon Kennard   $2,500,000
18              Budda Baker   $2,173,184
19       De'Vondre Campbell   $2,000,000
20                 Andy Lee   $2,000,000
21             Byron Murphy   $1,815,795
22           Christian Kirk   $1,607,691
23             Aaron Brewer   $1,168,750
24               Max Garcia   $1,143,125
25            Andy Isabella   $1,052,244
26               Mason Cole     $977,629
27               Zach Allen     $975,855
28              Chris Banjo     $887,500
29         Jonathan Bullard     $887,500
                    ...          ...
2530       Khari Blasingame     $675,000
2531         Kenneth Durden     $675,000
2532         Cody Hollister     $675,000
2533              Joey Ivie     $675,000
2534            Greg Joseph     $675,000
2535             Kareem Orr     $675,000
2536     David Quessenberry     $675,000
2537        Derick Roberson     $675,000
2538           Shaun Wilson     $675,000
2539          Cole McDonald     $635,421
2540          Chris Jackson     $629,570
2541             Kobe Smith     $614,333
2542           Aaron Brewer     $613,333
2543           Cale Garrett     $613,333
2544           Tommy Hudson     $613,333
2545     Kristian Wilkerson     $613,333
2546  Khaylan Kearse-Thomas     $612,500
2547         Nick Westbrook     $612,333
2548          Kyle Williams     $611,833
2549           Mason Kinsey     $611,666
2550          Tucker McCann     $611,666
2551       Cameron Scarlett     $611,666
2552             Teair Tart     $611,666
2553           Brandon Kemp     $611,333
2554              Wyatt Ray     $610,000
2555             Josh Smith     $610,000
2556         Logan Woodside     $610,000
2557          Rashard Davis     $610,000
2558          Avery Gennesy     $610,000
2559           Parker Hesse     $610,000

[2560 rows x 2 columns]
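
The `Salary` column holds strings such as `$21,333,333`, so sorting or summing it would treat the values as text. Below is a small sketch of converting them to integers, assuming every value is a dollar sign followed by comma-grouped digits; `parse_salary` is a hypothetical helper and not part of the answer.

```python
def parse_salary(text):
    """Convert a string like '$21,333,333' to the integer 21333333."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    if not cleaned.isdigit():
        # Fail loudly on anything that is not plain digits after cleanup.
        raise ValueError(f"unexpected salary format: {text!r}")
    return int(cleaned)


salaries = ["$21,333,333", "$977,629", "$610,000"]
amounts = [parse_salary(s) for s in salaries]
print(amounts)  # [21333333, 977629, 610000]
```

With pandas, the same function could be applied to the whole column with `spotrac_df['Salary'].map(parse_salary)`.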
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/63403040
