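Good morning. Below are the first 20 rows of my df, followed by my code.
When I try to split on '<' to strip the strong tag from the link, split only removes the character itself, and split('<')[0] raises a KeyError.
Is there any way to make this work?
The first link I want: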
http://africa.espn.com/college-sports/football/recruiting/player/_/id/222687/kayvon-thibodeaux
0
0 <a class="back" href="http://africa.espn.com/college-sports/football/recruiting/rankings">Back to Ranking Index</a>
1 <a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/222687/kayvon-thibodeaux" name=""></a>
2 <a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/222687/kayvon-thibodeaux"><strong>Kayvon Thibodeaux</strong></a>
3 <a href="http://insider.espn.com/college-sports/football/recruiting/player/evaluation/_/id/222687/kayvon-thibodeaux">Scouts Report</a>
4 <a href="http://africa.espn.com/college-sports/football/recruiting/playerrankings/_/view/rn300/sort/rank/class/2019"><img border="0" class="floatleft" src="https://a.espncdn.com/i/recruiting/logos/2012/sml/rn-300_sml.png" title="ESPN 300"/></a>
5 <a href="http://africa.espn.com/college-sports/football/recruiting/school/_/id/2483/class/2019/oregon-ducks"><img class="valign-logo" src="https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/2483.png?w=110&h=110&transparent=true" style="width: 50px"/></a>
6 <a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/226752/nolan-smith" name=""></a>
7 <a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/226752/nolan-smith"><strong>Nolan Smith</strong></a>
8 <a href="http://insider.espn.com/college-sports/football/recruiting/player/evaluation/_/id/226752/nolan-smith">Scouts Report</a>
9 <a href="http://africa.espn.com/college-sports/football/recruiting/playerrankings/_/view/rn300/sort/rank/class/2019"><img border="0" class="floatleft" src="https://a.espncdn.com/i/recruiting/logos/2012/sml/rn-300_sml.png" title="ESPN 300"/></a>
10 <a href="http://africa.espn.com/college-sports/football/recruiting/school/_/id/61/class/2019/georgia-bulldogs"><img class="valign-logo" src="https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/61.png?w=110&h=110&transparent=true" style="width: 50px"/></a>
11 <a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/216987/kenyon-green" name=""></a>
12 <a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/216987/kenyon-green"><strong>Kenyon Green</strong></a>
13 <a href="http://insider.espn.com/college-sports/football/recruiting/player/evaluation/_/id/216987/kenyon-green">Scouts Report</a>
14 <a href="http://africa.espn.com/college-sports/football/recruiting/playerrankings/_/view/rn300/sort/rank/class/2019"><img border="0" class="floatleft" src="https://a.espncdn.com/i/recruiting/logos/2012/sml/rn-300_sml.png" title="ESPN 300"/></a>
15 <a href="http://africa.espn.com/college-sports/football/recruiting/school/_/id/245/class/2019/texas-aggies"><img class="valign-logo" src="https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/245.png?w=110&h=110&transparent=true" style="width: 50px"/></a>
16 <a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/222156/evan-neal" name=""></a>
17 <a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/222156/evan-neal"><strong>Evan Neal</strong></a>
18 <a href="http://insider.espn.com/college-sports/football/recruiting/player/evaluation/_/id/222156/evan-neal">Scouts Report</a>
19 <a href="http://africa.espn.com/college-sports/football/recruiting/playerrankings/_/view/rn300/sort/rank/class/2019"><img border="0" class="floatleft" src="https://a.espncdn.com/i/recruiting/logos/2012/sml/rn-300_sml.png" title="ESPN 300"/></a>
20 <a href="http://africa.espn.com/college-sports/football/recruiting/school/_/id/333/class/2019/alabama-crimson-tide"><img class="valign-logo" src="https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/333.png?w=110&h=110&transparent=true" style="width: 50px"/></a>
#players.to_excel('Player_Links.xlsx')
players = pd.read_excel('Player_Links.xlsx')
players['Links'] = players.iloc[:,1]
players = players[players['Links'].str.contains('strong')]
players['Links'] = players['Links'].str.replace('<a href="','')
players['Links'] = players['Links'].str.split('<')
print(players)
Answered 2021-12-27 15:14:29
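A likely cause of the KeyError, sketched under the assumption that the filtered frame no longer contains a row labelled 0: `.str.split('<')` returns a Series of lists, so indexing that Series with `[0]` is a lookup by index label, not "the first piece of each list". The `.str[0]` accessor picks the first element of every row instead. The two-row frame and the example.com URLs below are made up for illustration.

```python
import pandas as pd

# Hypothetical two-row sample standing in for the 'Links' column.
players = pd.DataFrame({
    "Links": [
        '<a href="http://example.com/back">Back</a>',
        '<a href="http://example.com/player/1"><strong>Player One</strong></a>',
    ]
})

# Keep only rows with a <strong> tag; the surviving row keeps index label 1.
players = players[players["Links"].str.contains("<strong>")].copy()

# regex=False makes the literal-string replacement explicit.
stripped = players["Links"].str.replace('<a href="', "", regex=False)

parts = stripped.str.split("<")  # Series of lists, one list per row
# parts[0] would raise KeyError here: no row is labelled 0 after filtering.

# .str[0] takes the first list element of every row, then the trailing
# '">' left over from the opening tag is stripped.
players["Links"] = parts.str[0].str.rstrip('">')
print(players["Links"].tolist())
```

This keeps the original split-based idea; parsing the HTML, as described next, is usually less fragile than string surgery.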
Filter the DataFrame to keep only the rows that contain a <strong> tag. Then use BeautifulSoup to parse the HTML, applying it to each row with a lambda function:
from bs4 import BeautifulSoup
import pandas as pd
df = pd.DataFrame( [
['<a class="back" href="http://africa.espn.com/college-sports/football/recruiting/rankings">Back to Ranking Index</a>'],
['<a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/222687/kayvon-thibodeaux" name=""></a>'],
['<a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/222687/kayvon-thibodeaux"><strong>Kayvon Thibodeaux</strong></a>'],
['<a href="http://insider.espn.com/college-sports/football/recruiting/player/evaluation/_/id/222687/kayvon-thibodeaux">Scouts Report</a>'],
['<a href="http://africa.espn.com/college-sports/football/recruiting/playerrankings/_/view/rn300/sort/rank/class/2019"><img border="0" class="floatleft" src="https://a.espncdn.com/i/recruiting/logos/2012/sml/rn-300_sml.png" title="ESPN 300"/></a>'],
['<a href="http://africa.espn.com/college-sports/football/recruiting/school/_/id/2483/class/2019/oregon-ducks"><img class="valign-logo" src="https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/2483.png?w=110&h=110&transparent=true" style="width: 50px"/></a>'],
['<a href="http://africa.espn.com/college-sports/football/recruiting/player/_/id/226752/nolan-smith" name=""></a>']],
columns=[0])
# .copy() avoids a SettingWithCopyWarning when assigning to the filtered frame
df_filter = df[df[0].str.contains('<strong>')].copy()
df_filter[0] = df_filter[0].apply(lambda row: BeautifulSoup(row, 'html.parser').find('a')['href'])
Output for the sample set used above:
print(df_filter.to_string())
0
2 http://africa.espn.com/college-sports/football/recruiting/player/_/id/222687/kayvon-thibodeaux
Answered 2021-12-27 15:34:42
You can also do the whole thing with a regular expression:
import pandas as pd
players = pd.read_excel('Player_Links.xlsx')
players['Links'] = players.iloc[:,1]
regex = r"(http:.*)\">.*<strong>"
players = players.Links.str.findall(regex)
# only keep the rows for which the regex hit
players = players[players.apply(lambda li: len(li) == 1)]
# flatten the list
players = players.apply(lambda li: li[0])
print(players)
https://stackoverflow.com/questions/70497103
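A more compact variant of the regex approach, offered as a sketch rather than as part of the original answer: `Series.str.extract` returns the first capture group directly and yields NaN where nothing matches, so the "keep hits" and "flatten" steps collapse into a single `dropna()`. The pattern and the sample rows (example.com URLs) are illustrative assumptions.

```python
import pandas as pd

# Hypothetical sample rows mimicking the three kinds of anchors in the question.
links = pd.Series([
    '<a href="http://example.com/player/1" name=""></a>',
    '<a href="http://example.com/player/1"><strong>Player One</strong></a>',
    '<a href="http://example.com/report/1">Scouts Report</a>',
])

# Capture the href value only when <strong> directly follows the opening tag.
pattern = r'href="([^"]+)"><strong>'

# expand=False returns a Series; rows without a match become NaN and are dropped.
urls = links.str.extract(pattern, expand=False).dropna()
print(urls.tolist())
```

Using `[^"]+` instead of `.*` keeps the capture from running past the closing quote of the href attribute.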