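I'm trying to scrape a team's record (3-6-2) and the year from this page: https://www.pro-football-reference.com/teams/pit/1933.htm.
I tried using SelectorGadget to extract the right XPath or class, but nothing I tried works. The closest I've gotten to the record is: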
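library(rvest)
library(curl)
read_html(
  curl("https://www.pro-football-reference.com/teams/pit/1933.htm",
       handle = curl::new_handle("useragent" = "Mozilla/5.0"))) %>%
  html_element(xpath='//*[@id="meta"]/div[2]/p[1]/strong') %>%
  html_text()
I'd like the output to be a data frame. Any clarity on how to reach this element with SelectorGadget would help, since I'm trying to learn how to extract other elements from this and similar pages. Thanks!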
Posted 2022-02-15 20:44:33
If all you need is to look at the tables, rvest's html_table function does what you want.
html_table(read_html("https://www.pro-football-reference.com/teams/pit/1933.htm"))[[1]]
# A tibble: 5 x 23
`` `` `` `Tot Yds & TO` `Tot Yds & TO` `Tot Yds & TO` `` `` Passing Passing
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Player PF Yds "Ply" "Y/P" TO FL "1st~ "Cmp" Att
2 Team ~ 67 1943 "534" "3.6" 40 0 "" "60" 196
3 Opp. ~ 208 2735 "583" "4.7" 19 0 "" "57" 142
4 Lg Ra~ 8 8 "" "" 9 1 "1" "" 1
5 Lg Ra~ 10 9 "" "" 9 1 "1" "" 2
# ... with 13 more variables: Passing <chr>, Passing <chr>, Passing <chr>, Passing <chr>,
# Passing <chr>, Rushing <chr>, Rushing <chr>, Rushing <chr>, Rushing <chr>,
# Rushing <chr>, Penalties <chr>, Penalties <chr>, Penalties <chr>
[[2]]
# A tibble: 12 x 22
`` `` `` `` `` `` `` `` `` `` Score Score Offense Offense
<chr> <chr> <chr> <lgl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Week Day Date NA "" "" "OT" Rec "" Opp Tm Opp "1stD" "TotYd"
2 1 Wed Septemb~ NA "box~ "L" "" 0-1 "" New ~ 2 23 "" ""
3 2 Wed Septemb~ NA "box~ "W" "" 1-1 "" Chic~ 14 13 "" ""
4 3 Wed October~ NA "box~ "L" "" 1-2 "" Bost~ 6 21 "" ""
5 4 Wed October~ NA "box~ "W" "" 2-2 "" Cinc~ 17 3 "" ""
6 5 Sun October~ NA "box~ "L" "" 2-3 "@" Gree~ 0 47 "" ""
7 6 Sun October~ NA "box~ "T" "" 2-3-1 "@" Cinc~ 0 0 "" ""
8 7 Sun October~ NA "box~ "W" "" 3-3-1 "@" Bost~ 16 14 "" ""
9 8 Sun Novembe~ NA "box~ "T" "" 3-3-2 "@" Broo~ 3 3 "" ""
10 9 Sun Novembe~ NA "box~ "L" "" 3-4-2 "" Broo~ 0 32 "" ""
11 10 Sun Novembe~ NA "box~ "L" "" 3-5-2 "@" Phil~ 6 25 "" ""
12 12 Sun Decembe~ NA "box~ "L" "" 3-6-2 "@" New ~ 3 27 "" ""
# ... with 8 more variables: Offense <chr>, Offense <chr>, Offense <chr>, Defense <chr>,
# Defense <chr>, Defense <chr>, Defense <chr>, Defense <chr>
You can then index and filter to get the value you want.
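As a rough sketch of that indexing and filtering step, the snippet below works on a small hand-built data frame rather than the live page. The data frame `raw` and its column names `X1` to `X3` are hypothetical stand-ins. They only mimic the shape of the second table printed above, where the first row carries the real column labels such as Rec.

```r
# Hypothetical miniature of the second table returned by html_table():
# the first row holds the real column labels, as in the printed output above.
raw <- data.frame(
  X1 = c("Week", "1", "10", "12"),
  X2 = c("Rec", "0-1", "3-5-2", "3-6-2"),
  X3 = c("Tm", "2", "6", "3"),
  stringsAsFactors = FALSE
)

# Promote the first row to column names, then drop that row.
schedule <- raw[-1, ]
names(schedule) <- unlist(raw[1, ], use.names = FALSE)

# Keep rows with a non-empty record and take the last one.
played <- schedule[schedule$Rec != "", ]
final_record <- played$Rec[nrow(played)]
print(final_record)
```

On the real table the same idea applies, although duplicate or empty column names may need to be cleaned up first.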
If you'd rather avoid parsing the tables, you can use this directly:
read_html("https://www.pro-football-reference.com/teams/pit/1933.htm") %>%
html_elements(xpath = "//td[@data-stat='team_record']") %>%
html_text()
This extracts every value from that column; you can then take the last one.
[1] "0-1" "1-1" "1-2" "2-2" "2-3" "2-3-1" "3-3-1" "3-3-2" "3-4-2" "3-5-2" "3-6-2"
https://stackoverflow.com/questions/71132948
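Since the question asks for a data frame with the record and the year, here is one possible way to finish the job. This is a sketch under assumptions, not a tested scraper: the inline HTML is a hypothetical stand-in for the real page, built with rvest's `minimal_html()` so the example runs offline. The year is taken from the URL on the assumption that it always ends in `/<year>.htm`.

```r
library(rvest)

# Hypothetical stand-in for the real page: a tiny table whose cells carry the
# same data-stat="team_record" attribute targeted by the XPath above.
page <- minimal_html('
  <table>
    <tr><td data-stat="team_record">0-1</td></tr>
    <tr><td data-stat="team_record">3-5-2</td></tr>
    <tr><td data-stat="team_record">3-6-2</td></tr>
  </table>')

url <- "https://www.pro-football-reference.com/teams/pit/1933.htm"

records <- page %>%
  html_elements(xpath = "//td[@data-stat='team_record']") %>%
  html_text()

# The last cell holds the record after the final game.
final_record <- tail(records, 1)

# Assumption: the URL ends in /<four-digit year>.htm.
year <- as.integer(sub(".*/([0-9]{4})\\.htm$", "\\1", url))

result <- data.frame(year = year, record = final_record)
print(result)
```

To run it against the live site, replace `page` with the `read_html()` call from the answer.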