I am trying to scrape data from a site (the-numbers.com) where the data is split across many web pages. The sequential pages follow this format (only the first three are shown below):
url0 <- "https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time"
url1 <- "https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/101"
url2 <- "https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/201"
To scrape the first sequential URL (url0) into a df, this code returns the correct output:
library(rvest)
webpage <- read_html("https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time")
tbls <- html_nodes(webpage, "table")
head(tbls)
tbls_ls <- webpage %>%
  html_nodes("table") %>%
  .[1] %>%
  html_table(fill = TRUE)
df <- tbls_ls[[1]]
The output of which looks like:
> head(df)
Rank Released Movie DomesticBox Office
1 1 2015 Star Wars Ep. VII: The Force Awakens $936,662,225
2 2 2009 Avatar $760,507,625
3 3 2018 Black Panther $700,059,566
How can I automatically scrape the subsequent URLs until we reach the end of the data, so that the output is one long df, rowbind()-ed together?
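For reference, the page offsets in the question follow a simple arithmetic pattern (base page, then /101, /201, ...), so the URL list can be generated programmatically. A minimal sketch; make_urls is a hypothetical helper and assumes the total page count is known in advance:

```r
# Hypothetical helper: build the sequential page URLs from the pattern above.
# Assumes n_pages is known in advance (the answer below avoids this assumption).
make_urls <- function(base, n_pages) {
  offsets <- seq_len(n_pages - 1) * 100 + 1  # 101, 201, ...
  c(base, paste0(base, "/", offsets))
}

urls <- make_urls(
  "https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time",
  n_pages = 3
)
```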
Posted on 2021-08-06 07:18:51
This question was asked almost three years ago; nevertheless, here is a solution.
First, it is always a good idea to check whether scraping the site is allowed. In R, we can use the robotstxt package:
robotstxt::paths_allowed("https://www.the-numbers.com")
www.the-numbers.com
[1] TRUE
Great, we are good to go. In addition, I would like to echo @hrbrmstr's point about donating (even a minimal amount) to support what the people behind this site (or any similar site) are doing.
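As background (not part of the original answer): the function below relies on R's repeat construct, which runs its body unconditionally until a break is reached, so the exit condition is checked at the end of each pass, like a do-while loop in other languages. A minimal illustration:

```r
# Minimal repeat/break illustration: the body always runs at least once,
# and the loop exits only when the if condition triggers break.
i <- 1
pages <- character()
repeat {
  pages <- c(pages, paste0("page ", i))
  i <- i + 1
  if (i > 3) break
}
# pages is now c("page 1", "page 2", "page 3")
```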
The scraping function I define below uses R's repeat/if structure (similar to a do-while loop in other programming languages). Also, since the number of pages to scrape is unknown up front, the function has a page_count argument that defaults to Inf, which scrapes every page on the site; if you only want, say, 10 pages, set page_count = 10. Here is the function definition:
# Load packages ----
pacman::p_load(
  rvest,
  glue,
  stringr,
  dplyr,
  cli
)

# Custom function ----
scrape_data <- function(url, page_count = Inf) {
  i <- 1
  data_list <- list()
  repeat {
    html <- read_html(url)
    # Scrape the table on the current page
    data_list[[i]] <- html %>%
      html_element(css = "table") %>%
      html_table()
    # Text of the currently active pagination link
    current_page <- html %>%
      html_element(css = "div.pagination > a.active") %>%
      html_text() %>%
      str_remove_all(pattern = "\\,")
    # Ranges shown in the pagination bar, e.g. "1-100", "101-200", ...
    all_displayed_pages <- html %>%
      html_elements(css = "div.pagination > a") %>%
      html_text() %>%
      str_remove_all(pattern = "\\,") %>%
      str_extract(pattern = "\\d+\\-\\d+")
    # Relative URLs behind the pagination links
    all_pages_urls <- html %>%
      html_elements(css = "div.pagination > a") %>%
      html_attr(name = "href")
    # Build the URL of the next page
    url <- glue("https://www.the-numbers.com{all_pages_urls[which(current_page == all_displayed_pages) + 1]}")
    cli_alert_success(glue("Scraped page: {i}"))
    i <- i + 1
    # Stop on the last displayed page, or once page_count pages are scraped
    if (
      current_page == all_displayed_pages[length(all_displayed_pages)] |
      i - 1 == page_count
    ) {
      break
    }
  }
  bind_rows(data_list)
}
Now, let's use this function to scrape the first 5 pages of the table:
scrape_data(
url = "https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time",
page_count = 5
)
√ Scraped page: 1
√ Scraped page: 2
√ Scraped page: 3
√ Scraped page: 4
√ Scraped page: 5
# A tibble: 500 x 7
Rank Year Movie Distributor `DomesticBox Of~ `InternationalB~ `WorldwideBox O~
<int> <int> <chr> <chr> <chr> <chr> <chr>
1 1 2015 Star Wars Ep. VII: The Force Awakens Walt Disney $936,662,225 $1,127,953,592 $2,064,615,817
2 2 2019 Avengers: Endgame Walt Disney $858,373,000 $1,939,427,564 $2,797,800,564
3 3 2009 Avatar 20th Cent… $760,507,625 $2,085,391,916 $2,845,899,541
4 4 2018 Black Panther Walt Disney $700,059,566 $636,434,755 $1,336,494,321
5 5 2018 Avengers: Infinity War Walt Disney $678,815,482 $1,365,725,041 $2,044,540,523
6 6 1997 Titanic Paramount… $659,363,944 $1,548,622,601 $2,207,986,545
7 7 2015 Jurassic World Universal $652,306,625 $1,017,673,342 $1,669,979,967
8 8 2012 The Avengers Walt Disney $623,357,910 $891,742,301 $1,515,100,211
9 9 2017 Star Wars Ep. VIII: The Last Jedi Walt Disney $620,181,382 $711,453,759 $1,331,635,141
10 10 2018 Incredibles 2 Walt Disney $608,581,744 $634,223,615 $1,242,805,359
# ... with 490 more rows
One possible improvement to the function would be to add a short timeout with Sys.sleep(3) (3 seconds) between requests, in case the server kicks you off the site for hitting it too many times too quickly.
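A sketch of how that pause could be wired in; the helper name, delay value, and placement are assumptions, not part of the original answer:

```r
# Sketch: throttle requests by pausing before each fetch.
# The 3-second default is an assumption; tune it to the site's tolerance.
polite_pause <- function(delay = 3) {
  Sys.sleep(delay)  # block for `delay` seconds before hitting the server again
}

# In scrape_data()'s repeat block, call polite_pause() just before
# `html <- read_html(url)` so every request is separated by the delay.
```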
https://stackoverflow.com/questions/52897280