I am trying to scrape data from a site (the-numbers.com) where the data is split across many web pages. The sequential pages follow this format (only the first three are shown below):
url0 <- "https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time"
url1 <- "https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/101"
url2 <- "https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/201"
To scrape the first sequential URL (url0) into a df, this code returns the correct output:
library(rvest)
webpage <- read_html("https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time")
tbls <- html_nodes(webpage, "table")
head(tbls)
tbls_ls <- webpage %>%
  html_nodes("table") %>%
  .[1] %>%
  html_table(fill = TRUE)
df <- tbls_ls[[1]]
The output of which looks like:
> head(df)
Rank Released Movie DomesticBox Office
1 1 2015 Star Wars Ep. VII: The Force Awakens $936,662,225
2 2 2009 Avatar $760,507,625
3 3 2018 Black Panther $700,059,566
How can I automatically scrape the subsequent URLs until we reach the end of the data, so that the output is one long df, rowbind()-ed together?
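For reference, the page offsets in the question follow a simple arithmetic pattern (base page, then /101, /201, ...), so the URL list can be generated programmatically. A minimal sketch; make_urls is a hypothetical helper and assumes the total page count is known in advance:

```r
# Hypothetical helper: build the sequential page URLs from the pattern above.
# Assumes n_pages is known in advance (the answer below avoids this assumption).
make_urls <- function(base, n_pages) {
  offsets <- seq_len(n_pages - 1) * 100 + 1  # 101, 201, ...
  c(base, paste0(base, "/", offsets))
}

urls <- make_urls(
  "https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time",
  n_pages = 3
)
```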
Posted on 2021-08-06 07:18:51
This question was asked almost three years ago; nevertheless, here is a solution.
First, it is always a good idea to check whether scraping the site is allowed. In R, we can use the robotstxt package:
robotstxt::paths_allowed("https://www.the-numbers.com")
www.the-numbers.com
[1] TRUE
Great, we are good to go. In addition, I would like to echo @hrbrmstr's point about donating (even a minimal amount) to support what the people behind this site (or any similar site) are doing.
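As background (not part of the original answer): the function below relies on R's repeat construct, which runs its body unconditionally until a break is reached, so the exit condition is checked at the end of each pass, like a do-while loop in other languages. A minimal illustration:

```r
# Minimal repeat/break illustration: the body always runs at least once,
# and the loop exits only when the if condition triggers break.
i <- 1
pages <- character()
repeat {
  pages <- c(pages, paste0("page ", i))
  i <- i + 1
  if (i > 3) break
}
# pages is now c("page 1", "page 2", "page 3")
```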
The scraping function I define below uses R's repeat/if structure (similar to a do-while loop in other programming languages). Also, since the number of pages to scrape is unknown up front, the function has a page_count argument that defaults to Inf, which scrapes every page on the site; if you only want, say, 10 pages, set page_count = 10. Here is the function definition:
# Load packages ----
pacman::p_load(
  rvest,
  glue,
  stringr,
  dplyr,
  cli
)

# Custom function ----
scrape_data <- function(url, page_count = Inf) {
  i <- 1
  data_list <- list()
  repeat {
    html <- read_html(url)
    # Scrape the table on the current page
    data_list[[i]] <- html %>%
      html_element(css = "table") %>%
      html_table()
    # Text of the currently active pagination link
    current_page <- html %>%
      html_element(css = "div.pagination > a.active") %>%
      html_text() %>%
      str_remove_all(pattern = "\\,")
    # Ranges shown in the pagination bar, e.g. "1-100", "101-200", ...
    all_displayed_pages <- html %>%
      html_elements(css = "div.pagination > a") %>%
      html_text() %>%
      str_remove_all(pattern = "\\,") %>%
      str_extract(pattern = "\\d+\\-\\d+")
    # Relative URLs behind the pagination links
    all_pages_urls <- html %>%
      html_elements(css = "div.pagination > a") %>%
      html_attr(name = "href")
    # Build the URL of the next page
    url <- glue("https://www.the-numbers.com{all_pages_urls[which(current_page == all_displayed_pages) + 1]}")
    cli_alert_success(glue("Scraped page: {i}"))
    i <- i + 1
    # Stop on the last displayed page, or once page_count pages are scraped
    if (
      current_page == all_displayed_pages[length(all_displayed_pages)] |
      i - 1 == page_count
    ) {
      break
    }
  }
  bind_rows(data_list)
}
Now, let's use this function to scrape the first 5 pages of the table:
scrape_data(
url = "https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time",
page_count = 5
)
√ Scraped page: 1
√ Scraped page: 2
√ Scraped page: 3
√ Scraped page: 4
√ Scraped page: 5
# A tibble: 500 x 7
Rank Year Movie Distributor `DomesticBox Of~ `InternationalB~ `WorldwideBox O~
<int> <int> <chr> <chr> <chr> <chr> <chr>
1 1 2015 Star Wars Ep. VII: The Force Awakens Walt Disney $936,662,225 $1,127,953,592 $2,064,615,817
2 2 2019 Avengers: Endgame Walt Disney $858,373,000 $1,939,427,564 $2,797,800,564
3 3 2009 Avatar 20th Cent… $760,507,625 $2,085,391,916 $2,845,899,541
4 4 2018 Black Panther Walt Disney $700,059,566 $636,434,755 $1,336,494,321
5 5 2018 Avengers: Infinity War Walt Disney $678,815,482 $1,365,725,041 $2,044,540,523
6 6 1997 Titanic Paramount… $659,363,944 $1,548,622,601 $2,207,986,545
7 7 2015 Jurassic World Universal $652,306,625 $1,017,673,342 $1,669,979,967
8 8 2012 The Avengers Walt Disney $623,357,910 $891,742,301 $1,515,100,211
9 9 2017 Star Wars Ep. VIII: The Last Jedi Walt Disney $620,181,382 $711,453,759 $1,331,635,141
10 10 2018 Incredibles 2 Walt Disney $608,581,744 $634,223,615 $1,242,805,359
# ... with 490 more rows
One possible improvement to the function would be to add a short timeout with Sys.sleep(3) (3 seconds) between requests, in case the server kicks you off the site for hitting it too many times too quickly.
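A sketch of how that pause could be wired in; the helper name, delay value, and placement are assumptions, not part of the original answer:

```r
# Sketch: throttle requests by pausing before each fetch.
# The 3-second default is an assumption; tune it to the site's tolerance.
polite_pause <- function(delay = 3) {
  Sys.sleep(delay)  # block for `delay` seconds before hitting the server again
}

# In scrape_data()'s repeat block, call polite_pause() just before
# `html <- read_html(url)` so every request is separated by the delay.
```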
https://stackoverflow.com/questions/52897280