首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >从urls列表中获取多个HTML页面

从urls列表中获取多个HTML页面
EN

Stack Overflow用户
提问于 2015-03-06 19:55:13
回答 1查看 3.3K关注 0票数 4

我有一个像这样的数据文件:

代码语言:javascript
复制
country <- c("Canada", "US", "Japan", "China")
url <- c("http://en.wikipedia.org/wiki/United_States", "http://en.wikipedia.org/wiki/Canada",
          "http://en.wikipedia.org/wiki/Japan", "http://en.wikipedia.org/wiki/China")
df <- data.frame(country, url)

    country link
1   Canada  http://en.wikipedia.org/wiki/United_States
2   US      http://en.wikipedia.org/wiki/Canada
3   Japan   http://en.wikipedia.org/wiki/Japan
4   China   http://en.wikipedia.org/wiki/China

使用rvest,我希望为每个url抓取目录,并将它们绑定到一个输出。

此代码提取一个url的目录:

代码语言:javascript
复制
library(rvest)
toc <- html(url) %>%
  html_nodes(".toctext") %>%
  html_text()

期望产出:

代码语言:javascript
复制
country toc
US      Etymology
        History
        Native American and European contact
        Settlements
        ...  
Canada  Etymology
        History
        Aboriginal peoples
        European colonization
        ...etc
EN

回答 1

Stack Overflow用户

回答已采纳

发布于 2015-03-06 20:05:26

这将把它们刮到一个完整的数据帧中(每个TOC条目有一行)。留给OP的冗长但简单的“打印/输出”代码:

代码语言:javascript
复制
library(rvest)
library(dplyr)

country <- c("Canada", "US", "Japan", "China")
url <- c("http://en.wikipedia.org/wiki/United_States", 
         "http://en.wikipedia.org/wiki/Canada",
         "http://en.wikipedia.org/wiki/Japan", 
         "http://en.wikipedia.org/wiki/China")
df <- data.frame(country, url)

bind_rows(lapply(url, function(x) {

  data.frame(url=x, toc_entry=toc <- html(url[1]) %>%
    html_nodes(".toctext") %>%
    html_text())

})) -> toc_entries

df <- toc_entries %>% left_join(df)

df[sample(nrow(df), 10),]

## Source: local data frame [10 x 3]
## 
##                                           url                            toc_entry country
## 1          http://en.wikipedia.org/wiki/Japan                   Government finance   Japan
## 2         http://en.wikipedia.org/wiki/Canada        Cold War and civil rights era      US
## 3  http://en.wikipedia.org/wiki/United_States                                 Food  Canada
## 4          http://en.wikipedia.org/wiki/Japan                               Sports   Japan
## 5         http://en.wikipedia.org/wiki/Canada                             Religion      US
## 6          http://en.wikipedia.org/wiki/China        Cold War and civil rights era   China
## 7          http://en.wikipedia.org/wiki/Japan Literature, philosophy, and the arts   Japan
## 8  http://en.wikipedia.org/wiki/United_States                           Population  Canada
## 9          http://en.wikipedia.org/wiki/Japan                          Settlements   Japan
## 10        http://en.wikipedia.org/wiki/Canada                             Military      US
票数 5
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/28906601

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档