I am trying to get the speeches linked under each title on the page "https://www.federalreserve.gov/newsevents/speeches.htm".
For example, the first title on the page is "Spontaneity and Order: Transparency, Accountability, and Fairness in Bank Supervision"; clicking it leads to the corresponding speech.
Could anyone let me know how to download all of these speeches, along with their titles and dates, using Rcrawler?
Thanks, Jaraj
Posted on 2020-01-23 12:23:30
This is a lot to ask in a single question, but it's an interesting one, so I decided to tackle it anyway. Here's what that led to.
Tidyverse/rvest version
First, I'll build this scraper with the Tidyverse, because that's what I'm familiar with for web scraping. So we'll start by loading the required packages.
library(tidyverse)
library(rvest)

One challenging aspect of this problem is that no single page contains links to all of the speeches. However, if we follow the links from the main page, we find that there is a set of links to pages listing all of the speeches for a given year. To be clear, I didn't see those links on the main page itself. Instead, I discovered them by scraping the main page; using html_nodes("a") to look at nodes of type "a", because an inspection in Chrome told me that's where the relevant links were; using html_attr("href") to pull the urls out of those results; and then eyeballing the results in the console to see what looked useful. Among those results, I saw links of the forms "/newsevents/speech/2020-speeches.htm" and "/newsevents/speech/2007speech.htm", and when I ran the same process on those links, I saw links to the individual speeches. So:
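Before running the real scrape, here's how that two-pattern filtering trick behaves in isolation. This is a minimal sketch using a few made-up hrefs (no network access needed); the real code below applies the same idiom to the scraped ones:

```r
library(stringr)
library(purrr)

# made-up hrefs illustrating the two annual-archive naming styles,
# plus one irrelevant link that should be filtered out
hrefs <- c("/newsevents/speech/2020-speeches.htm",   # newer naming style
           "/newsevents/speech/2010speech.htm",      # older naming style
           "/newsevents/pressreleases.htm")          # irrelevant link
patterns <- c("/newsevents/speech/[0-9]{4}-speeches.htm",
              "/newsevents/speech/[0-9]{4}speech.htm")

# keep hrefs matching either pattern, then merge the two result sets
matched <- reduce(map(patterns, str_subset, string = hrefs), union)
# matched now contains only the two archive-style links
```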
# scrape the main page
base_page <- read_html("https://www.federalreserve.gov/newsevents/speeches.htm")
# extract links to those annual archives from the resulting html
year_links <- base_page %>%
html_nodes("a") %>%
html_attr("href") %>%
# the pattern for those annual pages changes, so we can use this approach to get both types
map(c("/newsevents/speech/[0-9]{4}-speeches.htm", "/newsevents/speech/[0-9]{4}speech.htm"), str_subset, string = .) %>%
reduce(union)
# here's what that produces
> year_links
[1] "/newsevents/speech/2020-speeches.htm" "/newsevents/speech/2019-speeches.htm" "/newsevents/speech/2018-speeches.htm" "/newsevents/speech/2017-speeches.htm"
[5] "/newsevents/speech/2016-speeches.htm" "/newsevents/speech/2015-speeches.htm" "/newsevents/speech/2014-speeches.htm" "/newsevents/speech/2013-speeches.htm"
[9] "/newsevents/speech/2012-speeches.htm" "/newsevents/speech/2011-speeches.htm" "/newsevents/speech/2010speech.htm" "/newsevents/speech/2009speech.htm"
[13] "/newsevents/speech/2008speech.htm" "/newsevents/speech/2007speech.htm" "/newsevents/speech/2006speech.htm"

Okay, now we want to scrape those annual pages for links to the pages for the individual speeches, using map to iterate that process over the individual links.
speech_links <- map(year_links, function(x) {
# the scraped links are incomplete, so we'll start by adding the missing bit
full_url <- paste0("https://www.federalreserve.gov", x)
# now we'll essentially rerun the process we ran on the main page, only now we can
# focus on a single string pattern, which again I found by trial and error (i.e.,
# scrape the page, look at the hrefs on it, see which ones look relevant, check
# one out in my browser to confirm, then use str_subset() to get ones matching that pattern
speech_urls <- read_html(full_url) %>%
html_nodes("a") %>%
html_attr("href") %>%
str_subset(., "/newsevents/speech/")
# add the header now
return(paste0("https://www.federalreserve.gov", speech_urls))
})
# unlist the results so we have one long vector of links to speeches instead of a list
# of vectors of links
speech_links <- unlist(speech_links)
# here's what the results of that process look like
> head(speech_links)
[1] "https://www.federalreserve.gov/newsevents/speech/quarles20200117a.htm" "https://www.federalreserve.gov/newsevents/speech/bowman20200116a.htm"
[3] "https://www.federalreserve.gov/newsevents/speech/clarida20200109a.htm" "https://www.federalreserve.gov/newsevents/speech/brainard20200108a.htm"
[5] "https://www.federalreserve.gov/newsevents/speech/brainard20191218a.htm" "https://www.federalreserve.gov/newsevents/speech/brainard20191126a.htm"

Now, finally, we'll iterate a scraping process over the pages for the individual speeches to extract the key elements of each one: the date, title, speaker, location, and full text. I found the node type for each desired element by opening one of those pages in my Chrome browser, right-clicking (I'm on a Windows machine), and using "Inspect" to look at the html associated with each bit.
speech_list <- map(speech_links, function(x) {
Z <- read_html(x)
# scrape the date and convert it to 'date' class while we're at it
date <- Z %>% html_nodes("p.article__time") %>% html_text() %>% as.Date(., format = "%B %d, %Y")
title <- Z %>% html_nodes("h3.title") %>% html_text()
speaker <- Z %>% html_nodes("p.speaker") %>% html_text()
location <- Z %>% html_nodes("p.location") %>% html_text()
# this one's a little more involved because the text at that node had two elements,
# of which we only wanted the second, and I went ahead and cleaned up the speech
# text a bit here to make the resulting column easy to work with later
text <- Z %>%
html_nodes("div.col-xs-12.col-sm-8.col-md-8") %>%
html_text() %>%
.[2] %>%
str_replace_all(., "\n", "") %>%
str_trim(., side = "both")
return(tibble(date, title, speaker, location, text))
})
# finally, bind the one-row elements of that list into a single tibble
speech_table <- bind_rows(speech_list)

Here's a glimpse of the resulting table of the 804 speeches the Fed has posted from 2006 to the present:
> str(speech_table)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 804 obs. of 5 variables:
$ date : Date, format: "2020-01-17" "2020-01-16" "2020-01-09" "2020-01-08" ...
$ title : chr "Spontaneity and Order: Transparency, Accountability, and Fairness in Bank Supervision" "The Outlook for Housing" "U.S. Economic Outlook and Monetary Policy" "Strengthening the Community Reinvestment Act by Staying True to Its Core Purpose" ...
$ speaker : chr "Vice Chair for Supervision Randal K. Quarles" "Governor Michelle W. Bowman" "Vice Chair Richard H. Clarida" "Governor Lael Brainard" ...
$ location: chr "At the American Bar Association Banking Law Committee Meeting 2020, Washington, D.C." "At the 2020 Economic Forecast Breakfast, Home Builders Association of Greater Kansas City, Kansas City, Missouri" "At the C. Peter McColough Series on International Economics, Council on Foreign Relations, New York, New York" "At the Urban Institute, Washington, D.C." ...
$ text : chr "It's a great pleasure to be with you today at the ABA Banking Law Committee's annual meeting. I left the practi"| __truncated__ "Few sectors are as central to the success of our economy and the lives of American families as housing. If we i"| __truncated__ "Thank you for the opportunity to join you bright and early on this January 2020 Thursday morning. As some of yo"| __truncated__ "Good morning. I am pleased to be here at the Urban Institute to discuss how to strengthen the Community Reinves"| __truncated__ ...

Rcrawler version
Now, you asked specifically about how to do this with the Rcrawler package rather than rvest, so here's a solution using the former.
First, we'll use Rcrawler's LinkExtractor function with a regular expression to scrape the urls of the pages that link to the speeches by year. Note that I only knew what to look for in the regex because I had already dug through the html for the rvest solution.
library(Rcrawler)
year_links = LinkExtractor("https://www.federalreserve.gov/newsevents/speeches.htm",
                           urlregexfilter = "https://www.federalreserve.gov/newsevents/speech/")

Now we can use lapply to iterate LinkExtractor over the results of that process, scraping the batch of links to the individual speeches for each year. Again, we only know what pattern to use in the regex because we already looked at the results of the previous step and checked some of those pages in a browser.
speech_links <- lapply(year_links$InternalLinks, function(i) {
linkset <- LinkExtractor(i, urlregexfilter = "speech/[a-z]{1,}[0-9]{8}a.htm")
# might as well limit the results to the vector of interest while we're here
return(linkset$InternalLinks)
})
# that process returns a list of vectors, so let's collapse that list into one
# long vector of urls for pages with individual speeches
speech_links <- unlist(speech_links)

Finally, we can apply the ContentScraper function to that vector of links to the individual speeches to extract the data. An inspection of the html for one of those pages revealed the CSS patterns associated with the bits of interest, so we'll use CssPatterns to grab those bits and PatternsName to give them names. That call returns a list of lists, so we'll finish by using do.call(rbind.data.frame, ...) with stringsAsFactors = FALSE to convert that list of lists into a single data frame without converting everything to factors.
DATA <- ContentScraper(Url = speech_links,
CssPatterns = c(".article__time", ".location", ".speaker", ".title", ".col-xs-12.col-sm-8.col-md-8"),
PatternsName = c("date", "location", "speaker", "title", "text"),
# we need this next line to get both elements for the .col-xs-12.col-sm-8.col-md-8
# bit, which is the text of the speech itself. the first element
# is just a repeat of the header info
ManyPerPattern = TRUE)
# because the text element is a vector of two strings, we'll want to flatten the
# results into a one-row data frame to make the final concatenation easier. this
# gives us a row with two cols for text, text1 and text2, where text2 is the part
# you really want
DATA2 <- lapply(DATA, function(i) { data.frame(as.list(unlist(i)), stringsAsFactors = FALSE) })
# finally, collapse those one-row data frames into one big data frame, one row per speech
output <- do.call(rbind.data.frame, c(DATA2, stringsAsFactors = FALSE))

Three things to note here: 1) this table has only 779 rows, compared with the 804 we got with rvest, and I'm not sure why there's a discrepancy; 2) the data in this table are still raw and will need some cleaning up (e.g., converting the date strings to the Date class, tidying the strings in the text columns), which could be done with sapply; and 3) you'll probably want to drop the redundant text1 column, which can be done in base R with output$text1 <- NULL.
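As a sketch of that cleanup, assuming the column layout just described (the one-row data frame here is a made-up stand-in, not real scraped output):

```r
# made-up one-row stand-in with the same columns the Rcrawler pipeline produces
output <- data.frame(date     = "January 17, 2020",
                     location = "At the American Bar Association Banking Law Committee Meeting 2020",
                     speaker  = "Vice Chair for Supervision Randal K. Quarles",
                     title    = "Spontaneity and Order",
                     text1    = "repeat of the header info",
                     text2    = "  It's a great pleasure to be with you today.  ",
                     stringsAsFactors = FALSE)

output$date <- as.Date(output$date, format = "%B %d, %Y")  # date string -> Date class
output$text2 <- trimws(gsub("\n", " ", output$text2))      # tidy the speech text
output$text1 <- NULL                                       # drop the redundant column
```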
Posted on 2020-01-25 18:10:36
If you look at the page, you can see that all the links and information are contained in a json on the page. It's probably easier to extract the content directly from that json than to render the webpage and then try to extract its content: https://www.federalreserve.gov/json/ne-speeches.json.
library(httr)
library(tidyverse)
library(rvest)
json <- GET("https://www.federalreserve.gov/json/ne-speeches.json")
speeches <- content(json) %>% # json from the webpage contains urls to speeches
bind_rows() %>%
transmute(Name = t,
url = str_replace(l, "//", "/"), # trying to get rid of the "//" at the beginning of the url
url = paste0("https://www.federalreserve.gov/", url)) %>%
filter(!is.na(Name)) # filtering NA as the last row of the json is not a valid speech
speeches$speech_transcript <- "" # making sure the column speech_transcript works before I try to assign its values in the loop
for (i in 1:nrow(speeches)) { # going through urls and getting the text of the speeches
speeches[i,]$speech_transcript <- read_html(speeches[i,]$url) %>%
html_node("#content") %>%
html_node("#article") %>%
html_node("div:nth-child(3)") %>%
html_text() %>%
str_squish() # getting rid of multiple spaces etc.
print(i)
}

https://stackoverflow.com/questions/59875844