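I have extracted the speech links using Rcrawler, as follows:
speech_links = Rcrawler::LinkExtractor("https://www.federalreserve.gov/newsevents/speeches.htm", urlregexfilter = "https://www.federalreserve.gov/newsevents/speech/")
I then take one of the links (a year) and get all the speech links for that year in the same way:
speech_links_2020 = Rcrawler::LinkExtractor(speech_links$InternalLinks[1])
This gives all the links for that year. Now I just need to know how to retrieve the speech title, speaker, time and other attributes.
I know that the code for the title is:
Rcrawler::ContentScraper(speech_links_2020$InternalLinks[2], XpathPatterns = "//head/title")
However, for the other attributes (speaker, time and content) I am not sure how to specify XpathPatterns, since I am not familiar with HTML.
Could anyone let me know how to do this?
Thanks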
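Answered 2020-01-25 22:38:35
You can use the following function to get a data frame with columns for date, title, speaker and venue, all scraped from the page using the appropriate xpath terms. All you have to do is feed it the year you want.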
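For readers unfamiliar with HTML, xpath expressions of the kind this answer uses can be tried offline before touching the live site. The following is only a minimal sketch: the HTML fragment is made up, and its structure and class names (`row eventlist`, `news__speaker`) merely imitate what the xpaths in this answer assume about the page, so the real markup may differ. The helper name `pick` is hypothetical.

```r
library(rvest)

# Made-up fragment (an assumption, not copied from the real page).
# Each speech has a <time>, an <em> title, a speaker <p> and a venue <p>.
html <- '<div class="row eventlist">
  <div class="row">
    <time>1/17/2020</time>
    <p><a href="#"><em>Speech A</em></a></p>
    <p class="news__speaker">Speaker A</p>
    <p>At venue A</p>
  </div>
  <div class="row">
    <time>1/16/2020</time>
    <p><a href="#"><em>Speech B</em></a></p>
    <p class="news__speaker">Speaker B</p>
    <p>At venue B</p>
  </div>
</div>'

# Parse the literal HTML string once
doc <- read_html(html)

# Hypothetical helper: return the text of every node matching an xpath
pick <- function(xp) html_text(html_nodes(doc, xpath = xp))

result <- data.frame(
  date    = pick("//time"),                                # every <time> element
  title   = pick("//div[@class = 'row eventlist']//em"),   # <em> anywhere inside the list
  speaker = pick("//p[@class = 'news__speaker']"),         # <p> with that exact class
  venue   = pick("//p[@class = 'news__speaker']/following-sibling::p"),  # <p> siblings after the speaker
  stringsAsFactors = FALSE
)
print(result)
```

Each xpath returns one value per speech here, which is what lets the four vectors line up as columns. If the real page ever has an entry with a missing element, the vectors would differ in length and building the data frame would fail.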
Note that I am using the rvest package, because the Rcrawler package does not seem to return the correct results with these same xpaths:
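library(rvest)
library(tibble)

get_fed_speeches <- function(year = format(Sys.Date(), "%Y"))
{
  # Build the URL of the index page for the requested year
  page <- paste0("https://www.federalreserve.gov/newsevents/speech/",
                 year,
                 "-speeches.htm")
  # Download and parse the page once, then reuse it for every xpath
  doc <- read_html(page)
  scrape <- function(x) html_text(html_nodes(doc, xpath = x))
  xpaths <- list(date    = "//time",
                 title   = "//div[@class = 'row eventlist']//em",
                 speaker = "//p[@class = 'news__speaker']",
                 venue   = "//p[@class = 'news__speaker']/following-sibling::p")
  # One column per xpath; every xpath must match the same number of nodes
  as_tibble(lapply(xpaths, scrape))
}
Note that it defaults to the current year, so to get the 2020 speeches (the current year at the time of writing), just do: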
get_fed_speeches()
# A tibble: 4 x 4
#   date    title                         speaker           venue
#   <chr>   <chr>                         <chr>             <chr>
# 1 1/17/2~ Spontaneity and Order: Trans~ Vice Chair for S~ At the American Bar Associa~
# 2 1/16/2~ The Outlook for Housing       Governor Michell~ At the 2020 Economic Foreca~
# 3 1/9/20~ U.S. Economic Outlook and Mo~ Vice Chair Richa~ At the C. Peter McColough S~
# 4 1/8/20~ Strengthening the Community ~ Governor Lael Br~ At the Urban Institute, Was~
Or, for 2015, do this:
get_fed_speeches(2015)
#> # A tibble: 54 x 4
#>    date    title                        speaker        venue
#>    <chr>   <chr>                        <chr>          <chr>
#>  1 12/3/2~ Financial Stability and Sha~ Vice Chairman~ "At the \"Financial Stability:~
#>  2 12/2/2~ The Economic Outlook and Mo~ Chair Janet L~ At the Economic Club of Washin~
#>  3 12/2/2~ Opening Remarks              Governor Dani~ At the Economic Growth and Reg~
#>  4 12/1/2~ Normalizing Monetary Policy~ Governor Lael~ At the Stanford Institute for ~
#>  5 11/20/~ Opening Remarks              Governor Jero~ At the 2015 Roundtable on Trea~
#>  6 11/19/~ Emerging Asia in Transition  Vice Chairman~ "At the \"Policy Challenges in~
#>  7 11/17/~ Thinking Critically about N~ Governor Dani~ At the Brookings Institution, ~
#>  8 11/17/~ Central Clearing in an Inte~ Governor Jero~ At the Clearing House Annual C~
#>  9 11/12/~ The Transmission of Exchang~ Vice Chairman~ "At the \"Monetary Policy Impl~
#> 10 11/12/~ Welcoming Remarks            Chair Janet L~ "At the \"Monetary Policy Impl~
#> # ... with 44 more rows
Answered 2020-01-25 22:04:16
It might be easier to parse the JSON feed that holds the speeches:
https://www.federalreserve.gov/json/ne-speeches.json
The speaker IDs are here: https://www.federalreserve.gov/json/nespeakers.json
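Before calling the live endpoint, the parsing step used in the code that follows can be tried offline. This is only a rough sketch under stated assumptions: the JSON string is made up and its field names (`date`, `title`, `speaker`) are hypothetical, not the real feed's schema. It uses jsonlite directly, on the assumption that the parsed response is a list of records, one per speech.

```r
library(jsonlite)
library(dplyr)

# Made-up records; the field names are hypothetical, not taken from the real feed
txt <- '[{"date": "1/17/2020", "title": "Speech A", "speaker": "Speaker A"},
         {"date": "1/16/2020", "title": "Speech B", "speaker": "Speaker B"}]'

# Keep the result as a plain list of named lists (no automatic simplification)
records <- fromJSON(txt, simplifyVector = FALSE)

# Stack the records into one tibble: one row per record, one column per field
speeches <- bind_rows(records)
print(speeches)
```

If the real feed contains nested fields or trailing entries with a different shape, `bind_rows()` may need the records cleaned up first; that would have to be checked against the actual response.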
library(httr)
library(tidyverse)
json <- GET("https://www.federalreserve.gov/json/ne-speeches.json")
speeches <- content(json) %>%
  bind_rows()

https://stackoverflow.com/questions/59906716