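I am trying to extract a YouTube video description using rvest. I know it would be easier to just use the API, but the ultimate goal is to become more familiar with rvest rather than just to get the video description. This is what I have done so far: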
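library(rvest)
# defining website
page <- "https://www.youtube.com/watch?v=4PqdqWWSHJY"
# setting XPath
Xp <- '/html/body/div[2]/div[4]/div/div[5]/div[2]/div[2]/div/div[2]/meta[2]'
# getting page
Website <- read_html(page)
# selecting the node (this step is missing from the snippet as posted; assumed to be html_node with the XPath above)
Description <- html_node(Website, xpath = Xp)
# printing description
html_attr(Description, name = "content")
While this does point to the video description, I do not get the full description, but a string that is cut off after a few lines:
[1] "The Conservatives and Labour have been outlining their main pitch to voters. The Prime Minister Boris Johson in his first major speech of the campaign said a..."
The expected output would be the full description: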
"The Conservatives and Labour have been outlining their main pitch to voters. The Prime Minister Boris Johnson in his first major speech of the campaign said a Conservative government would unite the country and "level up" the prospects for people with massive investment in health, better infrastructure, more police, and a green revolution. But he said the key issue to solve was Brexit. Meanwhile Labour vowed to outspend the Tories on the NHS in England.
Labour leader Jeremy Corbyn has also faced questions over his position on allowing a second referendum on Scottish independence. Today at the start of a two-day tour of Scotland, he said wouldn't allow one in the first term of a Labour government but later rowed back saying it wouldn't be a priority in the early years.
Sophie Raworth presents tonight's BBC News at Ten and unravels the day's events with the BBC's political editor Laura Kuenssberg, health editor Hugh Pym and Scotland editor Sarah Smith.
Please subscribe HERE: LINK"
Is there any way to get the full description with rvest?
Posted on 2019-11-18 09:33:00
Since you said you are focused on learning, I have added some explanation of how I got there after showing the code.
Reproducible code:
library(rvest)
library(magrittr)
url <- "https://www.youtube.com/watch?v=4PqdqWWSHJY"
# download the page, select the element by its id, and extract its text
url %>%
  read_html() %>%
  html_nodes(xpath = "//*[@id = 'eow-description']") %>%
  html_text()
Explanation:
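1. Locate the element
There are several ways to approach this. A common first step is to right-click the target element in the browser and choose "Inspect element". You will see something like this: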

Next, you can try to extract the data using the id shown in the inspector.
url %>%
  read_html() %>%
  html_nodes(xpath = "//*[@id = 'description']")
Unfortunately, that does not work in your case.
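A quick way to notice such a miss is to check whether the returned node set is empty. The following is only a sketch: the inline HTML is a made-up stand-in for the downloaded page, assumed to contain no element with the id description.

```r
library(rvest)

# Inline stand-in for the downloaded page (assumption: it has no element
# with the id "description", mirroring the situation described above).
doc <- read_html('<html><body><p id="eow-description">Some text</p></body></html>')

nodes <- html_nodes(doc, xpath = "//*[@id = 'description']")

# An empty node set means the XPath matched nothing in this document.
n_matches <- length(nodes)
texts <- html_text(nodes)
```

Here n_matches is 0 and texts is an empty character vector, which signals that the selector, not the extraction step, is the problem.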
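2. Make sure you have the right source
So you have to make sure that the target data is actually in the document you loaded. What the browser inspector shows may differ from what read_html downloads. You can check this in the browser's network activity or, if you prefer to check in R, with a small function I wrote for this purpose: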
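showHtmlPage <- function(doc){
  # write the downloaded document to a temporary .html file
  tmp <- tempfile(fileext = ".html")
  doc %>% toString %>% writeLines(con = tmp)
  # open the file in the RStudio viewer (requires RStudio and the rstudioapi package)
  tmp %>% browseURL(browser = rstudioapi::viewer)
}
Usage:
url %>% read_html %>% showHtmlPage
You will see that the target data is actually in the document you downloaded, so you can stick with rvest. Next, you have to find the XPath (or CSS selector).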
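If the RStudio viewer is not available, a rough alternative is to search the serialized document for a known fragment of the text. This is a minimal sketch, not part of rvest: the helper name containsText is hypothetical, and the inline HTML is a stand-in for the downloaded page so that the example runs offline.

```r
library(rvest)

# Hypothetical helper (not part of rvest): TRUE if the serialized
# document contains the given fragment verbatim.
containsText <- function(doc, fragment) {
  grepl(fragment, as.character(doc), fixed = TRUE)
}

# Inline stand-in for a downloaded page, so the sketch runs offline.
doc <- read_html('<html><body><p id="eow-description">The Conservatives and Labour have been outlining their main pitch to voters.</p></body></html>')

found  <- containsText(doc, "The Conservatives and ")
absent <- containsText(doc, "text that is not on the page")
```

With a real page you would pass the result of read_html(url) instead of the inline document.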
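3. Find the target tag in the downloaded document
You can search for the tag that contains the text you are looking for.
# doc is the downloaded document
doc <- read_html(url)
doc %>% html_nodes(xpath = "//*[contains(text(), 'The Conservatives and ')]")
The output will be: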
{xml_nodeset (1)}
[1] <p id="eow-description" class="">The Conservatives and Labour have ....
Here you can see that you are looking for a tag with the id eow-description.
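The search-then-extract steps can be tried without network access on an inline snippet. This is only a sketch: the markup below is a made-up stand-in for the downloaded page. Only the id eow-description comes from the output above; the meta tag and the shortened texts are assumptions for illustration.

```r
library(rvest)
library(magrittr)

# Hypothetical stand-in for the downloaded page: a meta tag holding a
# truncated description and a <p> holding the full one.
html <- '<html><head>
<meta name="description" content="The Conservatives and Labour have...">
</head><body>
<p id="eow-description" class="">The Conservatives and Labour have been outlining their main pitch to voters.<br>Please subscribe HERE: LINK</p>
</body></html>'

doc <- read_html(html)

# Step 1: find any element whose text contains a known fragment.
hits <- doc %>%
  html_nodes(xpath = "//*[contains(text(), 'The Conservatives and ')]")

# Step 2: read the id of the matching element so it can be targeted directly.
ids <- html_attr(hits, "id")

# Step 3: select by that id and extract the full text.
full <- doc %>%
  html_nodes(xpath = "//*[@id = 'eow-description']") %>%
  html_text()
```

The text search matches only the p element, because the meta tag keeps its text in the content attribute rather than in a text node. That is also why reading the meta tag's attribute, as in the question, can give a shortened string while the p element holds the full description.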
https://stackoverflow.com/questions/58861911