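With either of the two code blocks below, I get "glued" words, by which I mean words that should be separated by whitespace but are not, and that is a problem. In the original HTML they seem to be separated by a <br>, and I am not able to deal with this. Both blocks do the same thing in different ways.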
library(XML)
library(RCurl)
# Block 1---------
url <- "https://www.letras.mus.br/red-hot-chili-peppers/32739/"
u <- readLines(url)
h <- htmlTreeParse(file = u,
                   asText = TRUE,
                   useInternalNodes = TRUE,
                   encoding = "utf-8")
song <- getNodeSet(doc=h, path="//article", fun=xmlValue)
# Block 2---------
u <- "https://www.letras.mus.br/red-hot-chili-peppers/32739/"
h <- htmlParse(getURL(u))
song <- xpathSApply(h, path = "//article", fun = xmlValue)

It returns something like the following:
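[1] "Sometimes I feelLike, I don't have my only friend--the city I live in-my only friend is the city I live in-the city I live in believeThat,, there is no one out there, believeThat it is hard, I am aloneAt.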
Posted on 2021-12-15 13:46:18

I was able to retrieve the words with the following code:
library(RSelenium)
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate("https://www.letras.mus.br/red-hot-chili-peppers/32739/")
remDr$screenshot(display = TRUE, useViewer = TRUE)
page_Content <- remDr$getPageSource()[[1]]
list_Text_Song <- list()
# Try up to 30 paragraphs; positions that do not exist yield NA
for (i in 1:30) {
  print(i)
  web_Obj <- tryCatch(
    remDr$findElement("xpath", paste0("//*[@id='js-lyric-cnt']/article/div[2]/div[2]/p[", i, "]")),
    error = function(e) NA)
  list_Text_Song[[i]] <- tryCatch(web_Obj$getElementText(), error = function(e) NA)
}
# Flatten the results and drop the paragraphs that were not found
list_Text_Song <- unlist(list_Text_Song)
list_Text_Song <- list_Text_Song[!is.na(list_Text_Song)]

The words are not glued together with this approach.
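
For comparison, it may be possible to stay within the XML package from the question and avoid a browser altogether. The gluing seems to come from xmlValue() concatenating all text under <article> and dropping the <br> elements between lines. A minimal sketch, assuming that is the cause: select the text nodes individually with //article//text(), so each line comes back as its own element. The HTML string below is a made-up stand-in for the real page, whose markup may differ.

```r
library(XML)

# Hypothetical snippet standing in for the real page: two lines split by <br>
html <- "<html><body><article><p>first line<br>second line</p></article></body></html>"
doc <- htmlParse(html, asText = TRUE)

# Reproduces the problem: the <br> is dropped and the lines are glued
glued <- xpathSApply(doc, "//article", xmlValue)

# Selecting each text node separately keeps the line boundaries
lines <- xpathSApply(doc, "//article//text()", xmlValue)
lines <- trimws(lines)          # strip stray whitespace around each piece
lines <- lines[nzchar(lines)]   # drop empty pieces

# Join with a separator of your choice
song <- paste(lines, collapse = "\n")
```

On the real page, the XPath would probably need to be narrowed to the lyric paragraphs so that other text inside <article> is not picked up.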
https://stackoverflow.com/questions/43980432