Q: Avoiding "glued" words when web scraping with R

Stack Overflow user

Asked 2017-05-15 13:09:04

Answers 1 · Views 37 · Followers 0 · Votes 0

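With either of the two code blocks below I get "glued" words, by which I mean words that should be separated by a space but are not. In the original HTML they appear to be separated by a <br> tag, and I cannot work out how to handle that. The two blocks do the same thing in different ways.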

library(XML)
library(RCurl)
# Block 1---------
url <- "https://www.letras.mus.br/red-hot-chili-peppers/32739/"
u <- readLines(url)
h <- htmlTreeParse(file=u,  
               asText=TRUE, 
               useInternalNodes = TRUE, 
               encoding = "utf-8")

song <- getNodeSet(doc=h, path="//article", fun=xmlValue)

# Block 2---------
u <- "https://www.letras.mus.br/red-hot-chili-peppers/32739/"
h <- htmlParse(getURL(u))
song <- xpathSApply(h, path = "//article", fun = xmlValue)

It returns something like the following:

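[1] "Sometimes I feelLike I don't have my only friend -- the city I live in - my only friend is the city I live in - the city I live in believeThat, there's nobody out there, believeThat it's hard, I'm aloneAt.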

web-scraping

html

1 Answer

Stack Overflow user

Answered 2021-12-15 13:46:18

I was able to retrieve the words with the following code:

library(RSelenium)

# Start a Selenium Firefox container (requires Docker).
# shell() is Windows-only; on other platforms use system() instead.
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')

# Connect to the Selenium server and load the lyrics page
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate("https://www.letras.mus.br/red-hot-chili-peppers/32739/")
remDr$screenshot(display = TRUE, useViewer = TRUE)
page_Content <- remDr$getPageSource()[[1]]

list_Text_Song <- list()

# Try the first 30 <p> elements of the lyrics block; positions that do not exist yield NA
for(i in 1 : 30)
{
  print(i)
  web_Obj <- tryCatch(remDr$findElement("xpath", paste0("//*[@id='js-lyric-cnt']/article/div[2]/div[2]/p[", i, "]")), error = function(e) NA)
  list_Text_Song[[i]] <- tryCatch(web_Obj$getElementText(), error = function(e) NA)
}

# Flatten the results and drop the positions that were not found
list_Text_Song <- unlist(list_Text_Song)
list_Text_Song <- list_Text_Song[!is.na(list_Text_Song)]

With this approach the words are not glued together.

Votes 0
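
For comparison, here is a minimal sketch that stays within the XML package used in the question. It assumes the glued words sit on either side of an inline tag inside the <article> element, as the question suggests. The HTML string below is a hypothetical stand-in for the real page, not its actual markup. Instead of calling xmlValue on the whole node, which concatenates all descendant text with no separator, it selects each text node on its own and joins them with a space.

```r
library(XML)

# Hypothetical stand-in for the lyrics markup (not the real page source):
# two lyric lines separated only by a <br> tag.
html <- "<html><body><article><p>Sometimes I feel<br>Like I don't have a partner</p></article></body></html>"
doc <- htmlParse(html, asText = TRUE)

# xmlValue() on the whole node joins the text pieces without any separator,
# which reproduces the glued words from the question.
glued <- xpathSApply(doc, "//article", xmlValue)

# Select every text node under <article> separately, trim each piece,
# drop empty pieces, and join the rest with a single space.
parts <- trimws(xpathSApply(doc, "//article//text()", xmlValue))
fixed <- paste(parts[nzchar(parts)], collapse = " ")

print(glued)
print(fixed)
```

Joining with "\n" instead of " " would keep one lyric line per output line. On the live page the XPath may need to be narrowed to the lyrics container so that unrelated text inside <article> is not included.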

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/43980432
