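I am trying to web-scrape a page. Sometimes, however, my loop fails because the parser reports "failed to load HTTP resource". The page does not load in my browser either, so the problem is not in my code.
Still, it is very annoying to have to restart the process after adding an exception for every page where I hit an error. I would like to know whether there is a way to add a condition, something like: if an error occurs, move on to the next iteration of the loop.
I looked up htmlParse in the help pages and found that it has an error argument, but I could not work out how to use it. Any ideas for my case?
Here is a reproducible example: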
if (!require(RCurl)) install.packages('RCurl')
if (!require(XML)) install.packages('XML')
if (!require(seqinr)) install.packages('seqinr')
listofTabelas <- list()  # must exist before the loop assigns into it
for (i in 575:585) {
  currentPage <- i  # page id to fetch in this iteration
  # Link to be fetched
  link <- paste("http://www.cnj.jus.br/improbidade_adm/visualizar_condenacao.php?seq_condenacao=",
                currentPage,
                sep = '')
  doc <- htmlParse(link, encoding = "UTF-8")  # this will preserve characters
  tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
  if (length(tables) != 0) {
    tabela2 <- as.data.frame(tables[10])
    tabela2[, 1] <- gsub("\\n", " ", tabela2[, 1])
    tabela2[, 2] <- gsub("\\n", " ", tabela2[, 2])
    tabela2[, 2] <- gsub("\\t", " ", tabela2[, 2])
    listofTabelas[[i]] <- tabela2
    tabela1 <- do.call("rbind", listofTabelas)
    names(tabela1) <- c("Variaveis", "status")
  }
}

Answered on 2014-02-11 00:46:03
You might be better off using the httr package.
library(httr)
library(XML)
url <- "http://www.cnj.jus.br/improbidade_adm/visualizar_condenacao.php"
for (i in 575:585) {
  # the query is appended to url as ?seq_condenacao=<i>
  response <- GET(url, query = list(seq_condenacao = as.character(i)))
  if (response$status_code != 200) {  # HTTP request failed!!
    # do some stuff...
    print(paste("Failure:", i, "Status:", response$status_code))
    next
  }
  # parse the response body as text
  body <- content(response, as = "text", encoding = "UTF-8")
  doc <- htmlParse(body, asText = TRUE, encoding = "UTF-8")
  # do some other stuff
  print(paste("Success:", i, "Status:", response$status_code))
}
# [1] "Success: 575 Status: 200"
# [1] "Success: 576 Status: 200"
# [1] "Success: 577 Status: 200"
# [1] "Success: 578 Status: 200"
# [1] "Success: 579 Status: 200"
# [1] "Success: 580 Status: 200"
# [1] "Success: 581 Status: 200"
# [1] "Success: 582 Status: 200"
# [1] "Success: 583 Status: 200"
# [1] "Success: 584 Status: 200"
# [1] "Success: 585 Status: 200"

Source: https://stackoverflow.com/questions/21690406
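The question also asks for a way to skip to the next iteration when a call fails with an error. One way to do that in base R is to wrap the failing call in tryCatch and use next when it returns NULL. The following is only a minimal sketch: try_page and fake_fetch are hypothetical names, and fake_fetch is a stand-in that simulates a page that will not load, so the example runs without network access.

```r
# Hypothetical helper: call fetch_fun(id) and return NULL instead of
# stopping the whole loop when the call raises an error.
try_page <- function(fetch_fun, id) {
  tryCatch(
    fetch_fun(id),
    error = function(e) {
      message("Skipping ", id, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Stand-in for the real parsing call (an assumption for illustration):
# it fails for ids divisible by 5 to simulate an unreachable page.
fake_fetch <- function(id) {
  if (id %% 5 == 0) stop("failed to load HTTP resource")
  paste("page", id)
}

results <- list()
for (i in 575:585) {
  page <- try_page(fake_fetch, i)
  if (is.null(page)) next  # an error occurred: go to the next iteration
  results[[as.character(i)]] <- page
}

print(names(results))
```

In the real loop, fake_fetch would presumably be replaced by a small function that builds the link and calls htmlParse(link, encoding = "UTF-8"). The rest of the loop body then only runs for pages that parsed successfully.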