Q: Ignoring non-existent URLs when using htmlParse() in R

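Stack Overflow user

Asked 2014-02-28 03:45:54

Answers 2 · Views 621 · Followers 0 · Votes 0

Hi all,

I have a long list of place names (about 15,000) that I want to use to look up wiki pages and extract data from them. Unfortunately, not every place has a wiki page, and when htmlParse() hits one of those, it stops the function and returns an error.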

    Error: failed to load HTTP resource

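I can't go through and remove every place name that produces a non-existent URL, so I was wondering whether there is a way to skip the places that don't have a wiki page?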

    # Town names to be used
    towns <- data.frame('recID' = c('G62', 'G63', 'G64', 'G65'), 
                    'state' = c('Queensland', 'South_Australia', 'Victoria', 'Western_Australia'),
                    'name'  = c('Balgal Beach', 'Balhannah', 'Ballan', 'Yunderup'),
                    'feature' = c('POPL', 'POPL', 'POPL', 'POPL'))

    towns$state <- as.character(towns$state)

    towns$name <- sub(' ', '_', as.character(towns$name))

    # Function that extracts data from wiki
    wiki.tables <- function(towns)  {
      require(RJSONIO)
      require(XML)
      u <- paste('http://en.wikipedia.org/wiki/',
                 sep = '', towns[,1], ',_', towns[,2])
      res <- lapply(u, function(x) htmlParse(x))
      tabs <- lapply(sapply(res, getNodeSet, path = '//*[@class="infobox vcard"]')
             , readHTMLTable)
      return(tabs)
    }

    # Now to run the function. Yunderup will produce a URL that 
    # doesn't exist. So this will result in the error.
    test <- wiki.tables(towns[,c('name', 'state')])

    # It works if I don't include the place that produces a non-existent URL.
    test <- wiki.tables(towns[1:3,c('name', 'state')])

Is there a way to identify these non-existent URLs and either skip them or remove them?

Thanks for your help!

Cheers, Adam

html-parsing

web-scraping

2 Answers

Stack Overflow user

Accepted answer

Answered 2014-02-28 22:34:53

Here is another option that uses the httr package. (By the way, you don't need RJSONIO.) Replace your wiki.tables(...) function with:

    wiki.tables <- function(towns)  {
      require(httr)
      require(XML)
      # Fetch a page; return the parsed HTML on HTTP 200, otherwise NULL
      get.HTML <- function(url){
        resp <- GET(url)
        if (resp$status_code == 200) {
          # as = "text" returns the body as a character string
          return(htmlParse(content(resp, as = "text"), asText = TRUE))
        }
        NULL
      }
      u <- paste('http://en.wikipedia.org/wiki/',
                 sep = '', towns[,1], ',_', towns[,2])
      res <- lapply(u, get.HTML)
      res <- res[sapply(res, function(x) !is.null(x))]   # remove NULLs
      tabs <- lapply(sapply(res, getNodeSet, path = '//*[@class="infobox vcard"]')
                     , readHTMLTable)
      return(tabs)
    }

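It runs a GET request and tests the status code. The drawback of url.exists(...) is that you have to query each URL twice: once to see whether it exists, and again to fetch the data.

By the way, when I tried your code, the Yunderup URL actually did exist?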
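Neither answer in this thread uses it, but a different technique for the "skip and carry on" behaviour the question asks for is to wrap the failing call in base R's `tryCatch()`. The following is only a minimal sketch: `try_each` is a hypothetical helper name, and it assumes `inputs` is a character vector and that a failure in `f` surfaces as an R error (as htmlParse() does in the question).

```r
# Hypothetical helper: apply `f` to every element of `inputs`,
# turning any error into NULL instead of aborting the whole loop.
try_each <- function(inputs, f) {
  results <- lapply(inputs, function(x) {
    tryCatch(f(x), error = function(e) NULL)
  })
  # Label each result with the input that produced it
  names(results) <- inputs
  # Drop the entries that failed (or that legitimately returned NULL)
  Filter(Negate(is.null), results)
}
```

With the XML package loaded, something like `res <- try_each(u, htmlParse)` would then keep only the pages that parsed. Note that this also hides errors unrelated to missing pages, such as a dropped connection, so logging the condition message inside the handler may be worthwhile.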

Votes 2

Stack Overflow user

Answered 2014-02-28 04:32:55

You can use the `url.exists` function from `RCurl`:

    require(RCurl)
    u <- paste('http://en.wikipedia.org/wiki/',
               sep = '', towns[,'name'], ',_', towns[,'state'])
    > sapply(u, url.exists)
       http://en.wikipedia.org/wiki/Balgal_Beach,_Queensland 
                                                        TRUE 
     http://en.wikipedia.org/wiki/Balhannah,_South_Australia 
                                                        TRUE 
               http://en.wikipedia.org/wiki/Ballan,_Victoria 
                                                        TRUE 
    http://en.wikipedia.org/wiki/Yunderup,_Western_Australia 
                                                        TRUE
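To act on a logical vector like the one above, the rows of `towns` can be filtered before any page is parsed. This is a rough sketch rather than part of the original answer: `build_urls` and `keep_existing` are hypothetical helper names, and the existence check is passed in as a parameter (`exists_fn`) so the filtering logic can be tried without network access. It also uses `gsub()` rather than `sub()`, since `sub()` replaces only the first space in a name.

```r
# Build the Wikipedia URL for each row; gsub() replaces every space,
# whereas sub() would replace only the first one.
build_urls <- function(towns) {
  paste0("http://en.wikipedia.org/wiki/",
         gsub(" ", "_", towns$name), ",_", towns$state)
}

# Hypothetical helper: keep only the rows whose URL passes `exists_fn`.
# In real use `exists_fn` could be RCurl::url.exists.
keep_existing <- function(towns, exists_fn) {
  ok <- vapply(build_urls(towns), exists_fn, logical(1), USE.NAMES = FALSE)
  towns[ok, , drop = FALSE]
}
```

A call such as `keep_existing(towns, RCurl::url.exists)` would then return the subset of `towns` to pass on to `wiki.tables()`, at the cost of the extra request per URL that the accepted answer mentions.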

Votes 3

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/22085773
