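I need to extract the list of 122 chemicals from the US EPA drinking water standards. The table and data are publicly available at: http://www.epa.gov/wqc/national-recommended-water-quality-criteria-human-health-criteria-table
I am trying to use the XML package.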
library(XML)
url <- "http://www.epa.gov/wqc/national-recommended-water-quality-criteria-human-health-criteria-table"
classes <- c('character', 'integer', 'FormattedNumber', 'FormattedNumber', 'integer', 'character')
USEPA <- readHTMLTable(url, which = 1, colClasses = classes, stringsAsFactors = FALSE)

Unfortunately, all I get is the following error message: "error : failed to load HTTP resource"
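Posted on 2019-10-19 06:14:04
If I click the link given above, my browser automatically takes me to the https site.
My guess is that there may be no http version of the page, only an https version. That could be what causes the problem for the XML library.
Here is one way to read the data, based on the blog post "Using rvest to Scrape an HTML Table":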
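If you would rather stay with the XML package, one possible workaround is to fetch the page with a client that speaks HTTPS and hand the downloaded text to readHTMLTable. The following is a minimal sketch under that assumption, not a tested solution for the EPA page: read_first_table is a hypothetical helper name, the sample HTML is a stand-in so the example runs offline, and the commented-out lines assume the httr package is installed.

```r
library(XML)

# Hypothetical helper: parse HTML that is already in memory and
# return its first <table> as a data frame.
read_first_table <- function(html_text) {
  doc <- htmlParse(html_text, asText = TRUE)
  readHTMLTable(doc, which = 1, header = TRUE, stringsAsFactors = FALSE)
}

# Stand-in page so the sketch runs without network access
sample_html <- "<html><body><table>
  <tr><th>Pollutant</th><th>CAS Number</th></tr>
  <tr><td>Acenaphthene (P)</td><td>83329</td></tr>
  <tr><td>Acrolein (P)</td><td>107028</td></tr>
</table></body></html>"

demo <- read_first_table(sample_html)
print(demo)

# Against the live page (assumption: httr is installed and the site is reachable):
# resp  <- httr::GET("https://www.epa.gov/wqc/national-recommended-water-quality-criteria-human-health-criteria-table")
# USEPA <- read_first_table(httr::content(resp, as = "text", encoding = "UTF-8"))
```

Separating the download from the parsing means the XML package never has to open the URL itself, which is the step that appears to fail in the question.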
library("rvest")
url <- "https://www.epa.gov/wqc/national-recommended-water-quality-criteria-human-health-criteria-table"
table_list <- url %>%
  read_html() %>%
  # I copied this XPath as described in the blog post I linked above
  html_nodes(xpath = '/html/body/section/div[2]/div[1]/div/div/table') %>%
  html_table()
# we have a list, but need to get the first item (the table)
# (named epa_table so it does not mask rvest's html_table() function)
epa_table <- table_list[[1]]
head(epa_table[, 1:2]) # show only first two columns

Output:
                              Pollutant CAS Number
1                      Acenaphthene (P)      83329
2                          Acrolein (P)     107028
3                     Acrylonitrile (P)     107131
4                            Aldrin (P)     309002
5 alpha-Hexachlorocyclohexane (HCH) (P)     319846
6                  alpha-Endosulfan (P)     959988

https://stackoverflow.com/questions/58458536
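An absolute XPath copied from the browser can stop matching if the page layout changes. As a hedged alternative, the sketch below selects every table element by tag name and lets you pick the one you need by position. all_tables is a hypothetical helper name, and the sample HTML is a stand-in so the example runs offline; whether the table of interest is still the first one on the live EPA page is an assumption you would need to check.

```r
library(rvest)

# Hypothetical helper: return every <table> on a parsed page
# as a list of data frames.
all_tables <- function(page) {
  page %>%
    html_nodes("table") %>%
    html_table()
}

# Stand-in page so the sketch runs without network access
sample_page <- read_html("<html><body><table>
  <tr><th>Pollutant</th><th>CAS Number</th></tr>
  <tr><td>Acenaphthene (P)</td><td>83329</td></tr>
  <tr><td>Acrolein (P)</td><td>107028</td></tr>
</table></body></html>")

tables <- all_tables(sample_page)
length(tables)      # how many tables were found on the page
head(tables[[1]])   # inspect the first one

# Against the live page (assumption: the site is reachable):
# tables <- all_tables(read_html(url))
```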