Scraping an HTML table with multiple sub-headers

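Stack Overflow user

Asked 2015-09-17 11:34:43

3 answers, 743 views, 0 followers, 0 votes

I am trying to import the list of nuclear test sites (from the Wikipedia page) into a data.frame using the following code: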

library(RCurl)
library(XML)

theurl <- "https://en.wikipedia.org/wiki/List_of_nuclear_test_sites"
webpage <- getURL(theurl)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)

pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)

# Find XPath (go to the webpage, right-click "Inspect element", find the table, then right-click "Copy XPath")
myxpath <- "//*[@id='mw-content-text']/table[2]"

# Extract table header and contents
tablehead <- xpathSApply(pagetree, paste(myxpath,"/tr/th",sep=""), xmlValue)
results <- xpathSApply(pagetree, paste(myxpath,"/tr/td",sep=""), xmlValue)

# Convert character vector to dataframe
content <- as.data.frame(matrix(results, ncol = 5, byrow = TRUE))
names(content) <- c("Testing country", "Location", "Site", "Coordinates", "Notes")

However, the table contains multiple sub-headers, which prevent the data.frame from being filled consistently. How can I solve this?

web-scraping

html-table

3 Answers

Stack Overflow user

Accepted answer

Posted 2015-09-19 09:27:16

Take a look at the htmltab package. It lets you use the sub-headers to populate a new column:

library(htmltab)
tab <- htmltab("https://en.wikipedia.org/wiki/List_of_nuclear_test_sites",
           which = "/html/body/div[3]/div[3]/div[4]/table[2]",
           header = 1 + "//tr/th[@style='background:#efefff;']",
           rm_nodata_cols = FALSE)

Votes: 1

Stack Overflow user

Posted 2015-09-17 15:08:41

I found this example from Carson very helpful:

library(rvest)
theurl <- "https://en.wikipedia.org/wiki/List_of_nuclear_test_sites"
# First, grab the page source (read_html() replaces the deprecated html())
content <- read_html(theurl) %>%
  # then extract the first node with class of wikitable
  html_node(".wikitable") %>%
  # then convert the HTML table into a data frame
  html_table()
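
The data frame returned this way still has the sub-header rows mixed in with the data rows. The sketch below shows one way they could be moved into a column of their own using base R only. It assumes that each sub-header row arrives as a row whose cells all hold the same text, which may not match what html_table() returns for this page. The toy data and the helper name fill_subheaders are made up for illustration.

```r
# Toy stand-in for the scraped table; the values are made up for illustration.
# Rows 1 and 4 mimic sub-header rows: the same text repeated in every column.
scraped <- data.frame(
  Location = c("Country A", "Site 1", "Site 2", "Country B", "Site 3"),
  Site     = c("Country A", "North",  "South",  "Country B", "East"),
  Notes    = c("Country A", "note 1", "note 2", "Country B", "note 3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: move sub-header rows into a new grouping column.
fill_subheaders <- function(df, group_col = "Group") {
  # A row counts as a sub-header when all of its cells hold the same text
  is_sub <- apply(df, 1, function(r) length(unique(r)) == 1)
  # cumsum() numbers each block of rows by the sub-header that precedes it
  block  <- cumsum(is_sub)
  labels <- df[[1]][is_sub]
  # Rows before the first sub-header (block == 0) keep NA as their group
  group  <- rep(NA_character_, nrow(df))
  group[block > 0] <- labels[block[block > 0]]
  group_df <- setNames(data.frame(group, stringsAsFactors = FALSE), group_col)
  # Prepend the group column, then drop the sub-header rows themselves
  out <- cbind(group_df, df)[!is_sub, , drop = FALSE]
  rownames(out) <- NULL
  out
}

tidy <- fill_subheaders(scraped)
print(tidy)
```

A data row whose cells happen to be identical would be mistaken for a sub-header, so the detection rule should be checked against the real table before relying on it.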

Votes: 1

Stack Overflow user

Posted 2015-09-17 14:38:34

Have you tried this?

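library(RCurl)
library(XML)

l.wiki.url <- getURL( url = "https://en.wikipedia.org/wiki/List_of_nuclear_test_sites" )
# getURL() returns the page source as a string, so parse it as text
l.wiki.par <- htmlParse( file = l.wiki.url, asText = TRUE )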

l.tab.con <- xpathSApply( doc  = l.wiki.par
                        , path = "//table[@class='wikitable']//tr//td"
                        , fun  = xmlValue
                        )
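
Pulling every td into one flat vector loses the row boundaries, which is what makes the sub-headers hard to deal with. A possible alternative is to walk the table row by row and carry the most recent sub-header along. The sketch below does this on a small made-up table rather than the live page. It assumes sub-header rows can be recognised as th cells with a colspan attribute; on the real page a different test may be needed, such as the style attribute used in the accepted answer.

```r
library(XML)

# Toy HTML standing in for the Wikipedia table; the contents are made up.
toy_html <- '
<table class="wikitable">
  <tr><th>Location</th><th>Site</th></tr>
  <tr><th colspan="2">Country A</th></tr>
  <tr><td>Loc 1</td><td>Site 1</td></tr>
  <tr><td>Loc 2</td><td>Site 2</td></tr>
  <tr><th colspan="2">Country B</th></tr>
  <tr><td>Loc 3</td><td>Site 3</td></tr>
</table>'

doc  <- htmlParse(toy_html, asText = TRUE)
rows <- getNodeSet(doc, "//table[@class='wikitable']//tr")

current <- NA_character_   # most recent sub-header seen so far
records <- list()

for (i in seq_along(rows)) {
  row   <- rows[[i]]
  # Assumption: a sub-header is a th cell that spans columns
  sub   <- xpathSApply(row, "./th[@colspan]", xmlValue)
  cells <- xpathSApply(row, "./td", xmlValue)
  if (length(sub) > 0) {
    current <- sub[1]
  } else if (length(cells) > 0) {
    # Data row: prepend the sub-header it belongs to
    records[[length(records) + 1]] <- c(current, cells)
  }
  # Rows with neither (the main header row) are skipped
}

result <- as.data.frame(do.call(rbind, records), stringsAsFactors = FALSE)
names(result) <- c("Country", "Location", "Site")
print(result)
```

Because each data row is collected together with its sub-header, the rows stay aligned even though the sub-header rows have fewer cells. The do.call(rbind, ...) step does assume every data row has the same number of cells.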

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/32629350
