How can I extract the address from the following XML?
<address class="addressReset">
  <span rel="v:address">
    <span dir="ltr">
      <span class="street-address" property="v:street-address">77 Yesler Way</span>,
      <span class="locality">
        <span property="v:locality">Seattle</span>,
        <span property="v:region">WA</span>
        <span property="v:postal-code">98104-2530</span>
      </span>
    </span>
  </span>
</address>
I would like to extract these values into the variables streetAddr, City, State and Zipcode.
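I tried:
require(XML)
data <- xmlParse(Address)
xml_data <- xmlToList(data)
But the resulting xml_data is a deeply nested list, and I don't know how to work with it. For example,
xml_data$body$address$span$span$span$text gives the street address. Is there a better way to parse this? Can I use the class and property attributes to get the values I want?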
Answered 2016-02-17 02:23:13
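Another library(XML) solution, this time using XPath. Parse the text (txt is the XML string containing two addresses, as defined in the xml2 answer):
library(XML)
xml = xmlParse(txt)
Create the XPath queries, one per property:
properties = c("street-address", "locality", "region", "postal-code")
queries = sprintf("//span[@property='v:%s']/text()", properties)
names(queries) = properties
Use xpathSApply() and xmlValue() on each query to retrieve the results:
sapply(queries, xpathSApply, doc=xml, fun=xmlValue)
The result is a character matrix with one row per address and one column per property.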
> sapply(queries, xpathSApply, doc=xml, fun=xmlValue)
     street-address  locality      region  postal-code
[1,] "77 Yesler Way" "Seattle"     "WA"    "98104-2530"
[2,] "88 Yesler Way" "Seattttttle" "WAAAA" "99999-2530"
Answered 2016-02-16 22:10:29
Why not build an XML tree and navigate it?
library(XML)
datatree <- xmlTreeParse(Address)
topNode <- xmlRoot(datatree)
address <- xmlSApply(topNode, function(x) xmlSApply(x[[1]][[1]][[1]], xmlValue))
print(address)
Answered 2016-02-16 23:33:52
Here is one approach using xml2:
library(xml2)
library(purrr)
txt <- '<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<addresses>
<address class=\"addressReset\">
<span rel=\"v:address\">
<span dir=\"ltr\">
<span class=\"street-address\" property=\"v:street-address\">77 Yesler Way</span>,
<span class=\"locality\">
<span property=\"v:locality\">Seattle</span>,
<span property=\"v:region\">WA</span>
<span property=\"v:postal-code\">98104-2530</span>
</span>
</span>
</span>
</address>
<address class=\"addressReset\">
<span rel=\"v:address\">
<span dir=\"ltr\">
<span class=\"street-address\" property=\"v:street-address\">88 Yesler Way</span>,
<span class=\"locality\">
<span property=\"v:locality\">Seattttttle</span>,
<span property=\"v:region\">WAAAA</span>
<span property=\"v:postal-code\">99999-2530</span>
</span>
</span>
</span>
</address>
</addresses>
'
doc <- read_xml(txt)
properties <- c("v:street-address", "v:locality", "v:region", "v:postal-code")
map_df(xml_find_all(doc, "//address"), function(x) {
  data.frame(as.list(set_names(map_chr(properties, function(y) {
    xml_text(xml_find_all(x, sprintf(".//span[@property='%s']", y)))
  }), properties)), stringsAsFactors=FALSE)
})
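## Source: local data frame [2 x 4]
##
##   v.street.address  v.locality v.region v.postal.code
##              (chr)       (chr)    (chr)         (chr)
## 1    77 Yesler Way     Seattle       WA    98104-2530
## 2    88 Yesler Way Seattttttle    WAAAA    99999-2530
I assumed the addresses would be wrapped in a parent tag. That may be a wrong assumption, but adapting the code to a different layout should be straightforward.
For each address, extract each span by its property attribute, convert the values into a data frame, and bind all the data frames together.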
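For the single-address case in the question, the same property-based lookup can assign the four values directly to the variables the question names. This is only a minimal sketch: address_xml and the helper get_property() are names made up for this illustration and are not part of xml2, and it assumes the quotes in the XML are plain (unescaped) double quotes.

```r
library(xml2)

# A single address, as in the question (hypothetical variable name).
address_xml <- '<address class="addressReset">
  <span rel="v:address">
    <span dir="ltr">
      <span class="street-address" property="v:street-address">77 Yesler Way</span>,
      <span class="locality">
        <span property="v:locality">Seattle</span>,
        <span property="v:region">WA</span>
        <span property="v:postal-code">98104-2530</span>
      </span>
    </span>
  </span>
</address>'

doc <- read_xml(address_xml)

# Hypothetical helper: text of the first span below `node` whose
# property attribute equals `property`.
get_property <- function(node, property) {
  xml_text(xml_find_first(node, sprintf(".//span[@property='%s']", property)))
}

streetAddr <- get_property(doc, "v:street-address")
City       <- get_property(doc, "v:locality")
State      <- get_property(doc, "v:region")
Zipcode    <- get_property(doc, "v:postal-code")
```

Using xml_find_first() rather than xml_find_all() keeps each result to a single string. In recent xml2 versions a property that is absent should come back as NA instead of raising an error, which may be more convenient when some addresses lack a field.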
https://stackoverflow.com/questions/35442646