Q: XPath 1.0 expression returns NULL

Stack Overflow user

Asked 2014-09-22 12:37:33

Answers: 2, Views: 137, Followers: 0, Votes: 2

This portion of the HTML from http://www.lewisthomason.com/locations/ contains what I want to extract: the four cities where the firm has offices (Knoxville, Memphis, Nashville, and Sevierville).

<div id="the_content">
<div class="one_fourth">
<h3>
<cufon class="cufon cufon-canvas" alt="KNOXVILLE" style="width: 87px; height: 26px;">
<canvas width="104" height="25" style="width: 104px; height: 25px; top: -1px; left: 0px;"></canvas>
<cufontext>KNOXVILLE</cufontext>
</cufon>
</h3>
<p>
<h6>
</div>
<div class="one_fourth">
<div class="one_fourth">
<div class="one_fourth last">
<div class="clearboth"></div>
<p></p>
</div>
</div>
<div id="secondary"> </div>
<div class="clearboth"></div>
</div>

I have tried a couple of different XPath queries:

require(XML)
require(httr)
doc <- content(GET('http://www.lewisthomason.com/locations/'))

xpathSApply(doc, "//div[@id = 'the_content']/div//p", xmlValue, trim = TRUE)
xpathSApply(doc, "//div[@class = 'one_fourth']//p", xmlValue, trim = TRUE)

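Both come back empty. What expression will bring back the city names, or the full addresses? I know the fourth city's div has a different class ("one_fourth last"), so I will adjust the final expression for it.

Thanks for any guidance.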

html-parsing

rvest

html

xpath

2 Answers

Stack Overflow user

Accepted answer

Answered 2014-09-22 14:17:39

rvest to the rescue, via CSS selectors (XPath works too):

library(rvest) # for scraping
library(httr)  # only for user_agent()

pg <- html_session("http://www.lewisthomason.com/locations/", 
                   user_agent("Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0"))

# get names
pg %>% html_nodes("h3") %>% html_text()

## [1] "KNOXVILLE"   "MEMPHIS"     "NASHVILLE"   "SEVIERVILLE"

# get locations
pg %>% html_nodes("h3~p") %>% html_text() %>% .[1:4]

## [1] "One Centre Square, Fifth Floor\n620 Market Street\nPO Box 2425\nKnoxville, TN 37901\nPhone (865) 546-4646\nFax (865) 523-6529"
## [2] "40 S Main St #2900\nMemphis, TN 38103\nPhone (901) 525-8721\nFax (901) 525-6722"                                              
## [3] "424 Church Street, Suite 2500\nPO Box 198615\nNashville, TN 37219\nPhone (615) 259-1366\nFax (615) 259-1389"                  
## [4] "248 Bruce St, Suite 2\nSevierville, TN 37862\nPhone (865) 429-1999\nFax (865) 428-1612"
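
Since the answer mentions that XPath works too, here is a minimal sketch of the XPath equivalents of the two CSS selectors. It runs against a small inline HTML fragment rather than the live site; the fragment and its placeholder address text are assumptions made for illustration, modelled on the markup in the question.

```r
library(rvest)

# Hypothetical stand-in for the page: two office blocks with placeholder text.
html <- '<div id="the_content">
  <div class="one_fourth"><h3>KNOXVILLE</h3><p>Knoxville address</p></div>
  <div class="one_fourth last"><h3>SEVIERVILLE</h3><p>Sevierville address</p></div>
  <div class="clearboth"></div>
  <p></p>
</div>'
pg <- read_html(html)

# City names: CSS selector and the equivalent XPath.
names_css   <- pg %>% html_nodes("h3") %>% html_text()
names_xpath <- pg %>% html_nodes(xpath = "//h3") %>% html_text()

# Addresses: the CSS general-sibling combinator "h3~p" corresponds to
# the XPath following-sibling axis.
addr_css   <- pg %>% html_nodes("h3~p") %>% html_text()
addr_xpath <- pg %>% html_nodes(xpath = "//h3/following-sibling::p") %>% html_text()

identical(names_css, names_xpath)
identical(addr_css, addr_xpath)
```

Newer rvest releases (1.0 and later) provide session() and html_elements() as replacements for html_session() and html_nodes(); the older names are kept here to match the answer.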

Votes: 3

Stack Overflow user

Answered 2014-09-22 12:43:13

The site is checking the user agent. If you supply a suitable one, it sends the correct content:

require(XML)
require(RCurl)
myAgent <- "Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0"
doc <- getURL('http://www.lewisthomason.com/locations/', useragent = myAgent)
doc <- htmlParse(doc)


> xpathSApply(doc, "//div[@id = 'the_content']/div//p", xmlValue, trim = TRUE)
[1] "One Centre Square, Fifth Floor\n620 Market Street\nPO Box 2425\nKnoxville, TN 37901\nPhone (865) 546-4646\nFax (865) 523-6529"
[2] "40 S Main St #2900\nMemphis, TN 38103\nPhone (901) 525-8721\nFax (901) 525-6722"                                              
[3] "424 Church Street, Suite 2500\nPO Box 198615\nNashville, TN 37219\nPhone (615) 259-1366\nFax (615) 259-1389"                  
[4] "248 Bruce St, Suite 2\nSevierville, TN 37862\nPhone (865) 429-1999\nFax (865) 428-1612"                                       
[5] ""                                                                                                                             
> xpathSApply(doc, "//div[@class = 'one_fourth']//p", xmlValue, trim = TRUE)
[1] "One Centre Square, Fifth Floor\n620 Market Street\nPO Box 2425\nKnoxville, TN 37901\nPhone (865) 546-4646\nFax (865) 523-6529"
[2] "40 S Main St #2900\nMemphis, TN 38103\nPhone (901) 525-8721\nFax (901) 525-6722"                                              
[3] "424 Church Street, Suite 2500\nPO Box 198615\nNashville, TN 37219\nPhone (615) 259-1366\nFax (615) 259-1389"
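
The second query above returns only three addresses because the fourth office's div has class "one_fourth last", and @class = 'one_fourth' compares the whole attribute string. One common XPath 1.0 idiom for matching a single class token is sketched below; the inline HTML and its placeholder text are assumptions that stand in for the real page.

```r
library(XML)

# Hypothetical fragment modelled on the question's markup.
html <- '<div id="the_content">
  <div class="one_fourth"><h3>KNOXVILLE</h3><p>Knoxville address</p></div>
  <div class="one_fourth"><h3>MEMPHIS</h3><p>Memphis address</p></div>
  <div class="one_fourth"><h3>NASHVILLE</h3><p>Nashville address</p></div>
  <div class="one_fourth last"><h3>SEVIERVILLE</h3><p>Sevierville address</p></div>
  <div class="clearboth"></div>
  <p></p>
</div>'
doc <- htmlParse(html, asText = TRUE)

# Exact string comparison: misses class="one_fourth last".
exact <- xpathSApply(doc, "//div[@class = 'one_fourth']//p",
                     xmlValue, trim = TRUE)

# Token match: pad the class list with spaces and look for ' one_fourth '.
token_xpath <- paste0(
  "//div[contains(concat(' ', normalize-space(@class), ' '), ' one_fourth ')]//p"
)
token <- xpathSApply(doc, token_xpath, xmlValue, trim = TRUE)

length(exact)  # three matches
length(token)  # four matches, including Sevierville
```

Unlike the first query, the token match also skips the empty trailing p element, because that element is not inside a one_fourth div.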

Without a browser-like user agent, the server sends a 403 page instead:

> getURL('http://www.lewisthomason.com/locations/')
[1] "<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML 2.0//EN\">\n<html><head>\n<title>403 Forbidden</title>\n</head><body>\n<h1>Forbidden</h1>\n<p>You don't have permission to access /locations/\non this server.</p>\n</body></html>\n"
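
This also explains the empty results in the question. As a rough sketch, parsing the 403 body shown above and running the original query against it finds nothing, because the error page contains no div with id "the_content". The variable names here are illustrative only.

```r
library(XML)

# The 403 body returned by the server, copied from the output above.
forbidden <- paste0(
  "<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML 2.0//EN\">\n",
  "<html><head>\n<title>403 Forbidden</title>\n</head><body>\n",
  "<h1>Forbidden</h1>\n",
  "<p>You don't have permission to access /locations/\non this server.</p>\n",
  "</body></html>\n"
)
doc <- htmlParse(forbidden, asText = TRUE)

# The question's first query has nothing to match on the error page.
res <- xpathSApply(doc, "//div[@id = 'the_content']/div//p",
                   xmlValue, trim = TRUE)
length(res)

# Inspecting the title is a quick way to notice that the wrong page was parsed.
page_title <- xpathSApply(doc, "//title", xmlValue)
page_title
```

When fetching with httr, as in the question, calling stop_for_status() on the response before content() should surface the 403 as an error rather than letting the error page be parsed silently.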

Votes: 3

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/25974341
