
Q: Getting around xmlTreeParse with aspNetHidden ()

Stack Overflow user

Asked 2014-01-26 22:28:07

Answers: 1 | Views: 773 | Followers: 0 | Votes: 0

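This question takes a while to introduce, so please bear with me. If you get that far, it should be a fun one to solve. This scrape will be repeated over thousands of pages on this site using a loop.

I am trying to scrape http://www.digikey.com/product-detail/en/207314-1/A25077-ND/ to capture the data in the table: Digi-Key Part Number, Quantity Available, and so on, including the right-hand side with Price Break, Unit Price, and Extended Price.

The R function readHTMLTable() does not work and only returns NULL values. The reason (I believe) is that the site hides its content using the "aspNetHidden" tag in the HTML.

For the same reason, I have also had trouble with htmlTreeParse() and xmlTreeParse(): the whole section is missing from the result.

Using the R function scrape() from the scrapeR package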

require(scrapeR)

URL<-scrape("http://www.digikey.com/product-detail/en/207314-1/A25077-ND/")

returns the complete HTML, including the lines of interest:

<th align="right">Digi-Key Part Number</th>
<td id="reportpartnumber">
<meta itemprop="productID" content="sku:A25077-ND">A25077-ND</td>

<th>Price Break</th>
<th>Unit Price</th>
<th>Extended Price
</th>
</tr>
<tr>
<td align="center">1</td>
<td align="right">2.75000</td>
<td align="right">2.75</td>

However, I cannot select nodes from this HTML. Every attempt returns the error:

no applicable method for 'xpathApply' applied to an object of class "list"

I get this error from several different functions, for example:

xpathSApply(URL,'//*[@id="pricing"]/tbody/tr[2]')

getNodeSet(URL,"//html[@class='rd-product-details-page']")

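I am not the most familiar with XPath, but I have been using Inspect Element on the web page to identify the XPath and then copying it.

Any help with this would be much appreciated!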

xpath

web-scraping

scrape

html

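1 Answer

Stack Overflow user

Accepted answer

Answered 2014-01-26 23:41:05

You haven't read the help for scrape, have you? It returns a list, and you need to take an element of that list (if parse=TRUE), and so on.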
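
To make the list point concrete, here is a minimal sketch that needs no network access. It assumes the XML package is installed, and it builds a one-element list by hand with htmlParse() to stand in for what scrape() returns with parse=TRUE. The inline HTML is trimmed from the markup in the question, and the names `html`, `pages`, `err`, and `part_number` are just illustrative.

```r
library(XML)  # provides htmlParse() and xpathSApply()

# Inline stand-in for the downloaded page (trimmed from the question's markup).
html <- '<html><body><table class="product-details">
  <tr><th align="right">Digi-Key Part Number</th>
      <td id="reportpartnumber">A25077-ND</td></tr>
</table></body></html>'

# Assumption: this hand-built list mimics the shape of scrape()'s return value,
# a list holding one parsed document per URL.
pages <- list(htmlParse(html, asText = TRUE))

# Passing the whole list fails, because there is no XPath method for a list.
err <- tryCatch(
  xpathSApply(pages, '//td[@id="reportpartnumber"]', xmlValue),
  error = function(e) conditionMessage(e)
)
print(err)

# Taking the first element yields a parsed document that XPath functions accept.
part_number <- xpathSApply(pages[[1]], '//td[@id="reportpartnumber"]', xmlValue)
print(part_number)
```

With a real scrape() result the same `[[1]]` indexing should apply, as in the session further down.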

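Also, I think the web page is doing some heavy browser detection. If I try to wget the page from the command line I get an error page, the scrape function gets something usable (but different from what you see), and Chrome gets all sorts of encoded junk. Yuck. Here's what worked for me: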

> URL<-scrape("http://www.digikey.com/product-detail/en/207314-1/A25077-ND/")
> tables = xpathSApply(URL[[1]],'//table')
> tables[[2]]
<table class="product-details" border="1" cellspacing="1" cellpadding="2">
  <tr class="product-details-top"/>
  <tr class="product-details-bottom">
    <td class="pricing-description" colspan="3" align="right">All prices are in US dollars.</td>
  </tr>
  <tr>
    <th align="right">Digi-Key Part Number</th>
    <td id="reportpartnumber"><meta itemprop="productID" content="sku:A25077-ND"/>A25077-ND</td>
    <td class="catalog-pricing" rowspan="6" align="center" valign="top">
      <table id="pricing" frame="void" rules="all" border="1" cellspacing="0" cellpadding="1">
        <tr>
          <th>Price Break</th>
          <th>Unit Price</th>
          <th>Extended Price&#13;
</th>
        </tr>
        <tr>
          <td align="center">1</td>
          <td align="right">2.75000</td>
          <td align="right">2.75</td>

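Adjust this to your use case. Here I get all the tables and show the second one, which has the information you want. Some of it can be taken directly from the pricing table: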

pricing = xpathSApply(URL[[1]],'//table[@id="pricing"]')[[1]]

> pricing
<table id="pricing" frame="void" rules="all" border="1" cellspacing="0" cellpadding="1">
  <tr>
    <th>Price Break</th>
    <th>Unit Price</th>
    <th>Extended Price&#13;
</th>
  </tr>
  <tr>
    <td align="center">1</td>
    <td align="right">2.75000</td>
    <td align="right">2.75</td>
  </tr>

And so on.

Votes: 2
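
If the goal is a data frame rather than a printed node, something along these lines may help. It is only a sketch: it assumes the XML package, uses inline HTML modelled on the pricing table above, and the second price row (10 / 2.50000 / 25.00) is invented purely so the example has more than one row. It also shows one likely reason the XPath copied from the browser's inspector found nothing: the inspector shows a tbody element, while the parsed tree printed above has tr directly under table.

```r
library(XML)  # provides htmlParse(), getNodeSet(), xpathSApply()

# Inline stand-in for the pricing table; the second data row is made up.
html <- '<html><body>
<table id="pricing" frame="void" rules="all" border="1">
  <tr><th>Price Break</th><th>Unit Price</th><th>Extended Price</th></tr>
  <tr><td align="center">1</td><td align="right">2.75000</td><td align="right">2.75</td></tr>
  <tr><td align="center">10</td><td align="right">2.50000</td><td align="right">25.00</td></tr>
</table></body></html>'
doc <- htmlParse(html, asText = TRUE)

# The inspector-style path includes tbody, which is absent from this parsed
# tree, so it matches no nodes.
with_tbody <- getNodeSet(doc, '//*[@id="pricing"]/tbody/tr[2]')
print(length(with_tbody))

# Select only rows that contain td cells, which skips the header row.
rows <- getNodeSet(doc, '//table[@id="pricing"]/tr[td]')

# Pull the text of each cell, one character vector per row, then stack them.
cells <- do.call(rbind, lapply(rows, function(r) xpathSApply(r, "./td", xmlValue)))

prices <- data.frame(
  price_break    = as.integer(cells[, 1]),
  unit_price     = as.numeric(cells[, 2]),
  extended_price = as.numeric(cells[, 3])
)
print(prices)
```

For the real page, `doc` would be replaced by `URL[[1]]` from the session above. Wrapping the extraction in a function makes it easier to call inside the loop over many product pages.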

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/21370117
