How do I use Rcrawler's ExtractXpathPat to extract multiple items from one HTML page?

Stack Overflow user

Asked 2020-03-02 21:13:38


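I am trying to get the labels and data of museum collection items with Rcrawler. I think I made a mistake with the ExtractXpathPat argument, but I can't figure out how to fix it.

I expect output like this: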

1;"Titel(s)";"De StaalmeestersDe waardijns van het Amsterdamse lakenbereidersgilde, bekend als ‘De Staalmeesters’"
1;"Objecttype";"Schilderij"
1;"Objectnummer";"SK-A-2931"

However, the output file repeats the title in the third position:

1;"Titel(s)";"De StaalmeestersDe waardijns van het Amsterdamse lakenbereidersgilde, bekend als ‘De Staalmeesters’"
1;"Objecttype";"De StaalmeestersDe waardijns van het Amsterdamse lakenbereidersgilde, bekend als ‘De Staalmeesters’"
1;"Objectnummer";"De StaalmeestersDe waardijns van het Amsterdamse lakenbereidersgilde, bekend als ‘De Staalmeesters’"

The HTML looks like this:

<div class="item">
      <h3 class="item-label h4-like">Objectnummer</h3>
      <p class="item-data">SK-A-2931</p>
</div>

My call looks like this:

Rcrawler(Website = "https://www.rijksmuseum.nl/nl/", 
         no_cores = 4, no_conn = 4,
         dataUrlfilter = '.*/collectie/.*',
         ExtractXpathPat = c('//*[@class="item-label h4-like"]', '//*[@class="item-data"]'), 
         PatternsNames = c('label','data'),
         ManyPerPattern = TRUE)

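Pages do not always have the same labels, and sometimes a label has no corresponding data. Sometimes the data is in a paragraph, sometimes in an unordered list.

My final goal is to create a CSV that contains all labels of the site, with the corresponding data in each row.

This question is the first step toward collecting the labels and data, which I will then use to create that CSV.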

xpath

web-crawler

rcrawler

1 Answer

Stack Overflow user

Accepted answer

Answered 2020-03-03 02:10:41

I don't use Rcrawler for scraping, but I think your XPaths need fixing. I rewrote them for you:

Rcrawler(Website = "https://www.rijksmuseum.nl/nl/",
         no_cores = 4, no_conn = 4,
         dataUrlfilter = '.*/collectie/.*',
         # One XPath per label: find the h3 with that exact text, then take the data next to it
         ExtractXpathPat = c(
           "//h3[@class='item-label h4-like'][.='Titel(s)']/following-sibling::p/text()",
           "//h3[@class='item-label h4-like'][.='Objecttype']/following::a[1]/text()",
           "//h3[@class='item-label h4-like'][.='Objectnummer']/following-sibling::p/text()"),
         PatternsNames = c("Titel(s)", "Objecttype", "Objectnummer"),
         ManyPerPattern = TRUE)

I ran it for a few minutes and it seems to work:

DATA[[1]]
$`PageID`
[1] 1

$`Titel(s)`
[1] "De Staalmeesters"                                                                   
[2] "De waardijns van het Amsterdamse lakenbereidersgilde, bekend als ‘De Staalmeesters’"

$Objecttype
[1] "schilderij"

$Objectnummer
[1] "SK-C-6"
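
To get from this list to the semicolon-separated layout the question asks for, something like the following might work. It is only a sketch in base R: the DATA object here is a hand-built stand-in that copies the output printed above, page_to_rows is a hypothetical helper name, and joining multiple matches with a space is an assumption about how repeated values should be merged.

```r
# Stand-in for the DATA list printed above (values copied from that output).
DATA <- list(
  list(
    PageID = 1,
    `Titel(s)` = c("De Staalmeesters",
                   "De waardijns van het Amsterdamse lakenbereidersgilde, bekend als ‘De Staalmeesters’"),
    Objecttype = "schilderij",
    Objectnummer = "SK-C-6"
  )
)

# Hypothetical helper: turn one page (a named list) into one row per label.
page_to_rows <- function(page) {
  fields <- setdiff(names(page), "PageID")
  data.frame(
    PageID = page$PageID,
    label  = fields,
    # A pattern can match several nodes; join them into a single cell.
    data   = unname(vapply(page[fields], paste, character(1), collapse = " ")),
    stringsAsFactors = FALSE
  )
}

long <- do.call(rbind, lapply(DATA, page_to_rows))

# Semicolon-separated, character columns quoted, no header row.
write.table(long, sep = ";", row.names = FALSE, col.names = FALSE)
```

Passing a file name as the `file` argument of write.table would write the same rows to disk instead of the console.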

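More options:

Brute force. Since you don't yet know all the label names, and if you don't want to write label-specific XPaths, you could try something like this in Rcrawler's ExtractXpathPat: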

c("string((//h3[@class='item-label h4-like'])[1]/parent::*)","string((//h3[@class='item-label h4-like'])[2]/parent::*)",...,"string((//h3[@class='item-label h4-like'])[30]/parent::*)")

Here we simply increment the position from 1 to 30. You can try 40 or 50; it's up to you.

PatternsNames = c("Item1", "Item2", ..., "Item30")
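
Rather than typing thirty patterns by hand, both vectors can be generated. A minimal sketch, assuming 30 positions as in the text above; n_items, xpaths and pattern_names are illustrative variable names, not Rcrawler arguments:

```r
# Number of label positions to probe; 30 follows the text above and can be raised.
n_items <- 30

# sprintf() is vectorised, so one template yields one XPath per position.
xpaths <- sprintf(
  "string((//h3[@class='item-label h4-like'])[%d]/parent::*)",
  seq_len(n_items)
)

# Matching names: "Item1", "Item2", ..., "Item30".
pattern_names <- paste0("Item", seq_len(n_items))

head(xpaths, 2)
head(pattern_names, 2)
```

The two vectors could then be passed as ExtractXpathPat = xpaths and PatternsNames = pattern_names.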

Example result:

Item1:Title(s) The Seven Works of MercyPolyptych with the Seven Works of Charity 
Item2:Object type painting 
Item3:Object number SK-A-2815
...
Item17:Parts The Seven Works of Mercy (SK-A-2815-1) The Seven Works of Mercy (SK-A-2815-2) The Seven Works of Mercy (SK-A-2815-3) The Seven Works of Mercy (SK-A-2815-4) The Seven Works of Mercy (SK-A-2815-5) The Seven Works of Mercy (SK-A-2815-6) The Seven Works of Mercy (SK-A-2815-7)
...
Item29:
Item30:

You then need to tidy the data (split, clean, reshape, ...) with appropriate tools (dplyr, stringr) to produce a proper CSV.
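
One possible shape for that tidying step is sketched below. It assumes each extracted string is whitespace-normalised like the example result, and that a list of candidate labels is available; the items and labels vectors and the split_item helper are all hypothetical. Base R is used only to keep the sketch free of dependencies; dplyr and stringr would work just as well.

```r
# Hypothetical extracted strings for one page, shaped like the example result.
items <- c("Title(s) The Seven Works of Mercy",
           "Object type painting",
           "Object number SK-A-2815",
           "")

# Candidate labels, assumed to be known before splitting.
labels <- c("Title(s)", "Object type", "Object number")

# Hypothetical helper: split one string into its label and the remaining data.
split_item <- function(item, labels) {
  hit <- labels[startsWith(item, labels)]
  if (length(hit) == 0) return(NULL)      # no known label: drop the item
  label <- hit[which.max(nchar(hit))]     # prefer the longest matching label
  data.frame(label = label,
             data  = trimws(substring(item, nchar(label) + 1)),
             stringsAsFactors = FALSE)
}

# Skip empty items (positions past the last label on the page).
rows <- do.call(rbind, lapply(items[nzchar(items)], split_item, labels = labels))
rows <- cbind(PageID = 1L, rows)

write.table(rows, sep = ";", row.names = FALSE, col.names = FALSE)
```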

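If this option doesn't work, you can determine all the label names that may occur: collect //h3[@class='item-label h4-like']/text() from the pages and remove duplicates to keep the unique values. Then write the XPaths accordingly. That way the .csv is easier to generate.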
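
A sketch of that second approach: deduplicate the collected labels, then build one XPath per label from a template. The labels_per_page values are made up for illustration, and the template is an assumption: `following-sibling::*[1]//text()` is meant to pick up the data whether it sits in a paragraph or a list, but it has not been checked against the live site and would not cover cases like the `following::a[1]` pattern used for Objecttype above.

```r
# Hypothetical labels as they might come back from a first crawl of a few pages.
labels_per_page <- list(
  c("Titel(s)", "Objecttype", "Objectnummer"),
  c("Titel(s) ", "Objectnummer", "Onderdelen")
)

# Flatten, strip stray whitespace, and keep each label once.
labels <- unique(trimws(unlist(labels_per_page)))

# One XPath per label. Labels containing a single quote would need escaping,
# which this sketch does not handle.
xpaths <- sprintf(
  "//h3[@class='item-label h4-like'][.='%s']/following-sibling::*[1]//text()",
  labels
)

labels
xpaths[1]
```

Using labels as PatternsNames keeps each extracted field named after the label it belongs to.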

You can also work outside Rcrawler (with other tools) and write some functions to scrape the data properly (using apply functions or for loops).
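
As an illustration of working outside Rcrawler, the sketch below uses the xml2 package on an inline HTML string, so no network access is needed. Only the first div mirrors the markup quoted in the question; the list variant and the label without data are assumed structures, and extract_item is a hypothetical helper. The idea is to walk each div.item and pair its label with whatever data it contains, so the labels do not have to be known in advance.

```r
library(xml2)

# Inline sample. Only the first div is taken from the question; the other two
# are assumptions about how a list value and an empty label might look.
html <- '<html><body>
<div class="item">
  <h3 class="item-label h4-like">Objectnummer</h3>
  <p class="item-data">SK-A-2931</p>
</div>
<div class="item">
  <h3 class="item-label h4-like">Objecttype</h3>
  <ul class="item-data"><li><a href="#">schilderij</a></li></ul>
</div>
<div class="item">
  <h3 class="item-label h4-like">Leeg</h3>
</div>
</body></html>'

page  <- read_html(html)
items <- xml_find_all(page, "//div[@class='item'][h3[@class='item-label h4-like']]")

# Hypothetical helper: pair the label of one item with its data, if any.
extract_item <- function(item) {
  label <- xml_text(xml_find_first(item, "./h3"), trim = TRUE)
  data  <- xml_text(xml_find_all(item, ".//*[@class='item-data']"), trim = TRUE)
  data.frame(
    label = label,
    # NA marks a label that has no corresponding data.
    data  = if (length(data) > 0) paste(data, collapse = " | ") else NA_character_,
    stringsAsFactors = FALSE
  )
}

result <- do.call(rbind, lapply(seq_along(items), function(i) extract_item(items[[i]])))
result
```

For real pages, read_html() would be given each collection URL in turn, and the per-page results bound together with a page identifier.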

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/60496753
