
Question: How to scrape tables with different XPaths on the same level with Scrapy?

Stack Overflow user

Asked 2014-08-01 20:19:57

3 answers, 2.6K views, 0 followers, score 1

I have this HTML (simplified):

<td class="pad10">
  <div class="button-left" style="margin-bottom: 4px">04.09.2013</div>
  <table width="100%" class="record generic schedule margin-4"></table>
  <table width="100%" class="record generic schedule margin-4"></table>
  <div class="button-left" style="margin-bottom: 4px">05.10.2013</div>
  <table width="100%" class="record generic schedule margin-4"></table>
  <table width="100%" class="record generic schedule margin-4"></table>
  <table width="100%" class="record generic schedule margin-4"></table>
  <table width="100%" class="record generic schedule margin-4"></table>
</td>

I want to get a dict structure like this (a "row" is the content of one of the tables, which are separated by dates in the main table):

{'04.09.2013': [1 row, 2 row],
 '05.10.2013': [1 row, 2 row, 3 row, 4 row]}

I can extract all the 'div' elements with:

dt = s.xpath('//div[contains(@class, "button-left")]')

I can extract all the 'table' elements with:

tables = s.xpath('//table[contains(@class, "record generic schedule margin-4")]')

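But I don't know how to link 'dt' with the corresponding 'tables' in the Scrapy parser. Is it possible to set up a condition while scraping, such as: when you find a 'div', extract all of the following 'table' elements until you reach the next 'div'?

From Chrome, I get the following example XPaths for two of the elements:

//*[@id="wrap"]/table/tbody/tr/td/table[3]/tbody/tr/td/div[2]
//*[@id="wrap"]/table/tbody/tr/td/table[3]/tbody/tr/td/table[1]

Maybe they will help to picture the full table structure.

Solution (thanks to @marven):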

    from scrapy.selector import Selector

    s = Selector(response)

    table = {}
    current_key = None
    # Walk the direct children of the cell in document order.
    for e in s.xpath('//td[@class="pad10"]/*'):
        # A boolean XPath expression is extracted as the string '1' or '0'.
        if bool(int(e.xpath('@class="button-left"').extract()[0])):
            # A date div starts a new group.
            current_key = e.xpath('text()').extract()[0]
        elif bool(int(e.xpath('@class="record generic schedule margin-4"').extract()[0])):
            # Store the table markup under the most recent date.
            table.setdefault(current_key, []).append(e.extract())

Tags: html, xpath, scrapy, python

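3 Answers

Stack Overflow user

Accepted answer

Posted 2014-08-02 12:38:21

What you can do is select all the nodes and loop over them, checking whether the current node is a div or a table.

Using this as my test case,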

<div class="asdf">
  <div class="button-left" style="margin-bottom: 4px">04.09.2013</div>
  <table width="100%" class="record generic schedule margin-4">1</table>
  <table width="100%" class="record generic schedule margin-4">2</table>
  <div class="button-left" style="margin-bottom: 4px">05.10.2013</div>
  <table width="100%" class="record generic schedule margin-4">3</table>
  <table width="100%" class="record generic schedule margin-4">4</table>
  <table width="100%" class="record generic schedule margin-4">5</table>
  <table width="100%" class="record generic schedule margin-4">6</table>
</div>

I use the following loop to iterate over the nodes and keep track of which div the current node is "in".

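currdiv = None
mydict = {}
# sel is a Selector for the page; iterate over the container's direct children.
for e in sel.xpath('//div[@class="asdf"]/*'):
    # The boolean XPath result is extracted as '1' (true) or '0' (false).
    if bool(int(e.xpath('@class="button-left"').extract()[0])):
        # A date div: remember it and start a new list.
        currdiv = e.xpath('text()').extract()[0]
        mydict[currdiv] = []
    elif currdiv is not None:
        # Any other node is filed under the most recent date div.
        mydict[currdiv] += e.xpath('text()').extract()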

The result is:

{u'04.09.2013': [u'1', u'2'], u'05.10.2013': [u'3', u'4', u'5', u'6']}
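
To experiment with the grouping logic outside a Scrapy project, the loop above can be sketched with only the standard library. This is a minimal sketch, not the answer's own code: it swaps Scrapy selectors for xml.etree.ElementTree, assumes the markup is well-formed enough to parse as XML, and uses a hypothetical helper name, group_tables_by_date.

```python
import xml.etree.ElementTree as ET

# Same test markup as above, with <td> as the container element.
HTML = """
<td class="pad10">
  <div class="button-left" style="margin-bottom: 4px">04.09.2013</div>
  <table width="100%" class="record generic schedule margin-4">1</table>
  <table width="100%" class="record generic schedule margin-4">2</table>
  <div class="button-left" style="margin-bottom: 4px">05.10.2013</div>
  <table width="100%" class="record generic schedule margin-4">3</table>
  <table width="100%" class="record generic schedule margin-4">4</table>
  <table width="100%" class="record generic schedule margin-4">5</table>
  <table width="100%" class="record generic schedule margin-4">6</table>
</td>
"""


def group_tables_by_date(container):
    """Map each date <div> to the <table> siblings that follow it.

    Hypothetical helper: ElementTree stands in for Scrapy selectors here.
    """
    grouped = {}
    current_date = None
    for child in container:  # direct children, in document order
        classes = child.get("class", "").split()
        if child.tag == "div" and "button-left" in classes:
            # A date div opens a new group.
            current_date = (child.text or "").strip()
            grouped[current_date] = []
        elif child.tag == "table" and current_date is not None:
            # A table belongs to the most recent date div.
            grouped[current_date].append((child.text or "").strip())
    return grouped


result = group_tables_by_date(ET.fromstring(HTML))
print(result)
```

Tables that appear before the first date div are skipped here, which matches the `currdiv is not None` guard in the loop above.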

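Score: 0

Stack Overflow user

Posted 2014-08-01 20:57:29

With that specific format, you could do something like this:

Get the parent element: t = s.xpath('//div[contains(@class, "button-left")]/..')

Get the first div: t.xpath('./div[1]') (you might have to use [position()=1])

Get the first two rows: t.xpath('./table[position() < 3]')

Get the second div: t.xpath('./div[2]')

Get the rest of the tables: t.xpath('./table[position() > 2]')

This is very brittle: if the HTML changes, this code will stop working. It is hard to answer the question from the simplified HTML you provided, without knowing whether the structure is static or whether it will change in the future. I would have asked these questions in a comment, but I don't have enough rep :P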

Sources:

How to read attribute of a parent node from a child node in XSLT

What is the xpath to select a range of nodes?

https://stackoverflow.com/a/2407881/2368836

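Score: 0

Stack Overflow user

Posted 2014-08-01 22:50:16

See whether this approach works for your case.

Using the same approach as in the linked question, we can basically filter the <table> elements down to those that have a specific <div> as both a preceding sibling and a following sibling. For example (using the criteria from the XPaths you posted for getting the <div> and <table> elements):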

//table
    [contains(@class, "record generic schedule margin-4")]
    [
        preceding-sibling::div[contains(@class, "button-left")] 
            and 
        following-sibling::div[contains(@class, "button-left")]
    ]
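
It is worth checking which tables this expression actually matches. The sketch below emulates its predicates in plain Python with the standard library's xml.etree.ElementTree (which does not support the preceding-sibling and following-sibling axes), run against the accepted answer's test markup. The helper names are hypothetical, and the markup is assumed to be well-formed.

```python
import xml.etree.ElementTree as ET

HTML = """
<div class="asdf">
  <div class="button-left" style="margin-bottom: 4px">04.09.2013</div>
  <table width="100%" class="record generic schedule margin-4">1</table>
  <table width="100%" class="record generic schedule margin-4">2</table>
  <div class="button-left" style="margin-bottom: 4px">05.10.2013</div>
  <table width="100%" class="record generic schedule margin-4">3</table>
  <table width="100%" class="record generic schedule margin-4">4</table>
  <table width="100%" class="record generic schedule margin-4">5</table>
  <table width="100%" class="record generic schedule margin-4">6</table>
</div>
"""

TABLE_CLASS = "record generic schedule margin-4"


def is_date_div(el):
    """Mirror of div[contains(@class, "button-left")] (hypothetical helper)."""
    return el.tag == "div" and "button-left" in el.get("class", "")


def tables_between_date_divs(container):
    """Return the text of tables with a date div both before and after them."""
    children = list(container)
    matched = []
    for i, el in enumerate(children):
        if el.tag != "table" or TABLE_CLASS not in el.get("class", ""):
            continue
        has_preceding = any(is_date_div(s) for s in children[:i])
        has_following = any(is_date_div(s) for s in children[i + 1:])
        if has_preceding and has_following:
            matched.append(el.text)
    return matched


between = tables_between_date_divs(ET.fromstring(HTML))
print(between)
```

On this markup only tables 1 and 2 match, because tables 3 to 6 have no date div after them. The last group would need a separate rule, for example one that drops the following-sibling condition.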

Score: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/25088066
