Q: Scrapy spider only extracts the first table element

Stack Overflow user

Asked on 2019-05-03 08:33:44

Answers: 1 · Views: 21 · Followers: 0 · Votes: 0

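I am trying to scrape this URL: 'search.siemens.com/en/?q=iot'. To start with, I am only interested in the title and the category, as illustrated in the screenshot below. However, when I run my spider I only get the first element: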

{'titel': 'MindSphere – open ', 
'category': ' operating system - Software - Siemens Global Website'}

Here is my spider:

import scrapy

class SiemensHtmlSpider(scrapy.Spider):
    name = 'siemens_html'
    allowed_domains = ['search.siemens.com/en/?q=iot']
    start_urls = ['http://search.siemens.com/en/?q=iot/']

    def parse(self, response):
        #//dl[@id='search-resultlist']/dt/a
        for element in response.xpath("//dl[@id='search-resultlist']"):
            yield {
                'titel': element.xpath('//dt/a/text()[1]').extract_first(),
                'category': element.xpath('//dt/a/text()[2]').extract_first()
            }

My screenshot:

python-3.x

xpath

scrapy

1 Answer

Stack Overflow user

Accepted answer

Answered on 2019-05-03 08:36:05

Replace

yield {
    'titel': element.xpath('//dt/a/text()[1]').extract_first(),
    'category': element.xpath('//dt/a/text()[2]').extract_first()
}

with:

yield {
    'titel': element.xpath('.//dt/a/text()[1]').extract_first(),
    'category': element.xpath('.//dt/a/text()[2]').extract_first()
}

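Note the dot at the start of each XPath expression: it makes the path relative to the current element. Without the dot, an expression starting with // is evaluated from the document root even when it is called on a sub-selector, so it always returns the first match in the whole document.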

UPD: a small tip: also check your allowed_domains value. It should contain only the domain name, without a path or query string: allowed_domains = ['search.siemens.com']
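As a rough illustration of what a bare domain means here, the hostname can be derived from the start URL with the standard library. This is only a sketch for checking the value by hand; the variable names are hypothetical and Scrapy does not require this step.

```python
from urllib.parse import urlparse

start_url = "http://search.siemens.com/en/?q=iot"

# hostname drops the scheme, path and query string, leaving only the domain.
domain = urlparse(start_url).hostname
allowed_domains = [domain]

# The value from the question has no scheme, so urlparse finds no hostname
# in it at all: the whole string is treated as a path.
bad_value = "search.siemens.com/en/?q=iot"
bad_hostname = urlparse(bad_value).hostname

print(allowed_domains)
print(bad_hostname)
```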

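UPD2: the main selector in the for loop is also a problem. It selects the dl container itself (presumably a single element), so the loop body runs only once. It is better to be more precise and iterate over the concrete list of result blocks: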

for element in response.xpath("//dl[@id='search-resultlist']/dt"):
    yield {
        'titel': element.xpath('.//a/text()[1]').get(),
        'category': element.xpath('.//a/text()[2]').get()
    }

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/55966072
