Q: How do I use XPath to find the text I need?

Stack Overflow user

Asked 2014-12-19 09:24:17

3 answers · 159 views · 0 followers · 0 votes

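I'm using Scrapy to crawl a website, but I don't know how to parse the page and find the text I need. Below is the site's HTML; I want to extract "hello I'm here".

This is my XPath: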

//div[@class='sort_left']/p/strong/a/href/text()

HTML excerpt:

<div class="sort hottest_dishes1">
    <ul class="sort_title">
        <li class="current"><a href="/list_rest.php?a=75&s=1">按默认排序</a></li>
        <li class=""><a href="/list_rest.php?a=75&s=2">按人气排序</a></li>
    </ul>

    <ol class="sort_content">
        <li class="show">
            <div class="sort_yi">                              
                <div class="sort_left">
                    <p class="li_title">
                        <strong class="span_left ">
                            <a href="/rest/75/1879">hello I'm here<span class="restaurant_list_hot"></span></a>
                            <span> （川菜） </span>
                        </strong>
                        <span class="span_d_right3" title="馋嘴牛蛙特价只要9.9元，每单限点1份">馋嘴牛蛙特价9块9</span>
                    </p>
                    <p class="consume">
                        <strong>人均消费：</strong>
                        <b><span>¥70</span>元</b>
                        <a href="http://www.dianping.com/shop/2271520" target="_blank">看网友点评</a>
                    </p>
                    <p class="sign">
                        <strong>招牌菜：</strong>
                        <span>水煮鲶鱼 馋嘴牛蛙 酸梅汤 钵钵鸡 香辣土豆丝 毛血旺 香口猪手 ……</span>
                    </p> 
                </div>
                <div class="sort_right">
                    <a href="/rest/75/1879">看菜谱</a>
                </div>
                <div class="sort_all"  >
                    <strong>送达时间：</strong><span>60分钟</span>                                    
                </div>
            </div>

It works when I use response.css in the shell, but in my Scrapy spider it returns nothing. Did I write the code incorrectly? Here is my code:

def parse_torrent(self, response):
    torrent = TorrentItem()
    torrent['url'] = response.url
    torrent['name'] = response.xpath("//div[@class='sort_left']/p/strong/a[1]").extract()[1]
    torrent['description'] = response.xpath("//div[@id='list_content']/div/div/ol/li/div/div/p/strong[1]/following-sibling::span[1]").extract()
    torrent['size'] = response.xpath("//div[@id='list_content']/div/div/ol/li/div/div/p/span[1]").extract()
    return torrent

scrapy

xpath

3 Answers

Stack Overflow user

Answered 2014-12-19 09:32:17

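I can't see a <div> in the HTML excerpt that has an id attribute with the value 'list_content', so the [@id='list_content'] predicate filters out everything, regardless of the rest of the XPath expression. The expression evaluates to an empty sequence.

After the question was edited:

There is no <href> element in the HTML, so the .../a/href subexpression selects nothing.

href is an attribute of <a>; use .../a/@href to get the content of the href attribute.

However, if what you want is the 'hello I'm here' text, you need the content of the <a> element; use .../a/text().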

Votes: 0

Stack Overflow user

Answered 2014-12-19 09:36:25

Here is an example of what you need to do:

def parse_torrent(self, response):
    # Select the text node inside the link and print the first match.
    print(response.xpath('//div[@class="sort_left"]/p/strong/a/text()').extract()[0])

Output:

2014-12-19 10:58:28+0100 [scrapy] INFO: Scrapy 0.24.4 started (bot: skema_crawler)
2014-12-19 10:58:28+0100 [scrapy] INFO: Optional features available: ssl, http11
2014-12-19 10:58:28+0100 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'skema_crawler.spiders', 'SPIDER_MODULES': ['skema_crawler.spiders'], 'BOT_NAME': 'skema_crawler'}
2014-12-19 10:58:28+0100 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-12-19 10:58:29+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-12-19 10:58:29+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-12-19 10:58:29+0100 [scrapy] INFO: Enabled item pipelines:
2014-12-19 10:58:29+0100 [linkedin] INFO: Spider opened
2014-12-19 10:58:29+0100 [linkedin] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-12-19 10:58:29+0100 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-12-19 10:58:29+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-12-19 10:58:29+0100 [linkedin] DEBUG: Crawled (200) <GET file:///C:/1.html> (referer: None)
hello I'm here
2014-12-19 10:58:29+0100 [linkedin] INFO: Closing spider (finished)
2014-12-19 10:58:29+0100 [linkedin] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 232,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'downloader/response_bytes': 1599,
         'downloader/response_count': 1,
         'downloader/response_status_count/200': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2014, 12, 19, 9, 58, 29, 241000),
         'log_count/DEBUG': 3,
         'log_count/INFO': 7,
         'response_received_count': 1,
         'scheduler/dequeued': 1,
         'scheduler/dequeued/memory': 1,
         'scheduler/enqueued': 1,
         'scheduler/enqueued/memory': 1,
         'start_time': datetime.datetime(2014, 12, 19, 9, 58, 29, 213000)}
2014-12-19 10:58:29+0100 [linkedin] INFO: Spider closed (finished)

You can see that hello I'm here appears in the output.

You are using:

response.xpath("//div[@class='sort_left']/p/strong/a[1]").extract()[1]

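You need to add text() to your XPath so that you get the link text rather than the whole <a> element (which also contains a <span>), and you need to take element 0, not 1: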

response.xpath("//div[@class='sort_left']/p/strong/a/text()").extract()[0]

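Votes: 0

Stack Overflow user

Answered 2014-12-19 09:40:59

Personally, I find CSS selectors much easier than XPath for locating content. For the response object you get from crawling the given document, why not try response.css('p[class="li_title"] a::text')[0].extract()?

(I tested it, and it works in the Scrapy shell. Output: u"hello I'm here")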

Votes: 0

Page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/27562957
