
My Scrapy crawler can't find nested a href tags

Stack Overflow user
Asked 2017-02-14 21:13:27

I wrote a Scrapy crawler, shown here:

import sys, getopt
import scrapy
from scrapy.spiders import Spider
from scrapy.http    import Request
import re

class TutsplusItem(scrapy.Item):
  title = scrapy.Field()

class MySpider(Spider):
  name = "tutsplus"
  allowed_domains   = ["bbc.com"]
  start_urls = ["http://www.bbc.com/"]
  crawling_level=None

  def __init__(self,crawling_level, *args):
      MySpider.crawling_level=crawling_level
      super(MySpider, self).__init__(self)

  def parse(self, response):
    links = response.xpath('//a/@href').extract()
    print("Links are %s" %links)
    print ("Crawling level is %s " %MySpider.crawling_level )


    # We stored already crawled links in this list
    level=MySpider.crawling_level
    crawledLinks = []

    # Pattern to check proper link
    # I only want to get the tutorial posts
    # linkPattern = re.compile("^\/tutorials\?page=\d+")


    for link in links:
      # If it is a proper link and is not checked yet, yield it to the Spider
      #if linkPattern.match(link) and not link in crawledLinks:
      if not link in crawledLinks and level>0:
        link = "http://www.bbc.com" + link
        crawledLinks.append(link)
        yield Request(link, self.parse)


    titles = response.xpath('//a[contains(@class, "media__link")]/@*').extract()
    #titles = response.xpath('//a/@href').extract()
    print ("Titles are %s" %titles )

    count=0
    for title in titles:
      item = TutsplusItem()
      item["title"] = title
      print("Title is : %s" %title)
      yield item

However, there is a problem in my code with this line:

titles = response.xpath('//a[contains(@class, "media__link")]').extract()

It does not return any links. The HTML is as follows:

<h3 class="media__title">
  <a class="media__link" href="/news/world-us-canada-38965557" rev="hero1|headline" >
  Trump adviser quits over Russia contacts</a>
</h3>

My output titles are always empty. Is there something wrong with my XPath?


1 Answer

Stack Overflow user

Accepted answer

Posted 2017-02-15 00:16:26

The XPath is incorrect! Use the Chrome dev tools to debug the XPath:

"//a[@class='media__link']/@href"

titles = response.xpath("//a[@class='media__link']/@href").extract()
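
As a rough way to sanity-check the expression outside of a crawl, here is a minimal sketch that runs the same class predicate against the HTML snippet from the question. It uses only Python's standard library: `xml.etree.ElementTree` stands in for Scrapy's selector (an assumption made for illustration, not what Scrapy uses internally), and because ElementTree supports only a subset of XPath, the `/@href` step is replaced by `.get("href")`. The helper name `extract_hrefs` is hypothetical.

```python
import xml.etree.ElementTree as ET

# HTML snippet copied from the question; it happens to be well-formed XML,
# which is why ElementTree can parse it directly.
HTML = """<h3 class="media__title">
  <a class="media__link" href="/news/world-us-canada-38965557" rev="hero1|headline" >
  Trump adviser quits over Russia contacts</a>
</h3>"""


def extract_hrefs(markup):
    """Return the href of every <a> whose class is exactly 'media__link'."""
    root = ET.fromstring(markup)
    # ElementTree cannot select attributes with '/@href', so read them with .get()
    return [a.get("href") for a in root.findall(".//a[@class='media__link']")]


hrefs = extract_hrefs(HTML)
print(hrefs)
```

If the expression matches here but the spider still yields nothing, the HTML that the spider receives may differ from what the browser shows, so it may be worth inspecting `response.text` in the spider.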
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/42227122
