
Why doesn't my web crawler method find all the links?

Stack Overflow user
Asked on 2015-04-11 15:36:41
2 answers · 367 views · 0 followers · 0 votes

I'm trying to build a simple web crawler, so I wrote the following:

(The method get_links fetches the links from the parent link we want to search.)

```ruby
require 'nokogiri'
require 'open-uri'

def get_links(link)
    link = "http://#{link}"
    doc = Nokogiri::HTML(open(link))
    links = doc.css('a')
    hrefs = links.map {|link| link.attribute('href').to_s}.uniq.delete_if {|href| href.empty?}
    array = hrefs.select {|i| i[0] == "/"}
    host = URI.parse(link).host
    links_list = array.map {|a| "#{host}#{a}"}
end
```

(The method search_links takes the array from get_links and searches each entry in it.)

```ruby
def search_links(urls)
    urls = get_links(link)
    urls.uniq.each do |url|
        begin
            links = get_links(url)
            compare = urls & links
            urls << links - compare
            urls.flatten!
        rescue OpenURI::HTTPError
            warn "Skipping invalid link #{url}"
        end
    end
    return urls
end
```

This approach finds most of the links on a site, but not all of them.

What am I doing wrong? Which algorithm should I use?

2 Answers

Stack Overflow user

Accepted answer

Posted on 2015-04-11 16:45:18

Some comments about your code:

```ruby
def get_links(link)
  link = "http://#{link}"
  # You're assuming the protocol is always http.
  # This isn't the only protocol used on the web.

  doc = Nokogiri::HTML(open(link))

  links = doc.css('a')
  hrefs = links.map {|link| link.attribute('href').to_s}.uniq.delete_if {|href| href.empty?}
  # You can write these two lines more compactly as
  #   hrefs = doc.xpath('//a/@href').map(&:to_s).uniq.delete_if(&:empty?)

  array = hrefs.select {|i| i[0] == "/"}
  # I guess you want to handle URLs that are relative to the host.
  # However, URLs relative to the protocol (starting with '//')
  # will also be selected by this condition.

  host = URI.parse(link).host
  links_list = array.map {|a| "#{host}#{a}"}
  # The value assigned to links_list will implicitly be returned.
  # (The assignment itself is futile; the right-hand part alone would
  # suffice.) Because this builds on `array`, all absolute URLs will be
  # missing from the return value.
end
```

Explanation

```ruby
hrefs = doc.xpath('//a/@href').map(&:to_s).uniq.delete_if(&:empty?)
```

  • `.xpath('//a/@href')` uses XPath's attribute syntax to fetch the `href` attributes of the `a` elements directly.
  • `.map(&:to_s)` is shorthand for `.map { |item| item.to_s }`.
  • `.delete_if(&:empty?)` uses the same shorthand notation.
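The `&:sym` shorthand relies on Ruby's `Symbol#to_proc` and works with any method that takes a block. A small self-contained illustration of that pipeline, using made-up sample data instead of parsed HTML:

```ruby
# `&:to_s` converts the symbol into a block that calls `to_s`
# on each element (Symbol#to_proc), i.e. { |item| item.to_s }.
items = [:foo, "", :bar, ""]

strings = items.map(&:to_s)                 # ["foo", "", "bar", ""]
cleaned = strings.uniq.delete_if(&:empty?)  # drop duplicates, then blanks

cleaned  # => ["foo", "bar"]
```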

And some comments on the second function:

```ruby
def search_links(urls)
  urls = get_links(link)
  urls.uniq.each do |url|
    begin
      links = get_links(url)

      compare = urls & links
      urls << links - compare
      urls.flatten!
      # How about using a Set instead of an Array and
      # thus have the collection provide uniqueness of
      # its items, so that you don't have to?

    rescue OpenURI::HTTPError
      warn "Skipping invalid link #{url}"
    end
  end
  return urls
  # This function isn't recursive; it just calls `get_links` on two
  # 'levels'. Thus you search only two levels deep and return findings
  # from the first and second levels combined. (Without the "zero'th"
  # level - the URL passed into `search_links` - unless of course it
  # also occurred on the first or second level.)
  #
  # Is this what you intended?
end
```
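The Set suggestion above can be sketched as a small worklist loop that keeps crawling until no unvisited URL remains, instead of stopping at two levels. The link graph here is a made-up stub standing in for real `get_links` HTTP calls:

```ruby
require 'set'

# Hypothetical stub: maps a URL to the URLs found on that page.
# In the real crawler this would be the HTTP-fetching get_links.
LINK_GRAPH = {
  "example.com"   => ["example.com/a", "example.com/b"],
  "example.com/a" => ["example.com/b", "example.com/c"],
  "example.com/b" => [],
  "example.com/c" => ["example.com"]
}.freeze

def get_links(url)
  LINK_GRAPH.fetch(url, [])
end

def search_links(start_url)
  visited = Set.new
  queue = [start_url]
  until queue.empty?
    url = queue.shift
    # Set#add? returns nil if the URL was already present,
    # so each page is fetched at most once.
    next unless visited.add?(url)
    queue.concat(get_links(url))
  end
  visited
end

search_links("example.com")
# => all four URLs, including "example.com/c" on the third level
```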
0 votes

Stack Overflow user

Posted on 2015-04-11 22:52:45

You should probably use Mechanize:

```ruby
require 'mechanize'

agent = Mechanize.new
page = agent.get url
links = page.search('a[href]').map {|a| page.uri.merge(a[:href]).to_s}
# if you want to remove links with a different host (hyperlinks?)
links.reject! {|l| URI.parse(l).host != page.uri.host}
```

Otherwise you will have a hard time correctly converting relative URLs to absolute URLs.
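The `page.uri.merge` call above is just RFC 3986 reference resolution, which is also available in Ruby's standard `URI` library, so the same fix applies if you stay with open-uri and Nokogiri. A sketch with an illustrative base URL, covering the cases the accepted answer flags (host-relative, document-relative, protocol-relative, and absolute hrefs):

```ruby
require 'uri'

base = URI("http://example.com/dir/page.html")

base.merge("/about").to_s               # => "http://example.com/about"
base.merge("other.html").to_s           # => "http://example.com/dir/other.html"
base.merge("//cdn.example.org/x.js").to_s  # => "http://cdn.example.org/x.js"
base.merge("https://other.com/").to_s   # => "https://other.com/"
```

This is why building URLs with string interpolation (`"#{host}#{a}"`) silently drops absolute and protocol-relative hrefs: `URI#merge` handles all four cases uniformly.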

0 votes
Original page content provided by Stack Overflow (translation by Tencent Cloud's engine).
Original link: https://stackoverflow.com/questions/29579672