Question: Ruby, Mongodb, Anemone: web crawler with possible memory leak?

Stack Overflow user

Asked 2012-02-22 20:46:06

2 answers, 2.6K views, 0 followers, 7 votes

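I recently started learning about web crawlers, and I built a sample crawler with Ruby and Anemone, using Mongodb for storage. I'm testing the crawler on a massive public website with possibly billions of links.

crawler.rb is indexing the correct information, but when I check memory use in Activity Monitor, it shows memory growing constantly. I have only run the crawler for about 6-7 hours, and memory is at 1.38 GB for mongod and 1.37 GB for the Ruby process. It seems to grow by about 100 MB every hour or so.

It seems I might have a memory leak? Is there a better way to achieve the same crawl without memory escalating out of control, so that it can run longer?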

# Sample web_crawler.rb with Anemone, Mongodb and Ruby.

require 'anemone'

# do not store the page's body.
module Anemone
  class Page
    def to_hash
      {'url' => @url.to_s,
       'links' => links.map(&:to_s),
       'code' => @code,
       'visited' => @visited,
       'depth' => @depth,
       'referer' => @referer.to_s,
       'fetched' => @fetched}
    end
    def self.from_hash(hash)
      page = self.new(URI(hash['url']))
      {'@links' => hash['links'].map { |link| URI(link) },
       '@code' => hash['code'].to_i,
       '@visited' => hash['visited'],
       '@depth' => hash['depth'].to_i,
       '@referer' => hash['referer'],
       '@fetched' => hash['fetched']
      }.each do |var, value|
        page.instance_variable_set(var, value)
      end
      page
    end
  end
end


Anemone.crawl("http://www.example.com/", :discard_page_bodies => true, :threads => 1, :obey_robots_txt => true, :user_agent => "Example - Web Crawler", :large_scale_crawl => true) do | anemone |
  anemone.storage = Anemone::Storage.MongoDB

  #only crawl pages that contain /example in url
  anemone.focus_crawl do |page|
    links = page.links.delete_if do |link|
      (link.to_s =~ /example/).nil?
    end
  end

  # only process pages in the /example directory
  anemone.on_pages_like(/example/) do | page |
    regex = /some type of regex/
    example = page.doc.css('#example_div').inner_html.gsub(regex,'') rescue next

    # Append the extracted text to a file, skipping empty results
    unless example.nil? || example.empty?
      File.open('example.txt', 'a') { |f| f.puts example }
    end
    page.discard_doc!
  end
end

Tags: ruby, mongodb, memory-leaks, web-crawler, anemone

2 Answers

Stack Overflow user

Accepted answer

Posted 2012-04-28 02:58:39

I'm having trouble with this too, but I'm using redis as the datastore.

This is my crawler:

require "rubygems"
require "anemone"

opts = {discard_page_bodies: true, skip_query_strings: true, depth_limit: 2000, read_timeout: 10}

File.open("results.csv", "a") do |result_file|

  # Read urls.csv one row at a time; row_[1] holds the host or URL to crawl
  File.foreach("urls.csv") do |row|

    row_ = row.strip.split(',')
    url = row_[1].start_with?("http://") ? row_[1] : "http://#{row_[1]}"

    Anemone.crawl(url, opts) do |anemone|
      anemone.storage = Anemone::Storage.Redis
      puts "crawling #{url}"
      anemone.on_every_page do |page|

        next if page.body.nil?

        # Record the row when the page body contains the search text
        if page.body.downcase.include?("sometext")
          puts "found one at #{url}"
          result_file.puts "#{row_[0]},#{row_[1]}"
        end # end if

      end # end on_every_page

    end # end crawl

  end # end File.foreach

  # we're done
  puts "We're done."

end # end File.open

I applied the patch from here to the core.rb file in the anemone gem:

35       # Prevent page_queue from using excessive RAM. Can indirectly limit rate of crawling. You'll additionally want to use discard_page_bodies and/or a non-memory 'storage' option
36       :max_page_queue_size => 100,

...

(the following used to be on line 155)

157       page_queue = SizedQueue.new(@opts[:max_page_queue_size])
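
To show why swapping in a SizedQueue caps memory, here is a minimal standalone sketch using only the Ruby standard library. It is not Anemone code: MAX_QUEUE, the producer/consumer threads, and the "page-N" strings are made up for illustration. The point is that push blocks once the queue is full, so a fast producer cannot pile up more than the configured number of items while a slow consumer catches up.

```ruby
# Standalone illustration of back-pressure with SizedQueue (Ruby stdlib).
# MAX_QUEUE, producer, consumer and the "page-N" items are hypothetical
# names for this sketch; none of this is Anemone's own code.
require "thread"

MAX_QUEUE = 3
queue = SizedQueue.new(MAX_QUEUE)
peak = 0
lock = Mutex.new

# Fast producer: stands in for the threads that fetch pages.
producer = Thread.new do
  20.times do |i|
    queue.push("page-#{i}") # blocks while the queue already holds MAX_QUEUE items
    lock.synchronize { peak = [peak, queue.size].max }
  end
  queue.push(:done) # sentinel telling the consumer to stop
end

# Slow consumer: stands in for the code that processes each page.
processed = []
consumer = Thread.new do
  while (item = queue.pop) != :done
    sleep 0.005 # simulate slow page processing
    processed << item
  end
end

[producer, consumer].each(&:join)
puts "processed #{processed.size} pages, peak queue length #{peak}"
```

With a plain Queue the producer would finish immediately and all 20 items would sit in memory at once; with the sized variant the observed queue length never goes above MAX_QUEUE, at the cost of the producer waiting.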

I also have an hourly cron job:

#!/usr/bin/env python
import redis
r = redis.Redis()
r.flushall()

to try to keep redis's memory usage down. I'm restarting a huge crawl now, so let's see how it goes!

I'll report back with the results...

Votes: 3

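Stack Overflow user

Posted 2012-03-26 21:31:43

I'm doing something similar, and I think you may simply be creating a lot of data.

You're not saving the page body, so that should already help with memory requirements.

The only other improvement I can think of is to use Redis instead of Mongo, as I've found it more scalable for Anemone's storage.

Check the size of your data in mongo. I found I was saving a huge number of rows.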
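
One way to tell "lots of data" apart from a leak inside the Ruby process is to sample the interpreter's own counters while the crawl runs. The sketch below is only an illustration under stated assumptions: memory_snapshot is a hypothetical helper (not part of Anemone), the retained array stands in for whatever the crawler keeps alive, and GC.stat[:heap_live_slots] assumes Ruby 2.2 or newer. If these counters stay flat while the process size reported by the OS keeps climbing, the growth is more likely outside the Ruby object heap.

```ruby
# Hypothetical helper for this sketch; not part of Anemone.
# Returns a few counters describing the Ruby object heap right now.
def memory_snapshot
  {
    live_slots: GC.stat[:heap_live_slots],           # object slots currently in use
    strings: ObjectSpace.count_objects[:T_STRING],   # string objects on the heap
    gc_runs: GC.count                                # GC runs so far
  }
end

before = memory_snapshot

# Stand-in for data a crawler keeps referenced (URLs, link lists, and so on).
retained = Array.new(50_000) { |i| "row-#{i}" }

after = memory_snapshot

puts "before: #{before}"
puts "after:  #{after}"
puts "retained #{retained.size} strings"
```

In a real crawl the same helper could be called every few thousand pages, for example from inside an on_every_page block, and the snapshots appended to a log for comparison over time.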

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/9395026
