Q: How can I use Anemone to "crawl" only the root URL?

Stack Overflow user

Asked 2013-01-09 10:48:35

Answers: 1 · Views: 1.6K · Followers: 0 · Votes: 3

In the example below, I want Anemone to run only on the root URL (example.com). I'm not sure whether I should use the on_pages_like method and, if so, what pattern I need.

require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.on_pages_like(???) do |page|
    # some code to execute
  end
end

ruby-on-rails-3

ruby-on-rails

ruby

1 Answer

Stack Overflow user

Accepted answer

Posted 2013-01-09 11:28:40

require 'anemone'
Anemone.crawl("http://www.example.com/", :depth_limit => 1) do |anemone|
  # some code to execute
end
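On the on_pages_like part of the question: here is a minimal sketch, assuming on_pages_like takes a regular expression and tests it against each page's URL string. ROOT_URL_PATTERN, root_url? and crawl_root_only are names made up for this example, not part of Anemone. The :depth_limit => 1 is taken from the snippet above; the pattern is what restricts the block to the root page.

```ruby
# Illustrative names only: ROOT_URL_PATTERN, root_url? and crawl_root_only
# are not part of Anemone.

# Matches the bare site root, with or without a trailing slash, and nothing deeper.
ROOT_URL_PATTERN = %r{\Ahttp://www\.example\.com/?\z}

# Returns true when the given URL (String or URI) is the root URL.
def root_url?(url)
  ROOT_URL_PATTERN.match?(url.to_s)
end

# The gem is required inside the method so this file loads without it;
# calling the method performs a real crawl over the network.
def crawl_root_only
  require 'anemone'
  Anemone.crawl("http://www.example.com/", :depth_limit => 1) do |anemone|
    # Assumption: the block fires only for pages whose URL matches the pattern.
    anemone.on_pages_like(ROOT_URL_PATTERN) do |page|
      puts "root page: #{page.url}"
    end
  end
end
```

You can check the pattern on its own with root_url?("http://www.example.com/about") before running a crawl. If the goal is to avoid fetching any other page at all, :depth_limit => 0 may be worth trying, but I have not verified that here.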

You can also specify the following in the options hash; the values shown are the defaults:

# run 4 Tentacle threads to fetch pages
:threads => 4,
# disable verbose output
:verbose => false,
# don't throw away the page response body after scanning it for links
:discard_page_bodies => false,
# identify self as Anemone/VERSION
:user_agent => "Anemone/#{Anemone::VERSION}",
# no delay between requests
:delay => 0,
# don't obey the robots exclusion protocol
:obey_robots_txt => false,
# by default, don't limit the depth of the crawl
:depth_limit => false,
# number of times HTTP redirects will be followed
:redirect_limit => 5,
# storage engine defaults to Hash in +process_options+ if none specified
:storage => nil,
# Hash of cookie name => value to send with HTTP requests
:cookies => nil,
# accept cookies from the server and send them back?
:accept_cookies => false,
# skip any link with a query string? e.g. http://foo.com/?u=user
:skip_query_strings => false,
# proxy server hostname
:proxy_host => nil,
# proxy server port number
:proxy_port => false,
# HTTP read timeout in seconds
:read_timeout => nil
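Any of these defaults can be overridden by passing a hash as the second argument to Anemone.crawl, the same way :depth_limit is passed above. A rough sketch of a gentler crawl follows; POLITE_OPTIONS and polite_crawl are hypothetical names for this example, and the specific values are just one reasonable choice, not a recommendation from the Anemone docs.

```ruby
# Hypothetical overrides; every key comes from the defaults listed above.
POLITE_OPTIONS = {
  :threads         => 1,     # one fetcher thread instead of 4
  :delay           => 1,     # wait between requests
  :obey_robots_txt => true,  # respect the robots exclusion protocol
  :depth_limit     => 1      # same limit as in the answer's snippet
}

# Hypothetical helper: merges caller overrides into POLITE_OPTIONS and crawls.
# The gem is required inside the method so this file loads without it.
def polite_crawl(url, overrides = {})
  require 'anemone'
  Anemone.crawl(url, POLITE_OPTIONS.merge(overrides)) do |anemone|
    anemone.on_every_page { |page| puts page.url }
  end
end
```

For example, polite_crawl("http://www.example.com/", :verbose => true) would keep the settings above and only turn on verbose output.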

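In my personal experience, Anemone is not very fast and has a lot of corner cases. The documentation is lacking (as you have experienced), and the author does not seem to be maintaining the project. YMMV. I briefly tried Nutch but did not play with it much; it did seem faster, though. No benchmarks, sorry.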

Votes: 6

Original page content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/14227555
