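I'm trying to scrape a set of pages with Mechanize and JRuby. I'm using JRuby for its multithreading, because the program is a bit slow on MRI. However, I'm running into what appear to be non-thread-safe data types in Mechanize and the http-cookie gem. In particular, I get errors like this: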
RuntimeError: can't add a new key into hash during iteration
[]= at org/jruby/RubyHash.java:991
push at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/history.rb:28
add_to_history at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize.rb:1290
get at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize.rb:441
(root) at main.rb:82
open_uri at /Users/user/.rvm/rubies/jruby-1.7.19/lib/ruby/1.9/open-uri.rb:150
open at /Users/user/.rvm/rubies/jruby-1.7.19/lib/ruby/1.9/open-uri.rb:678
open at /Users/user/.rvm/rubies/jruby-1.7.19/lib/ruby/1.9/open-uri.rb:33
(root) at main.rb:80
The seemingly offending code in Mechanize is here:
def push(page, uri = nil)
  super page
  index = uri ? uri : page.uri
  @history_index[index.to_s] = page # offending line
  shift while length > @max_size if @max_size
  self
end
When I comment out the code in lib/mechanize.rb that adds visited pages to the history, this particular error goes away, only to be replaced by a very similar error from the http-cookie gem:
RuntimeError: can't add a new key into hash during iteration
[]= at org/jruby/RubyHash.java:991
add at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie_jar/hash_store.rb:56
add at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie_jar.rb:108
add at (eval):3
add at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/cookie_jar.rb:22
parse at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie_jar.rb:192
parse at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie.rb:322
scan_set_cookie at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie/scanner.rb:212
parse at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie.rb:281
tap at org/jruby/RubyKernel.java:1886
parse at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie.rb:280
parse at (eval):3
parse at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/cookie.rb:37
parse at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie_jar.rb:191
save_cookies at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:857
response_cookies at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:845
each at org/jruby/RubyArray.java:1613
response_cookies at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:844
fetch at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:282
post_form at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize.rb:1281
submit at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize.rb:548
submit at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/form.rb:223
(root) at main.rb:92
And here is a very similar thing going on in http-cookie:
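def add(cookie)
  path_cookies = ((@jar[cookie.domain] ||= {})[cookie.path] ||= {})
  path_cookies[cookie.name] = cookie # offending line
  cleanup if (@gc_index += 1) >= @gc_threshold
  self
end
Likewise, when I comment out the code in http-cookie that adds the cookie, the error goes away. But then my program stops scraping data correctly, presumably because I have removed the above functionality from the gems I'm using. The strangest part of all this is that the program only fails after it has scraped a certain number of pages, so I wonder whether I'm doing something wrong. I would share my code, but it is a somewhat private program and I would rather share only the parts that are needed. By the way, my program works fine on MRI, albeit a bit slowly.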
So, I guess my question is: are Mechanize and its dependencies incompatible with multithreading in JRuby? Or am I doing something wrong?
Posted on 2015-11-29 14:27:18
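It looks like you're running into concurrent modification of some Hash instances. It's hard to blame either you or the gems at this point, but gems such as http-cookie might not be "truly thread-safe" (only thread-safe under MRI's GIL), especially since there is synchronization code to be found in each.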
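For context, the RuntimeError in both traces is the one Ruby raises when a new key is added to a Hash that is currently being iterated. The minimal single-threaded sketch below triggers the same message deterministically. It does not reproduce the race itself, and the hash contents and variable names are made up for illustration. Under JRuby, where threads really run in parallel, one thread may be iterating while another inserts, which is presumably how the gems end up here.

```ruby
# Illustration only: add a key to a Hash while iterating over it.
# The names (jar, "a.example") are hypothetical, not taken from the gems.
jar = { "a.example" => 1 }
message = nil

begin
  jar.each do |_domain, _value|
    jar["b.example"] = 2 # new key during iteration -> RuntimeError
  end
rescue RuntimeError => e
  message = e.message
end

puts message
```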
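This is most likely a bug, although you could also work around these issues in your own code by introducing some locks (hopefully without hurting concurrent performance too much); it really depends on the use case. If you can come up with a simple, reproducible multithreaded .rb test case, I would report an issue against http-cookie (I haven't checked the other gems).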
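A rough sketch of the locking idea, assuming all access to the shared object can be routed through one place: the Guarded class below is hypothetical (it is not part of Mechanize or http-cookie), and a plain Hash stands in for the shared state such as the history index or cookie jar.

```ruby
# Hypothetical wrapper: every access to the wrapped object goes
# through a single Mutex, so only one thread touches it at a time.
class Guarded
  def initialize(target)
    @target = target
    @lock = Mutex.new
  end

  # Yields the wrapped object while holding the lock.
  def use
    @lock.synchronize { yield @target }
  end
end

# A plain Hash stands in for the shared, non-thread-safe state.
shared = Guarded.new({})

threads = 8.times.map do |t|
  Thread.new do
    1_000.times { |i| shared.use { |h| h["#{t}-#{i}"] = i } }
  end
end
threads.each(&:join)

total = shared.use { |h| h.size }
puts total
```

With Mechanize, the wrapped object would be the shared agent, e.g. shared.use { |agent| agent.get(url) }, but that serializes the requests as well. Giving each thread its own Mechanize.new should avoid sharing the history and cookie jar altogether, at the cost of not sharing cookies between threads; which trade-off is acceptable depends on the use case.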
https://stackoverflow.com/questions/33943928