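I have been learning Nutch (version Nutch-1.14) for the past week, and it works fine both in local mode and on Hadoop-2.7.2 (pseudo-distributed mode). Today I came across the "take.screenshot" and "screenshot.location" properties in nutch-site.xml. After setting these properties, Nutch still crawls the seed URLs, but it does not capture screenshots in either local mode or Hadoop mode.
nutch-site.xml settings for local mode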
<property>
<name>take.screenshot</name>
<value>true</value>
<description>
Boolean property determining whether the protocol-htmlunit
WebDriver should capture a screenshot of the URL. If set to
true remember to define the 'screenshot.location'
property as this determines the location screenshots should be
persisted to on HDFS. If that property is not set, screenshots
are simply discarded.
</description>
</property>
<property>
<name>screenshot.location</name>
<value>/home/user/nutch-1.14/screenshot</value>
<description>
The location on disk where a URL screenshot should be saved
to if the 'take.screenshot' property is set to true.
By default this is null, in this case screenshots held in memory
are simply discarded.
</description>
</property>
nutch-site.xml settings for Hadoop mode
<property>
<name>take.screenshot</name>
<value>true</value>
</property>
<property>
<name>screenshot.location</name>
<value>/screenshot</value>
</property>
Note: the "screenshot" directory exists in HDFS.
Answered 2018-02-23 10:55:44
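Have you enabled protocol-selenium? Basically, screenshots only work with that protocol plugin. By default Nutch uses the protocol-http plugin, which does not support this option, so the settings have no effect even when you enable them in the configuration.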
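As a rough sketch of what enabling the plugin might look like, the fragment below overrides plugin.includes in nutch-site.xml so that protocol-selenium replaces protocol-http. The rest of the plugin list is an assumption for illustration only; keep whatever plugins your crawl already uses and swap just the protocol entry. protocol-selenium also needs a browser and matching WebDriver available on the machine running the fetcher.

```xml
<!-- Sketch only: everything except protocol-selenium in this list is an
     assumed example; adapt it to the plugins your crawl already uses. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```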
Answered 2018-02-23 07:17:51
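HtmlUnit is a "GUI-less browser for Java programs" (see http://htmlunit.sourceforge.net/). This means HtmlUnit does not render the HTML page at all. Internally, every operation is performed on the DOM tree, without any layout. That is why there is no option to take a screenshot.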
https://stackoverflow.com/questions/48915154