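I need to aggregate content (mostly HTML pages and PDF documents) from several different websites. I am currently trying out Heritrix (3.2.0) to see whether it meets my needs.
Although the documentation is very detailed, the engine does not seem to work the way I expect. I set up a few simple jobs and configured the DecideRules in many different ways, but whatever I do, Heritrix either downloads far too much content or nothing at all.
Here is an example of what I am trying to do. I point Heritrix at a URL like this: ...example.com/news/speeches. That page has an HTML table with links to the individual speeches (e.g. example.com/news/speech/speech1.html, example.com/news/speech/speech2.html, and so on). I only need the HTML and PDF documents one level below the parent page. I want to stop Heritrix from navigating more than 1 level deep, stop it from pulling content that is not under this specific path of the example.com domain, prevent it from navigating to other domains, and restrict it to HTML and PDF content.
The configuration below is what I think should work, but it does not: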
<bean id="longerOverrides" class="org.springframework.beans.factory.config.PropertyOverrideConfigurer">
<property name="properties">
<props>
<prop key="seeds.textSource.value">
# URLS HERE
example.com/news/speeches
</prop>
</props>
</property>
</bean>
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
<!-- <property name="logToFile" value="false" /> -->
<property name="rules">
<list>
<!-- Begin by REJECTing all... -->
<bean class="org.archive.modules.deciderules.RejectDecideRule">
</bean>
<!-- ...then ACCEPT those within configured/seed-implied SURT prefixes... -->
<bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
<!-- <property name="seedsAsSurtPrefixes" value="true" /> -->
<!-- <property name="alsoCheckVia" value="false" /> -->
<!-- <property name="surtsSourceFile" value="" /> -->
<!-- <property name="surtsDumpFile" value="${launchId}/surts.dump" /> -->
<property name="surtsSource">
<bean class="org.archive.spring.ConfigString">
<property name="value">
<value>
example.com/news/speeches
</value>
</property>
</bean>
</property>
</bean>
<!-- ...and REJECT those from a configurable (initially empty) set of URI regexes... -->
<bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
<property name="decision" value="REJECT"/>
<property name="listLogicalOr" value="true" />
<property name="regexList">
<list>
<value>.*(?i)(\.(avi|wmv|mpe?g|mp3))$</value>
<value>.*(?i)(\.(rar|zip|tar|gz))$</value>
<value>.*(?i)(\.(xls|odt))$</value>
<value>.*(?i)(\.(xml))$</value>
<value>.*(?i)(\.(txt|conf|pdf))$</value>
<value>.*(?i)(\.(swf))$</value>
<value>.*(?i)(\.(js|css))$</value>
<value>.*(?i)(\.(bmp|gif|jpe?g|png|svg|tiff?))$</value>
</list>
</property>
</bean>
<!-- ...but REJECT those more than a configured link-hop-count from start... -->
<bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
<!-- <property name="maxHops" value="20" /> -->
</bean>
<!-- ...but ACCEPT those more than a configured link-hop-count from start... -->
<!--bean class="org.archive.modules.deciderules.TransclusionDecideRule"-->
<!-- <property name="maxTransHops" value="2" /> -->
<!-- <property name="maxSpeculativeHops" value="1" /> -->
<!--/bean-->
<!-- ...but REJECT those from a configurable (initially empty) set of REJECT SURTs... -->
<bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
<property name="decision" value="REJECT"/>
<property name="seedsAsSurtPrefixes" value="false"/>
<property name="surtsDumpFile" value="${launchId}/negative-surts.dump" />
<!-- <property name="surtsSource">
<bean class="org.archive.spring.ConfigFile">
<property name="path" value="negative-surts.txt" />
</bean>
</property> -->
</bean>
<!-- ...and REJECT those from a configurable (initially empty) set of URI regexes... -->
<bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
<property name="decision" value="REJECT"/>
<!-- <property name="listLogicalOr" value="true" /> -->
<!-- <property name="regexList">
<list>
</list>
</property> -->
</bean>
<!-- ...and REJECT those with suspicious repeating path-segments... -->
<bean class="org.archive.modules.deciderules.PathologicalPathDecideRule">
<!-- <property name="maxRepetitions" value="2" /> -->
</bean>
<!-- ...and REJECT those with more than threshold number of path-segments... -->
<bean class="org.archive.modules.deciderules.TooManyPathSegmentsDecideRule">
<!-- <property name="maxPathDepth" value="20" /> -->
</bean>
<!-- ...but always ACCEPT those marked as prerequisitee for another URI... -->
<bean class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule">
</bean>
<!-- ...but always REJECT those with unsupported URI schemes -->
<bean class="org.archive.modules.deciderules.SchemeNotInSetDecideRule">
</bean>
</list>
</property>
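</bean>
I expected the crawler to fetch only a dozen or so HTML documents, since that is all the /speech path contains. After about half an hour I stopped the crawl: it had already downloaded 800+ documents, and I found that it was traversing back up to parent paths. I also tried RegEx rules, without success. Any help would be appreciated.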
Posted on 2017-01-10 23:16:58
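A useful step when debugging this kind of problem is to enable logging of scope decisions: uncomment the line containing logToFile and set it to true. The log then shows, for each URI, which rule made the decision to accept or reject it. That lets you see which of your rules is misconfigured and is accepting URIs that should have been rejected.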
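As a minimal sketch of that change, applied to the scope bean from the question: the only difference is that the logToFile property is uncommented and set to true. The rule list is elided here and stays exactly as configured. The log location is an assumption on my part; I believe Heritrix writes these decisions to scope.log in the job's logs directory, but check your own job directory.

```xml
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <!-- Previously commented out with value="false"; now enabled so that
       every scope decision is logged together with the rule that made it.
       Assumption: the output lands in the job's logs/scope.log. -->
  <property name="logToFile" value="true" />
  <property name="rules">
    <list>
      <!-- ...the same DecideRule beans as in the question, unchanged... -->
    </list>
  </property>
</bean>
```

After relaunching the job, look up a URI that should not have been crawled (for example one under a parent path) and note which rule the log names as having decided on it; that rule is the one to revisit.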
https://stackoverflow.com/questions/32016535