I have a website that serves a collection of files (PDFs), and I use Nutch 1.8 to index them in Solr. The base URL is http://localhost/ and the documents are stored in a series of directories under http://localhost/doccontrol/, for example:

/
|_ doccontrol
   |_ DC-10 Incoming Communications
   |_ DC-11 Outgoing Communications

If the DC-10 and DC-11 folders contain all the files to be indexed the first time I run Nutch, then Nutch crawls all of them without a problem -- good :-)
If I add a new folder or document to the root folder or to the doccontrol folder, then the next time Nutch runs it crawls all the new files and indexes them -- good :-)
However, any new file added to the DC-10 or DC-11 directories is never indexed; Nutch's output is as follows (abridged):
Injector: starting at 2014-08-29 15:19:59
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: overwrite: true
Injector: update: false
Injector: finished at 2014-08-29 15:20:02, elapsed: 00:00:02
Fri Aug 29 15:20:02 EST 2014 : Iteration 1 of 4
Generating a new segment
Generator: starting at 2014-08-29 15:20:02
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20140829152005
Generator: finished at 2014-08-29 15:20:06, elapsed: 00:00:03
Operating on segment : 20140829152005
Fetching : 20140829152005
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2014-08-29 15:20:06
Fetcher: segment: crawl/segments/20140829152005
Fetcher Timelimit set for : 1409354406733
Using queue mode : byHost
Fetcher: threads: 50
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
fetching http://ws0895/doccontrol/ (queue crawl delay=5000ms)
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
.
.
.
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Fetcher: throughput threshold: -1
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2014-08-29 15:20:09, elapsed: 00:00:02
Parsing : 20140829152005
ParseSegment: starting at 2014-08-29 15:20:09
ParseSegment: segment: crawl/segments/20140829152005
Parsed (3ms):http://ws0895/doccontrol/
ParseSegment: finished at 2014-08-29 15:20:10, elapsed: 00:00:01
CrawlDB update
CrawlDb update: starting at 2014-08-29 15:20:11
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20140829152005]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2014-08-29 15:20:12, elapsed: 00:00:01
Link inversion
LinkDb: starting at 2014-08-29 15:20:13
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: crawl/segments/20140829152005
LinkDb: merging with existing linkdb: crawl/linkdb
LinkDb: finished at 2014-08-29 15:20:15, elapsed: 00:00:02
Dedup on crawldb
Indexing 20140829152005 on SOLR index -> http://localhost:8983/solr/collection1
Indexer: starting at 2014-08-29 15:20:19
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
Indexer: finished at 2014-08-29 15:20:20, elapsed: 00:00:01
Cleanup on SOLR index -> http://localhost:8983/solr/collection1
Fri Aug 29 15:20:22 EST 2014 : Iteration 2 of 4
Generating a new segment
Generator: starting at 2014-08-29 15:20:23
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ... -- bad :-(
What I want Nutch to do is index any newly added documents, regardless of the level at which they are added.
My Nutch command is as follows:
bin/crawl urls crawl http://localhost:8983/solr/collection1 4
My nutch-site.xml contains:
<property>
<name>db.update.additions.allowed</name>
<value>true</value>
<description>If true, updatedb will add newly discovered URLs, if false
only already existing URLs in the CrawlDb will be updated and no new
URLs will be added.
</description>
</property>
<property>
<name>db.max.outlinks.per.page</name>
<value>-1</value>
<description>The maximum number of outlinks that we'll process for a page.
If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
will be processed for a page; otherwise, all outlinks will be processed.
</description>
</property>
<property>
<name>db.injector.overwrite</name>
<value>true</value>
<description>Whether existing records in the CrawlDB will be overwritten
by injected records.
</description>
</property>
<property>
<name>db.fetch.schedule.class</name>
<value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
<description>The implementation of fetch schedule. DefaultFetchSchedule simply
adds the original fetchInterval to the last fetch time, regardless of
page changes.</description>
</property>
<property>
<name>db.fetch.schedule.adaptive.min_interval</name>
<value>86400.0</value>
<description>Minimum fetchInterval, in seconds.</description>
</property>
<property>
<name>db.fetch.interval.default</name>
<value>1209600</value>
<description>The default number of seconds between re-fetches of a page (14 days).
</description>
</property>

Is what I am trying to do (re-crawl any newly added documents, at any level) impossible?
Or (more likely) am I missing something in my configuration?
Can anyone point me in the right direction?
Many thanks
Paul
Posted on 2015-06-21 05:29:39
The documents you are missing are probably not being indexed because of the db.fetch.interval.default parameter (30 days by default; your nutch-site.xml sets it to 1209600 seconds = 14 days). Until that interval elapses, Nutch will not check whether there is anything new in /DC-10. If you set it to:

<property>
<name>db.fetch.interval.default</name>
<value>86400</value>
<description>The default number of seconds between re-fetches of a page (86400 = 1 day).</description>
</property>

then you will re-crawl every day. Also, I think the Nutch user mailing list is more active for Nutch-related questions than Stack Overflow is.
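To see why the new files under DC-10/DC-11 are skipped for so long, it helps to convert these interval values into days. A quick shell sketch (plain arithmetic, nothing Nutch-specific):

```shell
# Convert Nutch fetch-interval settings (in seconds) to days.
SECONDS_PER_DAY=86400

# Value from the question's nutch-site.xml: 1209600 s.
echo "db.fetch.interval.default (current):   $((1209600 / SECONDS_PER_DAY)) days"

# Suggested value: 86400 s, so already-fetched directory listings
# become due for re-fetch after one day.
echo "db.fetch.interval.default (suggested): $((86400 / SECONDS_PER_DAY)) day"
```

With the 1209600-second interval, the /doccontrol/DC-10/ and /doccontrol/DC-11/ listing pages are not due for re-fetch for 14 days, so the generator selects 0 records and never discovers the new PDFs linked from them. You can check when a particular URL is next due with `bin/nutch readdb crawl/crawldb -url <url>`.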
https://stackoverflow.com/questions/25733425