I have a website that serves a collection of files (PDFs), and I use Nutch 1.8 to index them in Solr. The base URL is http://localhost/ and the documents are stored in a series of directories under http://localhost/doccontrol/, for example:

/
|_ doccontrol
   |_ DC-10 Incoming Communications
   |_ DC-11 Outgoing Communications

If the DC-10 and DC-11 folders contain all the files to be indexed the first time I run Nutch, then Nutch crawls all of them without a problem -- good :-)
If I add a new folder or document to the root folder or to the doccontrol folder, then the next time Nutch runs it crawls all the new files and indexes them -- good :-)
However, any new file added to the DC-10 or DC-11 directories is never indexed; Nutch's output is as follows (abridged):
Injector: starting at 2014-08-29 15:19:59
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: overwrite: true
Injector: update: false
Injector: finished at 2014-08-29 15:20:02, elapsed: 00:00:02
Fri Aug 29 15:20:02 EST 2014 : Iteration 1 of 4
Generating a new segment
Generator: starting at 2014-08-29 15:20:02
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20140829152005
Generator: finished at 2014-08-29 15:20:06, elapsed: 00:00:03
Operating on segment : 20140829152005
Fetching : 20140829152005
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2014-08-29 15:20:06
Fetcher: segment: crawl/segments/20140829152005
Fetcher Timelimit set for : 1409354406733
Using queue mode : byHost
Fetcher: threads: 50
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
fetching http://ws0895/doccontrol/ (queue crawl delay=5000ms)
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
.
.
.
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Fetcher: throughput threshold: -1
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2014-08-29 15:20:09, elapsed: 00:00:02
Parsing : 20140829152005
ParseSegment: starting at 2014-08-29 15:20:09
ParseSegment: segment: crawl/segments/20140829152005
Parsed (3ms):http://ws0895/doccontrol/
ParseSegment: finished at 2014-08-29 15:20:10, elapsed: 00:00:01
CrawlDB update
CrawlDb update: starting at 2014-08-29 15:20:11
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20140829152005]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2014-08-29 15:20:12, elapsed: 00:00:01
Link inversion
LinkDb: starting at 2014-08-29 15:20:13
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: crawl/segments/20140829152005
LinkDb: merging with existing linkdb: crawl/linkdb
LinkDb: finished at 2014-08-29 15:20:15, elapsed: 00:00:02
Dedup on crawldb
Indexing 20140829152005 on SOLR index -> http://localhost:8983/solr/collection1
Indexer: starting at 2014-08-29 15:20:19
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
Indexer: finished at 2014-08-29 15:20:20, elapsed: 00:00:01
Cleanup on SOLR index -> http://localhost:8983/solr/collection1
Fri Aug 29 15:20:22 EST 2014 : Iteration 2 of 4
Generating a new segment
Generator: starting at 2014-08-29 15:20:23
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ... -- bad :-(
What I want Nutch to do is index any newly added documents, regardless of the level at which they are added.
My Nutch command is as follows:
bin/crawl urls crawl http://localhost:8983/solr/collection1 4
My nutch-site.xml contains:
<property>
<name>db.update.additions.allowed</name>
<value>true</value>
<description>If true, updatedb will add newly discovered URLs, if false
only already existing URLs in the CrawlDb will be updated and no new
URLs will be added.
</description>
</property>
<property>
<name>db.max.outlinks.per.page</name>
<value>-1</value>
<description>The maximum number of outlinks that we'll process for a page.
If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
will be processed for a page; otherwise, all outlinks will be processed.
</description>
</property>
<property>
<name>db.injector.overwrite</name>
<value>true</value>
<description>Whether existing records in the CrawlDB will be overwritten
by injected records.
</description>
</property>
<property>
<name>db.fetch.schedule.class</name>
<value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
<description>The implementation of fetch schedule. DefaultFetchSchedule simply
adds the original fetchInterval to the last fetch time, regardless of
page changes.</description>
</property>
<property>
<name>db.fetch.schedule.adaptive.min_interval</name>
<value>86400.0</value>
<description>Minimum fetchInterval, in seconds.</description>
</property>
<property>
<name>db.fetch.interval.default</name>
<value>1209600</value>
<description>The default number of seconds between re-fetches of a page (14 days).
</description>
</property>

Is what I am trying to do (re-crawl any newly added documents, at any level) impossible?
Or (more likely) am I missing something in my configuration?
Can anyone point me in the right direction?
Many thanks
Paul
Posted on 2015-06-21 05:29:39
The documents you are missing are probably not being indexed because of the db.fetch.interval.default parameter (30 days by default; your nutch-site.xml sets it to 1209600 seconds = 14 days). Until that interval elapses, Nutch will not check whether there is anything new in /DC-10. If you set it to:

<property>
<name>db.fetch.interval.default</name>
<value>86400</value>
<description>The default number of seconds between re-fetches of a page (86400 = 1 day).</description>
</property>

then you will re-crawl every day. Also, I think the Nutch user mailing list is more active for Nutch-related questions than Stack Overflow is.
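To see why the new files under DC-10/DC-11 are skipped for so long, it helps to convert these interval values into days. A quick shell sketch (plain arithmetic, nothing Nutch-specific):

```shell
# Convert Nutch fetch-interval settings (in seconds) to days.
SECONDS_PER_DAY=86400

# Value from the question's nutch-site.xml: 1209600 s.
echo "db.fetch.interval.default (current):   $((1209600 / SECONDS_PER_DAY)) days"

# Suggested value: 86400 s, so already-fetched directory listings
# become due for re-fetch after one day.
echo "db.fetch.interval.default (suggested): $((86400 / SECONDS_PER_DAY)) day"
```

With the 1209600-second interval, the /doccontrol/DC-10/ and /doccontrol/DC-11/ listing pages are not due for re-fetch for 14 days, so the generator selects 0 records and never discovers the new PDFs linked from them. You can check when a particular URL is next due with `bin/nutch readdb crawl/crawldb -url <url>`.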
https://stackoverflow.com/questions/25733425