在生成一组新的URL以获取Nutch方面有问题。
下面是我使用的命令:
# $NUTCH_HOME/runtime/local/bin/nutch generate -topN 10结果:
Generator: starting at 2016-12-14 19:16:50
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: running in local mode, generating exactly one partition.
Generator: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/root/bob/-topN/current
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:329)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:320)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:862)
at org.apache.nutch.crawl.Generator.generate(Generator.java:589)
at org.apache.nutch.crawl.Generator.run(Generator.java:764)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.Generator.main(Generator.java:717)我遗漏了什么?
发布于 2016-12-15 17:03:13
您的命令中缺少参数,查看一下https://wiki.apache.org/nutch/bin/nutch%20generate,基本上您需要提供爬虫数据库和段子目录的路径。
https://stackoverflow.com/questions/41150561
复制相似问题