
Nutch Crawler读取分段结果

Asked by a Stack Overflow user on 2013-06-21 18:39:57
1 answer · 1.2K views · 0 followers · 1 vote

I am crawling with the apache-nutch-1.6 crawler. After the crawl finishes, I try to read the content of the crawl results with the command

 bin/nutch readseg -dump crawl/segments/* segmentAllContent

but it fails with this error:

 Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/ubuntu/nutch/framework/apache-nutch-1.6/blogs/segments/2013062110/crawl_generate
    Input path does not exist: file:/home/ubuntu/nutch/framework/apache-nutch-1.6/blogs/segments/2013062110/crawl_fetch
    Input path does not exist: file:/home/ubuntu/nutch/framework/apache-nutch-1.6/blogs/segments/2013062110/crawl_parse
    Input path does not exist: file:/home/ubuntu/nutch/framework/apache-nutch-1.6/blogs/segments/2013062110/content
    Input path does not exist: file:/home/ubuntu/nutch/framework/apache-nutch-1.6/blogs/segments/2013062110/parse_data
    Input path does not exist: file:/home/ubuntu/nutch/framework/apache-nutch-1.6/blogs/segments/2013062110/parse_text
            at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
            at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
            at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
            at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989)
            at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981)
            at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
            at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
            at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
            at java.security.AccessController.doPrivileged(Native Method)
            at javax.security.auth.Subject.doAs(Subject.java:416)
            at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
            at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
            at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
            at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
            at org.apache.nutch.segment.SegmentReader.dump(SegmentReader.java:224)
            at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:572)

How can I read the HTML content after crawling? Thanks in advance.


1 Answer

Stack Overflow user

Answered on 2013-10-04 15:59:28

I would usually try merging all the segments first:

bin/nutch mergesegs crawl/merged crawl/segments/*

and then:

bin/nutch readseg -dump crawl/merged/* segmentAllContent
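Before merging and dumping, it can help to confirm that each segment is actually complete — the exception in the question lists exactly the subdirectories (`crawl_generate`, `crawl_fetch`, `crawl_parse`, `content`, `parse_data`, `parse_text`) that a finished, parsed segment should contain, and an interrupted or unparsed crawl leaves some of them missing. The sketch below is an assumption about a useful pre-check, not part of the original answer; the `/tmp/seg_demo` path and `check_segment` helper are hypothetical names used only for illustration.

```shell
# Hypothetical helper: verify a Nutch segment directory contains the six
# subdirectories that readseg -dump expects, before running it.
check_segment() {
  seg="$1"
  for part in crawl_generate crawl_fetch crawl_parse content parse_data parse_text; do
    if [ ! -d "$seg/$part" ]; then
      echo "missing: $seg/$part"
      return 1
    fi
  done
  echo "complete: $seg"
}

# Demo on a mock segment layout (hypothetical path, created just for this sketch):
mkdir -p /tmp/seg_demo/2013062110/crawl_generate /tmp/seg_demo/2013062110/crawl_fetch \
         /tmp/seg_demo/2013062110/crawl_parse   /tmp/seg_demo/2013062110/content \
         /tmp/seg_demo/2013062110/parse_data    /tmp/seg_demo/2013062110/parse_text
check_segment /tmp/seg_demo/2013062110
```

Running the real check over `crawl/segments/*` would show which segment is incomplete; re-running the fetch/parse steps for that segment (or excluding it from the merge) should make the `mergesegs` + `readseg -dump` sequence above succeed.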

3 votes
Original page content provided by Stack Overflow; translation supported by Tencent Cloud's IT-domain engine.
Original link:

https://stackoverflow.com/questions/17233197
