I am trying to crawl certain sites with nutch-1.12, but for some of the sites in the seed list the fetch does not work properly:
http://www.nature.com/ (1)
https://www.theguardian.com/international (2)
http://www.geomar.de (3)

As you can see in the log below, (2) and (3) work fine, while fetching (1) results in a timeout, even though the same link works fine in a browser. I have already tried increasing the wait times drastically, to no avail, and since I don't want to keep raising them, I would like to know whether there is another way to determine why this timeout occurs and how to fix it.
Log:
Injector: starting at 2017-02-27 18:33:38
Injector: crawlDb: nature_crawl/crawldb
Injector: urlDir: urls-2
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 0
Injector: Total urls injected after normalization and filtering: 3
Injector: Total urls injected but already in CrawlDb: 0
Injector: Total new urls injected: 3
Injector: finished at 2017-02-27 18:33:42, elapsed: 00:00:03
Generator: starting at 2017-02-27 18:33:45
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: running in local mode, generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: nature_crawl/segments/20170227183349
Generator: finished at 2017-02-27 18:33:51, elapsed: 00:00:05
Fetcher: starting at 2017-02-27 18:33:53
Fetcher: segment: nature_crawl/segments/20170227183349
Fetcher: threads: 3
Fetcher: time-out divisor: 2
QueueFeeder finished: total 3 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching https://www.theguardian.com/international (queue crawl delay=1000ms)
Using queue mode : byHost
fetching http://www.nature.com/ (queue crawl delay=1000ms)
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
fetching http://www.geomar.de/ (queue crawl delay=1000ms)
robots.txt whitelist not configured.
robots.txt whitelist not configured.
robots.txt whitelist not configured.
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=2
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
.
.
.
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
fetch of http://www.nature.com/ failed with: java.net.SocketTimeoutException: Read timed out
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=0
-activeThreads=0
Fetcher: finished at 2017-02-27 18:34:18, elapsed: 00:00:24
ParseSegment: starting at 2017-02-27 18:34:21
ParseSegment: segment: nature_crawl/segments/20170227183349
Parsed (507ms):http://www.geomar.de/
Parsed (344ms):https://www.theguardian.com/international
ParseSegment: finished at 2017-02-27 18:34:24, elapsed: 00:00:03
CrawlDb update: starting at 2017-02-27 18:34:26
CrawlDb update: db: nature_crawl/crawldb
CrawlDb update: segments: [nature_crawl/segments/20170227183349]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2017-02-27 18:34:30, elapsed: 00:00:03

Posted on 2017-02-28 00:38:05
You can try increasing the HTTP timeout setting in nutch-site.xml:
<property>
<name>http.timeout</name>
<value>30000</value>
<description>The default network timeout, in milliseconds.</description>
</property>

Otherwise, check whether the site's robots.txt allows crawling of its pages.
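A quick way to verify that manually is to fetch the robots.txt file and look for rules that apply to your crawler. This is just a sketch; it assumes curl is available and uses www.nature.com as the example host:

# Print the User-agent, Disallow and Crawl-delay rules from robots.txt
curl -s http://www.nature.com/robots.txt | grep -iE '^(user-agent|disallow|crawl-delay)'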
Posted on 2017-02-27 20:03:58
Not sure why, but www.nature.com keeps the connection hanging if the user agent string contains "Nutch". This is also reproducible with wget:
wget -U 'my-test-crawler/Nutch-1.13-SNAPSHOT (mydotmailatexampledotcom)' -d http://www.nature.com/
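If that is indeed the cause, a possible workaround is to configure an agent string that does not contain "Nutch". A minimal sketch of the relevant nutch-site.xml properties (the values are placeholders; Nutch assembles the User-Agent header from http.agent.name, http.agent.version and related properties, and the default http.agent.version contains "Nutch"):

<property>
  <name>http.agent.name</name>
  <value>my-test-crawler</value>
  <description>Name part of the User-Agent header.</description>
</property>
<property>
  <name>http.agent.version</name>
  <value>1.0</value>
  <description>Placeholder version string; overrides the default,
  which contains "Nutch".</description>
</property>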
https://stackoverflow.com/questions/42493248