Q: Java 8 CompletableFuture web crawler does not crawl beyond one URL

Stack Overflow user

Asked 2015-01-08 23:24:10

Answers: 1, Views: 997, Followers: 0, Votes: 0

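I'm playing with the concurrency features newly introduced in Java 8, working through exercises from the book "Java SE 8 for the Really Impatient" by Cay S. Horstmann. I created the following web crawler using the new CompletableFuture and jsoup. The basic idea is: given a URL, it finds the first m URLs on that page and repeats the process n times. m and n are parameters, of course (breadth and depth in the code below). The problem is that the program fetches the URLs for the initial page but doesn't recurse. What am I missing?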

static class WebCrawler {
    CompletableFuture<Void> crawl(final String startingUrl,
        final int depth, final int breadth) {
        if (depth <= 0) {
            return completedFuture(startingUrl, depth);
        }

        final CompletableFuture<Void> allDoneFuture = allOf((CompletableFuture[]) of(
            startingUrl)
            .map(url -> supplyAsync(getContent(url)))
            .map(docFuture -> docFuture.thenApply(getURLs(breadth)))
            .map(urlsFuture -> urlsFuture.thenApply(doForEach(
                depth, breadth)))
            .toArray(size -> new CompletableFuture[size]));

        allDoneFuture.join();

        return allDoneFuture;
    }

    private CompletableFuture<Void> completedFuture(
        final String startingUrl, final int depth) {
        LOGGER.info("Link: {}, depth: {}.", startingUrl, depth);

        CompletableFuture<Void> future = new CompletableFuture<>();
        future.complete(null);

        return future;
    }

    private Supplier<Document> getContent(final String url) {
        return () -> {
            try {
                return connect(url).get();
            } catch (IOException e) {
                throw new UncheckedIOException(
                    " Something went wrong trying to fetch the contents of the URL: "
                        + url, e);
            }
        };
    }

    private Function<Document, Set<String>> getURLs(final int limit) {
        return doc -> {
            LOGGER.info("Getting URLs for document: {}.", doc.baseUri());

            return doc.select("a[href]").stream()
                .map(link -> link.attr("abs:href")).limit(limit)
                .peek(LOGGER::info).collect(toSet());
        };
    }

    private Function<Set<String>, Stream<CompletableFuture<Void>>> doForEach(
          final int depth, final int breadth) {
        return urls -> urls.stream().map(
            url -> crawl(url, depth - 1, breadth));
    }
}

Test case:

@Test
public void testCrawl() {
    new WebCrawler().crawl(
        "http://en.wikipedia.org/wiki/Java_%28programming_language%29",
        2, 10);
}

completable-future

multithreading

concurrency

java-8

web-crawler

1 Answer

Stack Overflow user

Accepted answer

Answered 2015-01-09 04:33:20

The problem is in the following code:

final CompletableFuture<Void> allDoneFuture = allOf(
  (CompletableFuture[]) of(startingUrl)
    .map(url -> supplyAsync(getContent(url)))
    .map(docFuture -> docFuture.thenApply(getURLs(breadth)))
    .map(urlsFuture -> urlsFuture.thenApply(doForEach(depth, breadth)))
    .toArray(size -> new CompletableFuture[size]));

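For some reason you are doing all of this in a stream of one element (is that part of the exercise?). The result is that allDoneFuture is not tracking the completion of the sub-tasks. It is tracking the completion of the Stream<CompletableFuture<Void>> that comes from doForEach. But that stream is ready right away, and the futures inside it are never asked to complete. Because urls.stream().map(...) is lazy and nothing ever runs a terminal operation on the resulting stream, the recursive crawl calls are never even made.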
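The effect described above can be reproduced without any network access. The following is only a minimal sketch: `LazyStreamDemo`, `fakeCrawl`, `brokenPipeline`, and the `CALLS` counter are hypothetical names introduced for illustration, with `fakeCrawl` standing in for the recursive `crawl(url, depth - 1, breadth)` call. It mirrors the shape of the question's pipeline, where the last `thenApply` produces a `Stream` of futures that nobody consumes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

class LazyStreamDemo {
    // Counts how often the stand-in crawl method is actually invoked.
    static final AtomicInteger CALLS = new AtomicInteger();

    // Hypothetical stand-in for the recursive crawl(url, depth - 1, breadth) call.
    static CompletableFuture<Void> fakeCrawl(String url) {
        CALLS.incrementAndGet();
        return CompletableFuture.completedFuture(null);
    }

    // Same shape as the question: the final thenApply yields a lazy Stream of futures.
    static CompletableFuture<Stream<CompletableFuture<Void>>> brokenPipeline(List<String> urls) {
        return CompletableFuture.supplyAsync(() -> urls)
            .thenApply(list -> list.stream().map(LazyStreamDemo::fakeCrawl));
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("a", "b", "c");

        // allOf only waits for the outer future, whose value is the unconsumed stream.
        CompletableFuture<Void> allDone = CompletableFuture.allOf(brokenPipeline(urls));
        allDone.join();

        System.out.println("outer done: " + allDone.isDone()); // prints "outer done: true"
        System.out.println("crawl calls: " + CALLS.get());     // prints "crawl calls: 0"
    }
}
```

The outer future completes as soon as the stream object exists, so `join()` returns while the mapping lambda has never run.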

Fix it by removing the stream, which isn't doing anything helpful:

final CompletableFuture<Void> allDoneFuture=supplyAsync(getContent(startingUrl))
    .thenApply(getURLs(breadth))
    .thenApply(doForEach(depth,breadth))
    .thenApply(futures -> futures.toArray(CompletableFuture[]::new))
    .thenCompose(CompletableFuture::allOf);
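To see the corrected chain recurse end to end, here is a self-contained sketch under stated assumptions: the `CrawlSketch` class, the in-memory `LINKS` map, and the `leaves` queue are hypothetical and replace the jsoup fetch and the logger from the question, so the example runs offline. The chaining itself (`thenApply` to build the child futures, `toArray` to consume the stream, `thenCompose` with `allOf` to wait on the children) follows the fix above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.TreeSet;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;

class CrawlSketch {
    // Hypothetical in-memory link graph standing in for pages fetched with jsoup.
    static final Map<String, List<String>> LINKS = new HashMap<>();
    static {
        LINKS.put("root", Arrays.asList("a", "b", "c"));
        LINKS.put("a", Arrays.asList("a1", "a2"));
        LINKS.put("b", Arrays.asList("b1"));
    }

    // Collects every URL reached at depth 0 (the question logs these instead).
    final Queue<String> leaves = new ConcurrentLinkedQueue<>();

    CompletableFuture<Void> crawl(String url, int depth, int breadth) {
        if (depth <= 0) {
            leaves.add(url);
            return CompletableFuture.completedFuture(null);
        }
        return CompletableFuture
            // "Fetch" the page: look up its outgoing links asynchronously.
            .supplyAsync(() -> LINKS.getOrDefault(url, Collections.<String>emptyList()))
            // Start one child crawl per link, limited to `breadth` links.
            .thenApply(links -> links.stream().limit(breadth)
                .map(link -> crawl(link, depth - 1, breadth)))
            // toArray is the terminal operation that actually triggers the child crawls.
            .thenApply(futures -> futures.toArray(CompletableFuture[]::new))
            // thenCompose flattens the result so the returned future waits for all children.
            .thenCompose(CompletableFuture::allOf);
    }

    public static void main(String[] args) {
        CrawlSketch crawler = new CrawlSketch();
        crawler.crawl("root", 2, 2).join();
        System.out.println(new TreeSet<>(crawler.leaves)); // prints [a1, a2, b1]
    }
}
```

With depth 2 and breadth 2, the crawl visits `a` and `b` from `root` and then their links, so the returned future completes only after the second level has been reached. Note that the caller still has to `join()` (or otherwise wait on) the returned future, as the question's test would need to do.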

Votes: 3

Original content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/27851374
