I followed a tutorial for a basic web crawler written in Java and have some basic functionality working.
At the moment it just retrieves the HTML from a site and prints it to the console. I'd like to extend it so that it can filter out details such as the HTML page title and the HTTP status code.
I found this library: http://htmlparser.sourceforge.net/ ... which I think might do the job for me, but could I do it without an external library?
Here's what I have so far:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public static void main(String[] args) {
    // String representing the URL
    String input = "";
    // Check if an argument was passed on the command line
    if (args.length >= 1) {
        input = args[0];
    }
    // If no argument was given, use a default
    else {
        input = "http://www.my_site.com/";
        System.out.println("\nNo argument entered so default of " + input
                + " used: \n");
    }
    // Build the test URL and read from its input stream
    try {
        URL testURL = new URL(input);
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                testURL.openStream()));
        // String variable to hold the returned content
        String line = "";
        // Print content to the console until there are no more lines
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
    } catch (Exception e) {
        e.printStackTrace();
        System.out.println("Exception thrown");
    }
}

Posted on 2012-07-14 05:04:57
There are certainly libraries for HTTP communication. But if you'd rather implement it yourself, take a look at java.net.HttpURLConnection. It gives you finer-grained control over the HTTP exchange. Here is a small sample for you:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public static void main(String[] args) throws IOException
{
URL url = new URL("http://www.google.com");
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod("GET");
String resp = getResponseBody(connection);
System.out.println("RESPONSE CODE: " + connection.getResponseCode());
System.out.println(resp);
}
private static String getResponseBody(HttpURLConnection connection)
throws IOException
{
try
{
BufferedReader reader = new BufferedReader(new InputStreamReader(
connection.getInputStream()));
StringBuilder responseBody = new StringBuilder();
String line = "";
while ((line = reader.readLine()) != null)
{
responseBody.append(line + "\n");
}
reader.close();
return responseBody.toString();
}
catch (IOException e)
{
e.printStackTrace();
return "";
}
}

https://stackoverflow.com/questions/11477341
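For the other part of the question, pulling the page title without an external library: regular expressions are fragile for parsing arbitrary HTML, but for extracting a single `<title>` element from a response body they can serve as a dependency-free sketch. The class and method names below are illustrative, not from the posts above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractor {
    // Case-insensitive match on the contents of the <title> element;
    // DOTALL lets the title span line breaks, and the reluctant .*?
    // stops at the first closing tag.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed title text, or null if no <title> is found.
    public static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Example Page</title></head>"
                + "<body></body></html>";
        System.out.println(extractTitle(html)); // prints "Example Page"
    }
}
```

You could feed the string returned by `getResponseBody` above into `extractTitle`, alongside `connection.getResponseCode()` for the status code.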