I followed a tutorial for a basic web crawler written in Java and have some basic functionality working.
At the moment it just retrieves the HTML from a site and prints it to the console. I'd like to extend it so that it can filter out details such as the HTML page title and the HTTP status code.
I found this library: http://htmlparser.sourceforge.net/ ... which I think might do the job for me, but could I do it without an external library?
Here's what I have so far:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public static void main(String[] args) {
    // String representing the URL
    String input = "";
    // Check if an argument was passed on the command line
    if (args.length >= 1) {
        input = args[0];
    }
    // If no argument was given, use a default
    else {
        input = "http://www.my_site.com/";
        System.out.println("\nNo argument entered so default of " + input
                + " used: \n");
    }
    // Build the test URL and read from its input stream
    try {
        URL testURL = new URL(input);
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                testURL.openStream()));
        // String variable to hold the returned content
        String line = "";
        // Print content to the console until there are no more lines
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
    } catch (Exception e) {
        e.printStackTrace();
        System.out.println("Exception thrown");
    }
}

Posted on 2012-07-14 05:04:57
There are certainly libraries for HTTP communication. But if you'd rather implement it yourself, take a look at java.net.HttpURLConnection. It gives you finer-grained control over the HTTP exchange. Here is a small sample for you:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public static void main(String[] args) throws IOException
{
URL url = new URL("http://www.google.com");
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod("GET");
String resp = getResponseBody(connection);
System.out.println("RESPONSE CODE: " + connection.getResponseCode());
System.out.println(resp);
}
private static String getResponseBody(HttpURLConnection connection)
throws IOException
{
try
{
BufferedReader reader = new BufferedReader(new InputStreamReader(
connection.getInputStream()));
StringBuilder responseBody = new StringBuilder();
String line = "";
while ((line = reader.readLine()) != null)
{
responseBody.append(line + "\n");
}
reader.close();
return responseBody.toString();
}
catch (IOException e)
{
e.printStackTrace();
return "";
}
}

https://stackoverflow.com/questions/11477341
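For the other part of the question, pulling the page title without an external library: regular expressions are fragile for parsing arbitrary HTML, but for extracting a single `<title>` element from a response body they can serve as a dependency-free sketch. The class and method names below are illustrative, not from the posts above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractor {
    // Case-insensitive match on the contents of the <title> element;
    // DOTALL lets the title span line breaks, and the reluctant .*?
    // stops at the first closing tag.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed title text, or null if no <title> is found.
    public static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Example Page</title></head>"
                + "<body></body></html>";
        System.out.println(extractTitle(html)); // prints "Example Page"
    }
}
```

You could feed the string returned by `getResponseBody` above into `extractTitle`, alongside `connection.getResponseCode()` for the status code.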