I am trying to parse the HTML of the following URL with jsoup:
http://brickseek.com/walmart-inventory-checker/
When I run the program I get the exception below. I am using jsoup-1.10.1.jar.
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=http://brickseek.com/walmart-inventory-checker/
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:598)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:548)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:235)
at Third.main(Third.java:22)
Here is the program:
import java.io.IOException;
import org.jsoup.Connection.Method;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Third {
    public static void main(String[] args) throws IOException {
        String uniqueSku = "44656182";
        String zipCode = "75160";
        // POST the search form fields to the inventory checker.
        Response response = Jsoup.connect("http://brickseek.com/walmart-inventory-checker/")
                .data("store_type", "3", "sku", uniqueSku, "zip", zipCode, "sort", "distance")
                .userAgent("Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2")
                .method(Method.POST)
                .timeout(0) // 0 disables the timeout
                .execute(); // the HttpStatusException (403) is thrown here
        String rawHTML = response.body();
        Document parsedDocument = Jsoup.parse(rawHTML);
        Element bodyElement = parsedDocument.body();
        Elements inStockTableElement = bodyElement.getElementsByTag("table");
    }
}
Any help would be appreciated.
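Answered 2017-04-01 21:06:50
The server probably has some way of detecting that you are using a bot to scrape the page. Try changing your HTTP headers to something like this: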
import org.jsoup.Connection;

public class Util {
    // Adds browser-like request headers to an existing jsoup Connection.
    public static Connection mask(Connection c) {
        return c.header("Host", "brickseek.com")
                .header("Connection", "keep-alive")
                // .header("Content-Length", ""+c.request().requestBody().length())
                .header("Cache-Control", "max-age=0")
                .header("Origin", "https://brickseek.com/")
                .header("Upgrade-Insecure-Requests", "1")
                .header("User-Agent", "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.48 Safari/537.36")
                .header("Content-Type", "application/x-www-form-urlencoded")
                .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
                .referrer("http://brickseek.com/walmart-inventory-checker/")
                // The browser also advertises "deflate, br"; jsoup has no Brotli decoder,
                // so only gzip is requested here to keep the response body readable.
                .header("Accept-Encoding", "gzip")
                .header("Accept-Language", "en-US,en;q=0.8");
    }
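}
These headers were copied from a request sent by Google Chrome. Bots are often detected by an unusual header order or by different header capitalization, so mimicking the browser's request as closely as possible should let the request pass undetected.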
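Some bot-detection algorithms count the number of requests per IP address and start blocking once a threshold is exceeded, which is why the original code still works for some people.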
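If per-IP request counting is the cause, spacing requests out on the client side may help. The following is a minimal sketch using only the Java standard library; `RequestThrottle`, its method names, and the interval value are hypothetical and not part of jsoup, and the threshold the site actually uses (if any) is unknown.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: enforces a minimum interval between requests to the same host. */
final class RequestThrottle {
    private final long minIntervalMillis;
    // Scheduled send time of the most recent request, per host.
    private final Map<String, Long> lastRequestAt = new HashMap<>();

    RequestThrottle(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    /**
     * Reserves the next send slot for the host and returns how many
     * milliseconds the caller should wait, given the current time.
     */
    synchronized long reserve(String host, long nowMillis) {
        Long last = lastRequestAt.get(host);
        long sendAt = (last == null)
                ? nowMillis
                : Math.max(nowMillis, last + minIntervalMillis);
        lastRequestAt.put(host, sendAt);
        return sendAt - nowMillis;
    }

    /** Blocks the current thread until a request to the host is allowed. */
    void await(String host) throws InterruptedException {
        long waitMillis = reserve(host, System.currentTimeMillis());
        if (waitMillis > 0) {
            Thread.sleep(waitMillis);
        }
    }
}
```

Calling `throttle.await("brickseek.com")` before each `execute()` would keep requests at least `minIntervalMillis` apart. Taking the clock value as a parameter in `reserve` keeps the scheduling logic easy to check without sleeping.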
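Answered 2017-07-24 11:15:39
Just add ignoreHttpErrors(true) to your code. This stops jsoup from throwing HttpStatusException, but the server still answers with status 403, so check response.statusCode() before parsing the body.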
Response response = Jsoup.connect("http://brickseek.com/walmart-inventory-checker/")
        .data("store_type", "3", "sku", uniqueSku, "zip", zipCode, "sort", "distance")
        .userAgent("Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2")
        .method(Method.POST)
        .timeout(0)
        .ignoreHttpErrors(true) // return the response instead of throwing on 4xx/5xx
        .execute();
Thanks
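Because an error response is returned rather than thrown once HTTP errors are ignored, the caller has to decide what to do with the status code. Below is a small sketch of that decision in plain Java; `StatusCheck` and its `Outcome` values are hypothetical names introduced for illustration and are not part of jsoup.

```java
/** Hypothetical helper: decides how to treat an HTTP status code. */
final class StatusCheck {
    enum Outcome { PARSE, BLOCKED, OTHER_ERROR }

    static Outcome classify(int statusCode) {
        if (statusCode >= 200 && statusCode < 300) {
            return Outcome.PARSE;        // success: the body is worth parsing
        }
        if (statusCode == 403 || statusCode == 429) {
            return Outcome.BLOCKED;      // forbidden or rate limited: likely bot detection
        }
        return Outcome.OTHER_ERROR;      // anything else: report and stop
    }
}
```

With the request above, the call would be `StatusCheck.classify(response.statusCode())`, and `response.body()` would be passed to `Jsoup.parse` only when the result is `PARSE`.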
https://stackoverflow.com/questions/40750860