So, I'm trying to use jsoup to scrape images from Reddit, but when I scrape certain subreddits (like /r/wallpaper) I get a 429 error, and I'd like to know how to get around it. I fully understand that this code is awful and that this is a really bad question, but I'm completely new to this. Anyway:
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Scanner;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class javascraper {

    public static void main(String[] args) throws MalformedURLException {
        Scanner scan = new Scanner(System.in);
        System.out.println("Where do you want to store the files?");
        String folderpath = scan.next();
        System.out.println("What subreddit do you want to scrape?");
        String subreddit = scan.next();
        // create the target folder before turning the subreddit name into a URL
        new File(folderpath + "/" + subreddit).mkdir();
        subreddit = "http://reddit.com/r/" + subreddit;

        try {
            // fetch the subreddit page over HTTP
            Document doc = Jsoup.connect(subreddit).timeout(0).get();
            // get page title
            String title = doc.title();
            System.out.println("title : " + title);
            // get all links
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                // get value from href attribute
                String checkLink = link.attr("href");
                if (imgCheck(checkLink)) { // checks whether the link points to an image
                    System.out.println("link : " + link.attr("href"));
                    downloadImages(checkLink, folderpath);
                }
            }
        }
        catch (IOException e) {
            e.printStackTrace();
        }
    }
    public static boolean imgCheck(String http) {
        String png = ".png";
        String jpg = ".jpg";
        String jpeg = "jpeg"; // no period so the check still matches the last four characters
        String gif = ".gif";

        if (http.contains(png) || http.contains("gfycat") || http.contains(jpg) || http.contains(jpeg) || http.contains(gif)) {
            return true;
        }
        else {
            return false;
        }
    }
    private static void downloadImages(String src, String folderpath) throws IOException {
        // extract the name of the image from the src attribute
        int indexname = src.lastIndexOf("/");
        if (indexname == src.length() - 1) {
            // strip a trailing slash before extracting the file name
            src = src.substring(0, indexname);
        }
        indexname = src.lastIndexOf("/");
        String name = src.substring(indexname, src.length());
        System.out.println(name);

        // open a URL stream and copy the bytes to a local file
        URL url = new URL(src);
        InputStream in = url.openStream();
        OutputStream out = new BufferedOutputStream(new FileOutputStream(folderpath + name));
        for (int b; (b = in.read()) != -1;) {
            out.write(b);
        }
        out.close();
        in.close();
    }
}
Posted on 2015-09-25 15:16:38
Your problem is caused by your scraper violating reddit's API rules. Error 429 means "Too Many Requests": you're requesting too many pages too quickly.
You may make one request every 2 seconds, and you also need to set a proper user agent (their recommended format is <platform>:<app ID>:<version string> (by /u/<reddit username>)). As it currently stands, your code is running too fast and doesn't specify one, so it will be severely rate-limited.
To fix it, first, add this to the start of your class, before the main method:
public static final String USER_AGENT = "<PUT YOUR USER AGENT HERE>";
(Make sure to specify an actual user agent.)
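For illustration, a filled-in constant following reddit's recommended format might look like the line below; the app ID and username are hypothetical placeholders, not values from the original post.
// Hypothetical example only: substitute your own app ID and reddit username.
public static final String USER_AGENT = "desktop:com.example.redditscraper:v0.1 (by /u/your_username)";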
Then, change this (in downloadImages):
URL url = new URL(src);
InputStream in = url.openStream();
to this:
URLConnection connection = (new URL(src)).openConnection();
try {
    Thread.sleep(2000); // delay to comply with rate limiting
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}
connection.setRequestProperty("User-Agent", USER_AGENT);
InputStream in = connection.getInputStream();
(The try/catch is needed because Thread.sleep throws the checked InterruptedException, which downloadImages does not declare.)
You'll also need to change this (in main):
Document doc = Jsoup.connect(subreddit).timeout(0).get();
to this:
Document doc = Jsoup.connect(subreddit).userAgent(USER_AGENT).timeout(0).get();
Your code should then stop running into that error.
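Putting the pieces together, the patched downloadImages might look like the sketch below. This is a minimal illustration, not the answer's verbatim code: it assumes the USER_AGENT constant above and the same folderpath handling as the original, and uses try-with-resources so the streams close even on error.
// Sketch of the patched method under the assumptions above.
private static void downloadImages(String src, String folderpath) throws IOException {
    int indexname = src.lastIndexOf("/");
    String name = src.substring(indexname); // file name including leading slash
    URLConnection connection = (new URL(src)).openConnection();
    try {
        Thread.sleep(2000); // stay under reddit's roughly one-request-per-2-seconds limit
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
    connection.setRequestProperty("User-Agent", USER_AGENT);
    try (InputStream in = connection.getInputStream();
         OutputStream out = new BufferedOutputStream(new FileOutputStream(folderpath + name))) {
        for (int b; (b = in.read()) != -1;) {
            out.write(b);
        }
    }
}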
Note that using reddit's API (i.e., /r/subreddit.json instead of /r/subreddit) would probably make this project easier, but it isn't required; your current code will also work.
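If you do go the JSON route, a minimal sketch with jsoup might look like the following. It only fetches the raw JSON body (parsing it would need a JSON library), and the subreddit URL and USER_AGENT constant are assumptions carried over from the discussion above.
// Fetch a subreddit's JSON listing as a raw string (a sketch, not the poster's code).
String json = Jsoup.connect("http://reddit.com/r/wallpaper.json")
        .userAgent(USER_AGENT)
        .ignoreContentType(true) // jsoup rejects non-HTML responses unless told otherwise
        .execute()
        .body();
System.out.println(json);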
https://stackoverflow.com/questions/32769754