Python: Scrapy and Reddit

Stack Overflow user
Asked on 2017-12-19 19:35:55
1 answer · 2.5K views · 0 followers · Score: 1

I am implementing a data pipeline for a chatbot. I am crawling specific subreddits with Scrapy to collect submission ids (this is not possible with the API wrapper).

Further on, I use praw to recursively fetch all comments. Both parts already work.

However, crawling the subreddits gets rejected by reddit after a few pages (depending on the rate of the GET requests).

I don't want to break any rules, but is there a Scrapy configuration (DOWNLOAD_DELAY or another throttling mechanism) that lets me collect this information while staying within reddit's rules?

My Scrapy spider:

Code language: python
# -*- coding: utf-8 -*-
import scrapy

class RedditSpider(scrapy.Spider):
    name = 'reddit'
    allowed_domains = ["reddit.com"]

    def __init__(self, subreddit=None, pages=None, *args, **kwargs):
        super(RedditSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['https://www.reddit.com/r/%s/new/' % subreddit]
        self.pages = int(pages)
        self.page_count = 0

    def parse(self, response):

        # Extracting the content using css selectors
        titles = response.css('.title.may-blank::text').extract()
        # votes = response.css('.score.unvoted::text').extract()
        # times = response.css('time::attr(title)').extract()
        # comments = response.css('.comments::text').extract()
        submission_id = response.css('.title.may-blank').xpath('@data-outbound-url').extract()
        # submission_id = submission_id[24:33]

        # Give the extracted content row wise
        # for item in zip(titles, votes, times, comments, titles_full):
        for item in zip(titles, submission_id):
            # create a dictionary to store the scraped info
            scraped_info = {
                'title': item[0],
                'submission_id': item[1][23:32]
                # 'vote': item[2],
                # 'created_at': item[3],
                # 'comments': item[4]
            }

            # yield or give the scraped info to scrapy
            yield scraped_info

        if (self.pages > 1) and (self.page_count < self.pages):
            self.page_count += 1
            next_page = response.css('span.next-button a::attr(href)').extract_first()
            if next_page is not None:
                print("next page ... " + next_page)
                yield response.follow(next_page, callback=self.parse)

            if next_page is None:
                print("no more pages ... lol")
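As an aside, the fixed slice `item[1][23:32]` in the spider above assumes the outbound URL always has the same prefix length, which is brittle. A hedged sketch of a more robust extraction, assuming the URL contains a `/comments/<id>/` segment as reddit permalinks do:

```python
import re

# Sketch: pull the base-36 submission id out of a reddit URL such as
# "https://www.reddit.com/r/python/comments/7krp01/some_title/".
# Assumes the URL contains a "/comments/<id>" segment.
_SUBMISSION_ID_RE = re.compile(r"/comments/([0-9a-z]+)")

def extract_submission_id(url):
    """Return the submission id from a reddit URL, or None if absent."""
    match = _SUBMISSION_ID_RE.search(url)
    return match.group(1) if match else None

print(extract_submission_id(
    "https://www.reddit.com/r/python/comments/7krp01/some_title/"))  # -> 7krp01
```

Unlike a positional slice, this keeps working if the URL prefix changes length (e.g. a different subreddit name).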

My spider settings:

Code language: python
# -*- coding: utf-8 -*-

# Scrapy settings for reddit_crawler_scrapy project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'reddit_crawler_scrapy'

SPIDER_MODULES = ['reddit_crawler_scrapy.spiders']
NEWSPIDER_MODULE = 'reddit_crawler_scrapy.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'reddit_crawler_scrapy university project m.reichart@hotmail.com'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'reddit_crawler_scrapy.middlewares.RedditCrawlerScrapySpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'reddit_crawler_scrapy.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'reddit_crawler_scrapy.pipelines.RedditCrawlerScrapyPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

#Export as CSV Feed
FEED_FORMAT = "csv"
FEED_URI = "reddit.csv"


# RANDOMIZE_DOWNLOAD_DELAY = False

LOG_FILE='scrapy_log.txt'

I have set DOWNLOAD_DELAY to 5 seconds, which by the mechanism controlled by RANDOMIZE_DOWNLOAD_DELAY is multiplied with a random number between 0.5 and 1.5. That works out to between 2.5 and 7.5 seconds per GET request/download, which is already slow, but would get the job done within a few hours/days.
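As a sketch, Scrapy's AutoThrottle extension (already stubbed out in the settings file above) can replace the fixed delay and adapt to the server's response latency; the values below are illustrative, not a recommendation from reddit:

```python
# Sketch: adaptive throttling instead of a fixed DOWNLOAD_DELAY.
# Values are illustrative; tune them against the responses you actually get.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5           # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60            # back off this far on high latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # aim for one request at a time
AUTOTHROTTLE_DEBUG = True              # log the delay chosen for each response
```

With AUTOTHROTTLE_DEBUG enabled, the log shows the delay Scrapy picked for each response, which helps find the slowest rate reddit still tolerates.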

Nevertheless, after a few pages I stop receiving a next page; the last page I do get leads me to a reddit submission with a link on how to set up a bot properly (imho in an ironic tone - well played, reddit).


1 Answer

Stack Overflow user

Answered on 2017-12-19 19:56:34

IMO fighting their anti-crawling mechanisms would cost you too much time; I wouldn't try to go down that road.

They have an API to get all the posts of a subreddit: for example, https://www.reddit.com/r/subreddit/top.json?sort=top returns the posts of /r/subreddit in JSON format, and it appears to be the same content you see on their website.
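The JSON endpoint returns a listing object; below is a minimal sketch of pulling ids and titles out of one. The `sample_listing` dict is a hand-made stand-in for a real response, trimmed to only the fields used:

```python
def parse_listing(listing):
    """Extract (id, title) pairs from a reddit listing JSON object."""
    return [
        (child["data"]["id"], child["data"]["title"])
        for child in listing["data"]["children"]
    ]

# Hand-made stand-in for a response from
# https://www.reddit.com/r/subreddit/top.json?sort=top
sample_listing = {
    "data": {
        "children": [
            {"data": {"id": "7krp01", "title": "First post"}},
            {"data": {"id": "7krp02", "title": "Second post"}},
        ],
        "after": None,  # pagination cursor; None on the last page
    }
}

print(parse_listing(sample_listing))
# -> [('7krp01', 'First post'), ('7krp02', 'Second post')]
```

The `after` cursor in the listing is what you would pass back (as `?after=...`) to page through older posts instead of scraping the "next" button.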

Also, their docs suggest that you use OAuth; they then allow you 60 requests per minute. I would rather go that route. It is also much safer than scraping, since your scraper breaks whenever they change something in the HTML layout.
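To stay under a 60-requests-per-minute limit like the one mentioned above (praw handles this by itself when configured with OAuth credentials), a minimal client-side pacing sketch - space calls so consecutive requests are at least one interval apart:

```python
import time

class RatePacer:
    """Sketch: space calls so at most `per_minute` happen in any minute."""

    def __init__(self, per_minute=60):
        self.min_interval = 60.0 / per_minute  # seconds between calls
        self._last = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()

# Demo uses 600/min just to keep it fast; use per_minute=60 for reddit's limit.
pacer = RatePacer(per_minute=600)
start = time.monotonic()
for _ in range(3):
    pacer.wait()  # first call returns immediately, later ones pause
elapsed = time.monotonic() - start
print(round(elapsed, 1))  # roughly 0.2 s for 3 calls at 0.1 s intervals
```

Calling `pacer.wait()` before each API request keeps the crawl at or below the permitted rate without having to think about timing at each call site.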

Score: 1
The original content of this page was provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/47894052
