I am implementing a data pipeline for a chatbot. I am crawling specific subreddits with Scrapy to collect submission IDs (this is not possible with the wrapper).
Going further, I use PRAW to recursively fetch all comments for those submissions. Both implementations already work.
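For context, the comment side looks roughly like this with PRAW (only a sketch, not my exact code; the credentials are placeholders for a script-type app):

import praw

# Placeholder credentials for a registered "script" app
reddit = praw.Reddit(client_id="CLIENT_ID",
                     client_secret="CLIENT_SECRET",
                     user_agent="reddit_crawler_scrapy university project m.reichart@hotmail.com")

def fetch_all_comments(submission_id):
    submission = reddit.submission(id=submission_id)
    # Expand every "load more comments" stub so the full tree is available
    submission.comments.replace_more(limit=None)
    # Flatten the comment forest into a single list of comment bodies
    return [comment.body for comment in submission.comments.list()]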
但是,爬行subreddits会在几页之后被reddit拒绝(取决于get请求的速度,.)。
我不想破坏任何规则,但是在reddit规则中是否有适当的刮擦配置(DOWNLOAD_DELAY或其他节流机制)来收集这些信息?
我的刮痕蜘蛛:
# -*- coding: utf-8 -*-
import scrapy


class RedditSpider(scrapy.Spider):
    name = 'reddit'
    allowed_domains = ["reddit.com"]

    def __init__(self, subreddit=None, pages=None, *args, **kwargs):
        super(RedditSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['https://www.reddit.com/r/%s/new/' % subreddit]
        self.pages = int(pages)
        self.page_count = 0

    def parse(self, response):
        # Extracting the content using css selectors
        titles = response.css('.title.may-blank::text').extract()
        # votes = response.css('.score.unvoted::text').extract()
        # times = response.css('time::attr(title)').extract()
        # comments = response.css('.comments::text').extract()
        submission_id = response.css('.title.may-blank').xpath('@data-outbound-url').extract()
        # submission_id = submission_id[24:33]

        # Give the extracted content row wise
        # for item in zip(titles, votes, times, comments, titles_full):
        for item in zip(titles, submission_id):
            # create a dictionary to store the scraped info
            scraped_info = {
                'title': item[0],
                'submission_id': item[1][23:32]
                # 'vote': item[2],
                # 'created_at': item[3],
                # 'comments': item[4]
            }
            # yield or give the scraped info to scrapy
            yield scraped_info

        if (self.pages > 1) and (self.page_count < self.pages):
            self.page_count += 1
            next_page = response.css('span.next-button a::attr(href)').extract_first()
            if next_page is not None:
                print("next page ... " + next_page)
                yield response.follow(next_page, callback=self.parse)
            if next_page is None:
                print("no more pages ... lol")

My spider configuration:
# -*- coding: utf-8 -*-
# Scrapy settings for reddit_crawler_scrapy project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'reddit_crawler_scrapy'
SPIDER_MODULES = ['reddit_crawler_scrapy.spiders']
NEWSPIDER_MODULE = 'reddit_crawler_scrapy.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'reddit_crawler_scrapy university project m.reichart@hotmail.com'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'reddit_crawler_scrapy.middlewares.RedditCrawlerScrapySpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'reddit_crawler_scrapy.middlewares.MyCustomDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'reddit_crawler_scrapy.pipelines.RedditCrawlerScrapyPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
#Export as CSV Feed
FEED_FORMAT = "csv"
FEED_URI = "reddit.csv"
# RANDOMIZE_DOWNLOAD_DELAY = False
LOG_FILE = 'scrapy_log.txt'

I have set DOWNLOAD_DELAY to 5 seconds, which RANDOMIZE_DOWNLOAD_DELAY (enabled by default) multiplies by a random factor between 0.5 and 1.5. That comes out to 2.5 to 7.5 seconds between GET requests/downloads, which is already quite slow but would still get the job done within hours/days.
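For completeness, instead of a fixed delay I could also let Scrapy adapt the pace itself; a sketch of what that might look like in settings.py (the concrete values are just guesses, Reddit does not document any of this):

# settings.py (sketch) - let Scrapy adapt the delay instead of a fixed 5 s
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5            # start as slow as the current fixed delay
AUTOTHROTTLE_MAX_DELAY = 60             # back off further when responses get slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # aim for one request in flight at a time
CONCURRENT_REQUESTS_PER_DOMAIN = 1      # never hit reddit.com in parallel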
Nevertheless, after a few pages I no longer receive a next page; the last response leads me to a Reddit submission linking to instructions on how to set up a bot properly (IMHO delivered in an ironic tone - well played, Reddit).
Posted on 2017-12-19 19:56:34
IMO their anti-crawling measures will cost you too much time; I would not try to go down that route.
They have an API to get all posts of a subreddit, e.g. https://www.reddit.com/r/subreddit/top.json?sort=top returns all posts of /r/subreddit in JSON format, and it contains the same content you see on their website.
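A minimal sketch of what consuming that endpoint could look like (using the requests library; the data/children/after layout and the after pagination parameter are how the listing usually comes back, but treat them as assumptions, and "learnpython" is just an example subreddit):

import requests

HEADERS = {"User-Agent": "reddit_crawler_scrapy university project m.reichart@hotmail.com"}

def fetch_top(subreddit, limit=100, after=None):
    # Same listing as the HTML page, just as JSON
    url = "https://www.reddit.com/r/%s/top.json" % subreddit
    params = {"sort": "top", "limit": limit}
    if after:
        params["after"] = after  # fullname of the last post -> next page
    response = requests.get(url, headers=HEADERS, params=params)
    response.raise_for_status()
    data = response.json()["data"]
    posts = [child["data"] for child in data["children"]]
    return posts, data["after"]

posts, after = fetch_top("learnpython")
for post in posts:
    print(post["id"], post["title"])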
Also, their docs recommend that you use OAuth. They then allow you 60 requests per minute. I would rather go that route. It is also much safer than scraping, because scraping breaks whenever they change something in the HTML layout.
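Since you already use PRAW for the comments, the same OAuth app could also cover the submission IDs; a rough sketch (client_id/client_secret are placeholders for an app registered in Reddit's app preferences, and "learnpython" is again just an example):

import praw

reddit = praw.Reddit(client_id="CLIENT_ID",
                     client_secret="CLIENT_SECRET",
                     user_agent="reddit_crawler_scrapy university project m.reichart@hotmail.com")

# Newest submissions of a subreddit - the same data as the /new/ page being crawled
for submission in reddit.subreddit("learnpython").new(limit=500):
    print(submission.id, submission.title)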
https://stackoverflow.com/questions/47894052