I'm trying to scrape data from a website, and when I reach an 18+ page I get a warning page first. My crawler works fine on most reddit pages and I can fetch data successfully. I tried using Selenium to move past the warning page; the click succeeds when it opens the browser, but the crawler doesn't follow to that page. Here is my code:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from selenium import webdriver

class DarknetmarketsSpider(scrapy.Spider):
    name = "darknetmarkets"
    allowed_domains = ["reddit.com"]
    start_urls = (
        'http://www.reddit.com/r/darknetmarkets',
    )
    rules = (Rule(LinkExtractor(allow=()), callback='parse_obj', follow=False),)

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get('http://www.reddit.com/r/darknetmarkets')
        # self.driver.get('https://www.reddit.com/over18?dest=https%3A%2F%2Fwww.reddit.com%2Fr%2Fdarknetmarketsnoobs')
        while True:
            try:
                YES_BUTTON = '//button[@value="yes"]'
                self.driver.find_element_by_xpath(YES_BUTTON).click()
            except Exception:
                break
        self.driver.close()
        item = darknetItem()  # darknetItem is the project's Item class
        item['url'] = []
        for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            item['url'].append(link.url)
            print(link)

The button's CSS:
<button class="c-btn c-btn-primary" type="submit" name="over18" value="yes">continue</button>

Posted on 2016-05-04 22:25:22
I see you are trying to get past the age-restriction screen on that subreddit. After you click the "continue" button, that choice is saved as a cookie, so you have to pass it back to Scrapy.

After clicking with Selenium, save the cookies and send them along with Scrapy's request.

Code adapted from "scrapy authentication login with cookies":
import scrapy
from selenium import webdriver

class MySpider(scrapy.Spider):
    name = 'MySpider'
    start_urls = ['http://reddit.com/']

    def get_cookies(self):
        self.driver = webdriver.Firefox()
        base_url = "http://www.reddit.com/r/darknetmarkets/"
        self.driver.get(base_url)
        self.driver.find_element_by_xpath("//button[@value='yes']").click()
        cookies = self.driver.get_cookies()
        self.driver.close()
        return cookies

    def parse(self, response):
        yield scrapy.Request("http://www.reddit.com/r/darknetmarkets/",
                             cookies=self.get_cookies(),
                             callback=self.darkNetPage)

https://stackoverflow.com/questions/36994417
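A detail worth knowing here: Selenium's `get_cookies()` returns a list of dicts with extra fields (`domain`, `path`, `secure`, ...), while Scrapy's `Request(cookies=...)` is happiest with a plain name-to-value mapping. A minimal helper to bridge the two (the helper name is my own, not from the answer; the sample cookie below assumes old reddit's `over18` cookie, matching the button markup shown above):

```python
def selenium_cookies_to_scrapy(selenium_cookies):
    """Reduce Selenium's cookie dicts to the name -> value mapping
    that scrapy.Request(cookies=...) understands."""
    return {c["name"]: c["value"] for c in selenium_cookies}

# Example: the kind of cookie reddit sets after clicking "continue"
raw = [{"name": "over18", "value": "1", "domain": ".reddit.com", "path": "/"}]
print(selenium_cookies_to_scrapy(raw))  # {'over18': '1'}
```

The resulting dict can be passed directly as the `cookies=` argument in the `parse` method above, instead of the raw Selenium list.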