
Why am I getting 400 Bad Request on these proxy servers?

Stack Overflow user
Asked on 2015-11-07 20:58:52
1 answer · 2K views · 0 followers · 0 votes

So I'm very new to networking and to using proxy servers. I have a scraper that scrapes certain websites, but I realized that I need to change my IP address and such so I don't get booted from the sites. I found the following program on GitHub to use:

https://github.com/aivarsk/scrapy-proxies

My implementation of everything is as follows:

Spider:

# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from backpage_scrape import items
#from toolz import first
#import ipdb
#from lxml import html
from datetime import datetime, timedelta
import os
import time  # needed for time.sleep() in parse() below

HOME = os.environ['HOMEPATH']
os.chdir(HOME + "/Desktop/GitHub/Rover/backpage_scrape/backpage_scrape/spiders/")

# Method that gets today's date
def backpage_date_today():
    now = datetime.utcnow() - timedelta(hours=4)
    weekdays = ['Mon. ','Tue. ','Wed. ','Thu. ','Fri. ','Sat. ','Sun. ']
    months = ['Jan. ','Feb. ','Mar. ','Apr. ','May. ', 'Jun. ','Jul. ','Aug. ','Sep. ','Oct. ','Nov. ','Dec. ']
    backpage_date = weekdays[now.weekday()] + months[now.month-1] + str(now.day)
    return backpage_date

# Method that gets yesterday's date
def backpage_date_yesterday():
    now = datetime.utcnow() - timedelta(days=1, hours=4)
    weekdays = ['Mon. ','Tue. ','Wed. ','Thu. ','Fri. ','Sat. ','Sun. ']
    months = ['Jan. ','Feb. ','Mar. ','Apr. ','May. ','Jun. ','Jul. ','Aug. ','Sep. ','Oct. ','Nov. ','Dec. ']
    backpage_date = weekdays[now.weekday()] + months[now.month-1] + str(now.day)
    return backpage_date

# Open file which contains input urls
with open("test_urls.txt","rU") as infile:
    urls = [row.strip("\n") for row in infile]

class BackpageSpider(CrawlSpider):
    name = 'backpage'
    allowed_domains = ['backpage.com']
    start_urls = urls

    def parse(self, response):

        if response.status < 600:

            todays_links = []

            backpage_date = backpage_date_today()
            yesterday_date = backpage_date_yesterday()

            if backpage_date in response.body:
                # Get all URLs to iterate through
                todays_links = response.xpath("//div[@class='date'][1]/following-sibling::div[@class='date'][1]/preceding-sibling::div[preceding-sibling::div[@class='date']][contains(@class, 'cat')]/a/@href").extract()

            # timeOut = 0
            for url in todays_links:
                # Iterate through pages and scrape
                # if timeOut == 10:
                #   time.sleep(600)
                #   timeOut = 0
                # else:
                #   timeOut += 1

                yield scrapy.Request(url, callback=self.parse_ad_into_content)

            for url in set(response.xpath('//a[@class="pagination next"]/@href').extract()):
                yield scrapy.Request(url, callback=self.parse)

        else:
            time.sleep(600)
            yield scrapy.Request(response.url, callback=self.parse)

    # Parse page
    def parse_ad_into_content(self, response):
        item = items.BackpageScrapeItem(
            url=response.url,
            backpage_id=response.url.split('.')[0].split('/')[2].encode('utf-8'),
            text=response.body,
            posting_body=response.xpath("//div[@class='postingBody']").extract()[0].encode('utf-8'),
            date=datetime.utcnow() - timedelta(hours=5),
            posted_date=response.xpath("//div[@class='adInfo']/text()").extract()[0].encode('utf-8'),
            posted_age=response.xpath("//p[@class='metaInfoDisplay']/text()").extract()[0].encode('utf-8'),
            posted_title=response.xpath("//div[@id='postingTitle']//h1/text()").extract()[0].encode('utf-8')
        )
        return item
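As an aside, the two near-identical date helpers above could be collapsed into one parameterized function — a sketch that keeps the exact "Sat. Nov. 7"-style string the spider matches against the page body:

```python
from datetime import datetime, timedelta

def backpage_date(days_ago=0):
    """Build a date string like 'Sat. Nov. 7' for today (days_ago=0)
    or N days back, using the same UTC-4 offset as the original helpers."""
    now = datetime.utcnow() - timedelta(days=days_ago, hours=4)
    weekdays = ['Mon. ', 'Tue. ', 'Wed. ', 'Thu. ', 'Fri. ', 'Sat. ', 'Sun. ']
    months = ['Jan. ', 'Feb. ', 'Mar. ', 'Apr. ', 'May. ', 'Jun. ',
              'Jul. ', 'Aug. ', 'Sep. ', 'Oct. ', 'Nov. ', 'Dec. ']
    return weekdays[now.weekday()] + months[now.month - 1] + str(now.day)
```

`backpage_date()` then replaces `backpage_date_today()` and `backpage_date(1)` replaces `backpage_date_yesterday()`.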

The relevant part of settings.py:

# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    # Fix path to this module
    'backpage_scrape.randomproxy.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

PROXY_LIST = 'C:/Users/LPrice/Desktop/GitHub/Rover/backpage_scrape/backpage_scrape/proxies.txt'
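One thing worth knowing when debugging this setup: Scrapy's HttpProxyMiddleware attaches the proxy to `request.meta['proxy']`, not to the request URL, so the proxy never shows up in log lines that quote the URL. A minimal downloader-middleware sketch (the class name is hypothetical; it would be registered in `DOWNLOADER_MIDDLEWARES` alongside the others) that logs which proxy each request is using:

```python
import logging

logger = logging.getLogger(__name__)

class ProxyLoggingMiddleware:
    """Log the proxy attached to each outgoing request, if any.
    Scrapy downloader middlewares are plain classes; returning None
    from process_request lets the download continue normally."""

    def process_request(self, request, spider):
        proxy = request.meta.get('proxy')
        if proxy:
            logger.debug('Fetching %s via proxy %s', request.url, proxy)
        return None
```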

randomproxy.py is exactly as it appears at the GitHub link.

Proxies.txt:

https://6.hidemyass.com/ip-4
https://5.hidemyass.com/ip-1
https://4.hidemyass.com/ip-1
https://4.hidemyass.com/ip-2
https://4.hidemyass.com/ip-3
https://3.hidemyass.com/ip-1
https://3.hidemyass.com/ip-2
https://3.hidemyass.com/ip-3
https://2.hidemyass.com/ip-1
https://2.hidemyass.com/ip-2
https://2.hidemyass.com/ip-3
https://1.hidemyass.com/ip-1
https://1.hidemyass.com/ip-2
https://1.hidemyass.com/ip-3
https://1.hidemyass.com/ip-4
https://1.hidemyass.com/ip-5
https://1.hidemyass.com/ip-6
https://1.hidemyass.com/ip-7
https://1.hidemyass.com/ip-8

So, if you look at the top of the GitHub README, you'll see it says "copy-paste into a text file and reformat to http://host:port format". I don't know how to do that, or whether mine is already in that format.

Like I said, my error is 400 Bad Request. I'm not sure if it's useful, but the console says:

Retrying <GET http://sf.backpage.com/restOfURL> <failed 10 times>: 400 Bad Request

Shouldn't the proxy appear in the URL above, before the "sf.backpage.com" part?

Thank you for your time. I really appreciate your help.

Edit: Also, I'm not sure where/how to insert the code snippet from the bottom of the GitHub README. Any advice on that would be helpful too.


1 Answer

Stack Overflow user

Accepted answer

Answered on 2015-11-07 22:17:55

The URLs in your proxies.txt are not actually proxies.

Go to http://proxylist.hidemyass.com/ and search for proxies that use the HTTP protocol. Take the IP address and port columns from the search results and write them into your file in http://IP:port format.
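The reformatting the answer describes can be sketched in a few lines, assuming you have copied the IP and port columns as pairs (the addresses below are illustrative examples from the RFC 5737 test ranges, not real proxies):

```python
def to_proxy_lines(rows):
    """Turn (ip, port) pairs into the http://host:port lines
    that proxies.txt expects, one proxy per line."""
    return ['http://%s:%s' % (ip, port) for ip, port in rows]

# Example: two copied (IP, port) pairs -> proxies.txt contents
rows = [('203.0.113.5', '8080'), ('198.51.100.7', '3128')]
print('\n'.join(to_proxy_lines(rows)))
```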

Votes: 0
Original page content provided by Stack Overflow; translation originally supported by Tencent Cloud's IT-domain engine.
Original link:

https://stackoverflow.com/questions/33587708
