Below is part of the code I wrote to scrape the details of bikes for sale on bikesales.com.au (the full code is here). It finds all the 'href' attributes on each search page and then requests the HTML for each href, i.e. for each bike that is for sale. My code works, but to avoid the following error I had to add retry attempts with exponential backoff:

ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None)

The code works correctly as it is, but I would like to avoid the backoff approach if possible.
import time

from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

def get_html_content(url, multiplier=1):
    """
    Retrieve the contents of the url.
    """
    # Be a responsible scraper.
    # The multiplier is used to exponentially increase the delay when
    # there are several attempts at connecting to the url.
    time.sleep(2*multiplier)
    # Get the html from the url
    try:
        with closing(get(url)) as resp:
            content_type = resp.headers['Content-Type'].lower()
            if is_good_response(resp):
                return resp.content
            else:
                # Unable to get the url response
                return None
    except RequestException as e:
        print("Error during requests to {0} : {1}".format(url, str(e)))
if __name__ == '__main__':
    baseUrl = 'https://www.bikesales.com.au/'
    url = 'https://www.bikesales.com.au/bikes/?q=Service%3D%5BBikesales%5D'
    content = get_html_content(url)
    html = BeautifulSoup(content, 'html.parser')
    BikeList = html.findAll("a", {"class": "item-link-container"})
    # Cycle through the list of bikes on each search page.
    for bike in BikeList:
        # Get the URL for each bike.
        individualBikeURL = bike.attrs['href']
        BikeContent = get_html_content(baseUrl+individualBikeURL)
        # Reset the multiplier for each new url
        multiplier = 1
        ## Occasionally the connection is lost, so try again.
        ## I'm not sure why the connection is lost; it might be that the
        ## site is trying to guard against scraping software.
        # If the initial attempt to connect to the url was unsuccessful,
        # try again with an increasing delay.
        while (BikeContent == None):
            # Limit the exponential delay to 16x
            if (multiplier < 16):
                multiplier *= 2
            BikeContent = get_html_content(baseUrl+individualBikeURL, multiplier)

My question is: am I missing something when making these requests, or is this simply the site pushing back against scraping tools?
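With the 16x cap, the successive delays work out to 2, 4, 8, 16, 32, 32, ... seconds. A small standalone illustration (not part of the scraper):

multiplier, delays = 1, []
for _ in range(7):
    delays.append(2 * multiplier)  # seconds slept before each attempt
    if multiplier < 16:            # the 16x cap from the loop above
        multiplier *= 2
print(delays)  # [2, 4, 8, 16, 32, 32, 32]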
Posted on 2018-05-29 14:39:39
is_good_response is just checking for a 200 response code. I would merge is_good_response, get_html_content, and the body of the for loop together. This makes the main code:
from requests import get
from bs4 import BeautifulSoup

if __name__ == '__main__':
    baseUrl = 'https://www.bikesales.com.au/'
    url = 'https://www.bikesales.com.au/bikes/?q=Service%3D%5BBikesales%5D'
    content = get_bike(url)
    html = BeautifulSoup(content, 'html.parser')
    bike_list = html.findAll("a", {"class": "item-link-container"})
    for bike in bike_list:
        individualBikeURL = bike.attrs['href']
        bike_content = get_bike(baseUrl + individualBikeURL)

We'll focus on get_bike:
def get_bike(url):
    multiplier = 1
    bike_content = None
    while bike_content is None:
        time.sleep(2*multiplier)
        try:
            with closing(get(url)) as resp:
                content_type = resp.headers['Content-Type'].lower()
                if 200 <= resp.status_code < 300:
                    bike_content = resp.content
        except RequestException as e:
            print("Error during requests to {0} : {1}".format(url, str(e)))
        if multiplier < 16:
            multiplier *= 2
    return bike_content

We can also add another function so that it works like the previous code did.
A few more points, reflected in the code below:

- You don't need contextlib.closing, as Response.close "usually doesn't need to be called explicitly".
- content_type is unused in get_bike.
- Take *args and **kwargs so that requests.get's parameters can be used when needed.
- Passing requests.get, requests.post, or another request method as a parameter allows that method to be used.

import time
import itertools
import collections.abc

import requests.exceptions
def request(method, retry=None, *args, **kwargs):
    if retry is None:
        retry = iter([])
    elif retry == -1:
        retry = (2**i for i in itertools.count())
    elif isinstance(retry, int):
        retry = (2**i for i in range(retry))
    elif isinstance(retry, collections.abc.Iterable):
        pass
    else:
        raise ValueError('Unknown retry {retry}'.format(retry=retry))

    for sleep in itertools.chain([0], retry):
        if sleep:
            time.sleep(sleep)
        try:
            resp = method(*args, **kwargs)
            if 200 <= resp.status_code < 300:
                return resp.content
        except requests.exceptions.RequestException as e:
            # Assumes the url is the first positional argument, as in
            # request(requests.get, retry, url).
            print('Error during requests to {0} : {1}'.format(args[0], str(e)))
    return None
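With this, retry=None makes a single attempt, an integer n allows n retries with doubling delays, -1 retries forever, and any iterable supplies a custom delay schedule. For illustration, with url as a placeholder:

content = request(requests.get, None, url)      # one attempt, no retries
content = request(requests.get, 3, url)         # retries after 1, 2 and 4 seconds
content = request(requests.get, -1, url)        # retries forever: 1, 2, 4, 8, ... seconds
content = request(requests.get, [5, 10], url)   # custom schedule: 5 then 10 seconds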
def bike_retrys():
    for i in range(5):
        yield 2**i
    while True:
        yield 16
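bike_retrys reproduces the capped backoff from the question: the delays double up to 16 seconds and then stay there. A quick check (itertools is imported above):

print(list(itertools.islice(bike_retrys(), 8)))  # -> [1, 2, 4, 8, 16, 16, 16, 16]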
To improve the rest of the code:

- import requests rather than from requests import get.
- Wrap request in a small helper so the usage is cleaner.

import requests
from bs4 import BeautifulSoup
def get_bike(*args, **kwargs):
    return request(requests.get, bike_retrys(), *args, **kwargs)
if __name__ == '__main__':
    BASE_URL = 'https://www.bikesales.com.au/'
    url = 'https://www.bikesales.com.au/bikes/?q=Service%3D%5BBikesales%5D'
    content = get_bike(url)
    html = BeautifulSoup(content, 'html.parser')
    bike_list = html.findAll("a", {"class": "item-link-container"})
    for bike in bike_list:
        bike_content = get_bike(BASE_URL + bike.attrs['href'])
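Because get_bike forwards *args and **kwargs to requests.get, you can also pass extra request options through it; for example (a sketch only, the header value is just an illustration), a timeout and a browser-like User-Agent, which can help with connections that are reset by anti-scraping measures:

bike_content = get_bike(
    BASE_URL + bike.attrs['href'],
    timeout=10,                             # fail fast instead of hanging
    headers={'User-Agent': 'Mozilla/5.0'},  # illustrative header only
)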
https://codereview.stackexchange.com/questions/195378