Scraping web data with Python 3

Code Review user
Asked on 2018-05-29 01:25:09
1 answer · 402 views · 0 followers · score 5

Below is part of the code I wrote to scrape the details of bikes for sale on bikesales.com.au (the full code is here). It finds all the 'href' attributes on each search page and requests the HTML for each bike that is for sale. My code works, but to avoid the following error I had to add retry attempts with exponential backoff:

ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None)

The code works correctly, but if possible I would like to avoid the backoff approach.

import time  # required for the backoff delays below

from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

def get_html_content(url, multiplier=1):
    """
    Retrieve the contents of the url.
    """
    # Be a responsible scraper.
    # The multiplier exponentially increases the delay when there are
    # several attempts at connecting to the url.
    time.sleep(2*multiplier)

    # Get the html from the url
    try:
        with closing(get(url)) as resp:
            content_type = resp.headers['Content-Type'].lower()
            if is_good_response(resp):
                return resp.content
            else:
                # Unable to get the url response
                return None

    except RequestException as e:
        print("Error during requests to {0} : {1}".format(url, str(e)))

if __name__ == '__main__':

    baseUrl = 'https://www.bikesales.com.au/'
    url = 'https://www.bikesales.com.au/bikes/?q=Service%3D%5BBikesales%5D'

    content = get_html_content(url)
    html = BeautifulSoup(content, 'html.parser')
    BikeList = html.findAll("a", {"class": "item-link-container"})

    # Cycle through the list of bikes on each search page.
    for bike in BikeList:

        # Get the URL for each bike.
        individualBikeURL = bike.attrs['href']
        BikeContent = get_html_content(baseUrl+individualBikeURL)

        # Reset the multiplier for each new url
        multiplier = 1

        ## Occasionally the connection is lost, so try again.
        ## I'm not sure why the connection is lost; it might be that the site
        ## is trying to guard against scraping software.

        # If initial attempt to connect to the url was unsuccessful, try again with an increasing delay
        while (BikeContent == None):
            # Limit the exponential delay to 16x
            if (multiplier < 16):
                multiplier *= 2
            BikeContent = get_html_content(baseUrl+individualBikeURL,multiplier)

My question is: am I missing something when making these requests, or is this simply the result of the site rejecting scraping tools?

1 Answer

Code Review user

Accepted answer

Posted on 2018-05-29 14:39:39

  1. I'd guess is_good_response is just checking for a 200 response code.
  2. Merge is_good_response and get_html_content together, along with the retry loop from inside the for-loop in your main code.

That makes the main code:

from requests import get
from bs4 import BeautifulSoup

if __name__ == '__main__':
    baseUrl = 'https://www.bikesales.com.au/'
    url = 'https://www.bikesales.com.au/bikes/?q=Service%3D%5BBikesales%5D'

    content = get_html_content(url)
    html = BeautifulSoup(content, 'html.parser')
    bike_list = html.findAll("a", {"class": "item-link-container"})

    for bike in bike_list:
        individualBikeURL = bike.attrs['href']
        bike_content = get_bike(baseUrl+individualBikeURL)

Which leaves us to focus on:

def get_bike(url):
    multiplier = 1
    while True:
        time.sleep(2*multiplier)
        try:
            with closing(get(url)) as resp:
                content_type = resp.headers['Content-Type'].lower()
                if 200 <= resp.status_code < 300:
                    return resp.content
        except RequestException as e:
            print("Error during requests to {0} : {1}".format(url, str(e)))
        if (multiplier < 16):
            multiplier *= 2
  1. Allow a retry argument. retry should also do different things for different values:
    • None – don't retry at all.
    • -1 – retry forever.
    • a number n – retry n times, doubling the delay each time.
    • an iterator – sleep through the delays it yields.

We can also add another function so that it works like your previous code did.

  1. You don't need to use contextlib.closing, as Response.close "should not normally need to be called explicitly".
  2. You don't need content_type in get_bike.
  3. You should use *args and **kwargs so that you can pass requests.get's parameters through when you need to.
  4. If you take post and the other request methods as an argument, then you can allow the use of that method too.
import time
import itertools
import collections.abc

import requests.exceptions


def request(method, retry=None, *args, **kwargs):
    # Normalise the retry argument into an iterator of sleep delays.
    if retry is None:
        retry = iter([])
    elif retry == -1:
        retry = (2**i for i in itertools.count())
    elif isinstance(retry, int):
        retry = (2**i for i in range(retry))
    elif isinstance(retry, collections.abc.Iterable):
        pass
    else:
        raise ValueError('Unknown retry {retry}'.format(retry=retry))

    # The leading 0 makes the first attempt run without any delay.
    for sleep in itertools.chain([0], retry):
        if sleep:
            time.sleep(sleep)
        try:
            resp = method(*args, **kwargs)
            if 200 <= resp.status_code < 300:
                return resp.content
        except requests.exceptions.RequestException as e:
            print('Error during request: {0}'.format(str(e)))
    return None


def bike_retrys():
    for i in range(5):
        yield 2**i
    while True:
        yield 16
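
For illustration, the retry argument then behaves as follows (these example calls are my own, not part of the original answer; url stands for whatever page you want to fetch):

content = request(requests.get, None, url)           # a single attempt, no retries
content = request(requests.get, 3, url)              # retries after delays of 1, 2 and 4 seconds
content = request(requests.get, -1, url)             # retries forever, doubling the delay each time
content = request(requests.get, bike_retrys(), url)  # custom schedule: 1, 2, 4, 8, 16, 16, ... seconds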

To improve the rest of your code:

  1. Use snake_case.
  2. Constants should be in UPPER_SNAKE_CASE.
  3. Use the code above.
  4. Use import requests rather than from requests import get.
  5. You can make a small helper function to call request, so that its usage is cleaner.
import requests
from bs4 import BeautifulSoup


def get_bike(*args, **kwargs):
    return request(requests.get, bike_retrys(), *args, **kwargs)


if __name__ == '__main__':
    BASE_URL = 'https://www.bikesales.com.au/'
    url = 'https://www.bikesales.com.au/bikes/?q=Service%3D%5BBikesales%5D'

    content = get_bike(url)
    html = BeautifulSoup(content, 'html.parser')
    bike_list = html.findAll("a", {"class": "item-link-container"})

    for bike in bike_list:
        bike_content = get_bike(BASE_URL + bike.attrs['href'])
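
Because request forwards *args and **kwargs straight on to requests.get, you can also pass extra request options without touching any of the code above. If the connection resets are caused by the site rejecting the default python-requests client, sending a browser-like User-Agent header may help; this is my own suggestion rather than part of the original answer, and the header value is only illustrative:

# Inside the for-loop, instead of the plain call:
bike_content = get_bike(BASE_URL + bike.attrs['href'],
                        headers={'User-Agent': 'Mozilla/5.0'})
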
Score: 6

Source: https://codereview.stackexchange.com/questions/195378