
Concurrent multithreading with requests

Stack Overflow user
Asked on 2022-02-05 17:16:33
2 answers · 400 views · 0 followers · Score 1

I am trying to figure out how to make concurrent requests with multithreading while using the requests library. I want to grab the links and total page counts from a POST request for each URL.

However, I am iterating over a very large loop, so it will take a very long time. What I have tried does not seem to make the requests concurrent, and it produces no output.

Here is what I have tried:

#smaller subset of my data

df = {'links': ['https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D687',
'https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D492',
'https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D499',
'https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D702',
'https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D6143'],
 'make': [138.0,138.0,138.0,138.0,138.0],
 'model': [687.0,492.0,499.0,702.0,6143.0],
 'country_id': [6.0,6.0,6.0,6.0,6.0]}

import json
import requests
from bs4 import BeautifulSoup
from collections import defaultdict
from multiprocessing.pool import ThreadPool
import threading
import gc



def get_links(url):
    headers = {
        'authority': 'www.theparking.eu',
        'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="97", "Chromium";v="97"',
        'accept': '*/*',
        'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'x-requested-with': 'XMLHttpRequest',
        'sec-ch-ua-mobile': '?0',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36',
        'sec-ch-ua-platform': '"macOS"',
        'origin': 'https://www.theparking.eu',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'cors',
        'sec-fetch-dest': 'empty',
        'referer': 'https://www.theparking.eu/used-cars/used-cars/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }
    formal_data = defaultdict(list)
    for id_ in df['country_id']:
        for make in df['make']:
            for model in df['model']:
                data = {
                    'ajax': '{"tab_id":"t0","cur_page":1,"cur_trie":"distance","query":"","critere":{"id_pays":[%s],"id_marque":[%s], "id_modele":[%s]},"sliders":{"prix":{"id":"#range_prix","face":"prix","max_counter":983615,"min":"1","max":"400000"},"km":{"id":"#range_km","face":"km","max_counter":1071165,"min":"1","max":"500000"},"millesime":{"id":"#range_millesime","face":"millesime","max_counter":1163610,"min":"1900","max":"2022"}},"req_num":1,"nb_results":"11795660","current_location_distance":-1,"logged_in":false}' % (round(id_), round(make), round(model)),
                    'tabs': '["t0"]'
                            }
                response = requests.post(url, headers=headers, data=data)
                test = json.loads(response.text)
                pages = round(int(test['context']['nb_results'])/27)
                if pages != 0:
                    formal_data['total_pages'].append(pages)
                    formal_data['links'].append(url)
                    print(f'You are on this link:{url}')
    return formal_data
threadLocal = threading.local()

with ThreadPool(8) as pool:
    urls = df['links']
    pool.map(get_links, urls)
    # must be done before terminate is explicitly or implicitly called on the pool:
    del threadLocal
    gc.collect()

2 Answers

Stack Overflow user

Accepted answer

Posted on 2022-02-05 22:41:51

Note that a more modern way to use requests asynchronously is to use other libraries, such as requests-threads.

With your approach, you connect to the various URLs in parallel, but to each URL sequentially. So you may not be getting the full benefit of multithreading. In fact, for a single URL in df['links'] you get the same result as with a single thread. The simplest way to avoid this is itertools.product, which turns what would otherwise be nested loops into a single iterator.
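For illustration, with dummy values, here is how product flattens nested loops into one iterable of tuples, each of which a pool worker can unpack independently:

```python
from itertools import product

# Two dummy links and one country id, standing in for the df columns
links = ['url_a', 'url_b']
ids = [6.0]

# product() yields every (link, id) combination without explicit nesting
combos = list(product(links, ids))
print(combos)  # [('url_a', 6.0), ('url_b', 6.0)]
```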

import requests
from concurrent.futures import ThreadPoolExecutor as ThreadPool
from itertools import product

#   ... snipped df definition ...

def get_links(packed_pars):
    url, id_, make, model = packed_pars
    headers = {
        'authority': 'www.theparking.eu',
        'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="97","Chromium";v="97"',
        'accept': '*/*',
        'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'x-requested-with': 'XMLHttpRequest',
        'sec-ch-ua-mobile': '?0',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36',
        'sec-ch-ua-platform': '"macOS"',
        'origin': 'https://www.theparking.eu',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'cors',
        'sec-fetch-dest': 'empty',
        'referer': 'https://www.theparking.eu/used-cars/used-cars/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }
    data = {
        'ajax': '{"tab_id":"t0","cur_page":1,"cur_trie":"distance","query":"","critere":{"id_pays":[%s],"id_marque":[%s], "id_modele":[%s]},"sliders":{"prix":{"id":"#range_prix","face":"prix","max_counter":983615,"min":"1","max":"400000"},"km":{"id":"#range_km","face":"km","max_counter":1071165,"min":"1","max":"500000"},"millesime":{"id":"#range_millesime","face":"millesime","max_counter":1163610,"min":"1900","max":"2022"}},"req_num":1,"nb_results":"11795660","current_location_distance":-1,"logged_in":false}' % (round(id_), round(make), round(model)),
        'tabs': '["t0"]'
    }
    response = requests.post(url, headers=headers, data=data)
    test = response.json()
    pages = round(int(test['context']['nb_results'])/27)
    if pages != 0:
        print(f'You are on this link:{url}, with {pages} pages')
    else:
        print("no pages")
    return url, pages


with ThreadPool(8) as pool:
    rv = pool.map(get_links, product(df['links'], df['country_id'], df['make'],
                                     df['model']))
    # This converts rv to the dict of the original post:
    formal_data = dict()
    filtered_list = [(url, pages) for url, pages in rv if pages]
    if filtered_list:
        formal_data['links'], formal_data['total_pages'] = zip(*filtered_list)
    else:  # Protect against empty answers
        formal_data['links'], formal_data['total_pages'] = [], []

As for why this produces no output: in the end, with the data provided in the question, test['context']['nb_results'] is 0 every time. Even with the full dataset, it is quite possible that every query returns zero items.

A few other comments:

  • Using multiprocessing.pool.ThreadPool is not recommended: you should switch to concurrent.futures.ThreadPoolExecutor.
  • You are not using threadLocal at all: it can be deleted. I do not know what you would use it for.
  • You are importing threading but not using it.
  • The requests response has a json method that parses the text directly: there is no need to import json in this case.
  • You most likely want to ceil rather than round the number of pages.
  • Since you are waiting on I/O, you can use more threads than available cores.
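The ceil-vs-round point above can be shown with made-up numbers: at 27 results per page, 28 results need 2 pages, but round() reports 1 and the last item would be missed.

```python
import math

results, per_page = 28, 27

# 28 / 27 is about 1.04; round() truncates that down to 1 page
print(round(results / per_page))      # 1

# math.ceil() always rounds partial pages up, so nothing is dropped
print(math.ceil(results / per_page))  # 2
```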
Score 2

Stack Overflow user

Posted on 2022-02-05 17:44:02

One thing to note: for I/O-bound programs like this one (where the performance cost is waiting on requests to another machine/server/etc.), the more common approach is asynchronous programming. A good library for async HTTP requests is httpx (there are others). You will find that these libraries have an interface similar to requests and allow both async and sync usage, so they should be easy to pick up. From there, you will want to learn about asynchronous programming in Python. Quick-start guides to asyncio, as well as other good tutorials on Python async programming in general, can be found via Google.

As you can see, this is the approach other Python wrapper libraries favor for async.

Just as a quick note on why async is good versus multiprocessing:

  1. Async actually allows a single process/thread to execute other parts of the program while one part is waiting on output, so essentially all the code feels like it is executing in parallel.
  2. multiprocessing actually launches separate processes (I am paraphrasing slightly, but that is the gist), and will most likely not give you the same performance gain that async does here.
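The single-thread concurrency described above can be sketched with asyncio alone. The fake_fetch coroutine below is a stand-in: in real code its body would be an httpx.AsyncClient post call instead of asyncio.sleep, and the URLs here are dummies.

```python
import asyncio

async def fake_fetch(url):
    # Stand-in for network latency; a real version would await an
    # httpx.AsyncClient request here instead of sleeping
    await asyncio.sleep(0.01)
    return url

async def main(urls):
    # gather() runs all the coroutines concurrently on one thread,
    # switching between them whenever one is waiting on I/O
    return await asyncio.gather(*(fake_fetch(u) for u in urls))

results = asyncio.run(main(['u1', 'u2', 'u3']))
print(results)  # ['u1', 'u2', 'u3']
```

gather() preserves input order in its results, so the output lines up with the URL list even though the waits overlap.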
Score 1
Original content provided by Stack Overflow; translation supported by Tencent Cloud's engine for the IT domain.
Original link:

https://stackoverflow.com/questions/71000328
