I'm trying to figure out how to make concurrent requests with multithreading while using the requests library. I want to get the links and the total number of pages from a POST request to each URL.
However, I'm iterating over a very large loop, so it takes a very long time. What I've tried doesn't seem to make the requests concurrent, and it produces no output.
Here is what I tried:
#smaller subset of my data
df = {'links': ['https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D687',
'https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D492',
'https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D499',
'https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D702',
'https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D6143'],
'make': [138.0,138.0,138.0,138.0,138.0],
'model': [687.0,492.0,499.0,702.0,6143.0],
'country_id': [6.0,6.0,6.0,6.0,6.0]}
import json  # needed for json.loads below
import requests
from bs4 import BeautifulSoup
from collections import defaultdict  # needed for defaultdict(list) below
from multiprocessing.pool import ThreadPool
import threading
import gc
def get_links(url):
    headers = {
        'authority': 'www.theparking.eu',
        'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="97", "Chromium";v="97"',
        'accept': '*/*',
        'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'x-requested-with': 'XMLHttpRequest',
        'sec-ch-ua-mobile': '?0',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36',
        'sec-ch-ua-platform': '"macOS"',
        'origin': 'https://www.theparking.eu',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'cors',
        'sec-fetch-dest': 'empty',
        'referer': 'https://www.theparking.eu/used-cars/used-cars/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }
    formal_data = defaultdict(list)
    for id_ in df['country_id']:
        for make in df['make']:
            for model in df['model']:
                data = {
                    'ajax': '{"tab_id":"t0","cur_page":1,"cur_trie":"distance","query":"","critere":{"id_pays":[%s],"id_marque":[%s], "id_modele":[%s]},"sliders":{"prix":{"id":"#range_prix","face":"prix","max_counter":983615,"min":"1","max":"400000"},"km":{"id":"#range_km","face":"km","max_counter":1071165,"min":"1","max":"500000"},"millesime":{"id":"#range_millesime","face":"millesime","max_counter":1163610,"min":"1900","max":"2022"}},"req_num":1,"nb_results":"11795660","current_location_distance":-1,"logged_in":false}' % (round(id_), round(make), round(model)),
                    'tabs': '["t0"]'
                }
                response = requests.post(url, headers=headers, data=data)
                test = json.loads(response.text)
                pages = round(int(test['context']['nb_results'])/27)
                if pages != 0:
                    formal_data['total_pages'].append(pages)
                    formal_data['links'].append(url)
                    print(f'You are on this link:{url}')
    return formal_data
threadLocal = threading.local()
with ThreadPool(8) as pool:
    urls = df['links']
    pool.map(get_links, urls)
    # must be done before terminate is explicitly or implicitly called on the pool:
    del threadLocal
    gc.collect()
Posted on 2022-02-05 22:41:51
Note that a more modern way to use requests asynchronously is via other libraries, such as requests-threads.
With your approach, you connect to the various URLs in parallel, but to each URL sequentially: each thread runs the full nested loop for its single URL. So you may not be taking full advantage of multithreading; in fact, for a single URL in df['links'] you get the same result as with a single thread. The simplest way to avoid this is itertools.product, which turns what would otherwise be nested loops into a single iterator.
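To see what product does here, a minimal sketch with toy stand-in values (not the real links or IDs) shows how the nested loops become one flat stream of work items:

```python
from itertools import product

# product() yields every combination of its inputs as one flat iterator,
# so each (url, id, make, model) tuple becomes an independent work item
# that pool.map can hand to any idle thread.
urls = ['url_a', 'url_b']   # toy stand-ins for df['links']
country_ids = [6.0]
makes = [138.0]
models = [687.0, 492.0]

combos = list(product(urls, country_ids, makes, models))
print(len(combos))  # 4: 2 urls x 1 id x 1 make x 2 models
print(combos[0])    # ('url_a', 6.0, 138.0, 687.0)
```

Because every combination is a separate item, eight threads can each be working on a different (url, id, make, model) tuple at once, instead of one thread grinding through all combinations for one URL.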
import requests
from concurrent.futures import ThreadPoolExecutor as ThreadPool
from itertools import product
# ... snipped df definition ...
def get_links(packed_pars):
    url, id_, make, model = packed_pars
    headers = {
        'authority': 'www.theparking.eu',
        'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="97","Chromium";v="97"',
        'accept': '*/*',
        'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'x-requested-with': 'XMLHttpRequest',
        'sec-ch-ua-mobile': '?0',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36',
        'sec-ch-ua-platform': '"macOS"',
        'origin': 'https://www.theparking.eu',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'cors',
        'sec-fetch-dest': 'empty',
        'referer': 'https://www.theparking.eu/used-cars/used-cars/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }
    data = {
        'ajax': '{"tab_id":"t0","cur_page":1,"cur_trie":"distance","query":"","critere":{"id_pays":[%s],"id_marque":[%s], "id_modele":[%s]},"sliders":{"prix":{"id":"#range_prix","face":"prix","max_counter":983615,"min":"1","max":"400000"},"km":{"id":"#range_km","face":"km","max_counter":1071165,"min":"1","max":"500000"},"millesime":{"id":"#range_millesime","face":"millesime","max_counter":1163610,"min":"1900","max":"2022"}},"req_num":1,"nb_results":"11795660","current_location_distance":-1,"logged_in":false}' % (round(id_), round(make), round(model)),
        'tabs': '["t0"]'
    }
    response = requests.post(url, headers=headers, data=data)
    test = response.json()
    pages = round(int(test['context']['nb_results'])/27)
    if pages != 0:
        print(f'You are on this link:{url}, with {pages} pages')
    else:
        print("no pages")
    return url, pages
with ThreadPool(8) as pool:
    rv = pool.map(get_links, product(df['links'], df['country_id'], df['make'],
                                     df['model']))

# This converts rv to the dict of the original post:
formal_data = dict()
filtered_list = [(url, pages) for url, pages in rv if pages]
if filtered_list:
    formal_data['links'], formal_data['total_pages'] = zip(*filtered_list)
else:  # Protect against empty answers
    formal_data['links'], formal_data['total_pages'] = [], []
As for why your code produces no output: with the data given in the question, test['context']['nb_results'] is 0 every time, so the if branch never runs. Even with the full dataset, many queries may well return zero items.
Some other comments:
- multiprocessing.pool.ThreadPool: you should switch to concurrent.futures.ThreadPoolExecutor, the more modern interface.
- threadLocal: it can be removed; I don't see what you would use it for.
- You import threading but never use it.
- The requests response has a json method that parses the text directly, so there is no need to import json in this example.
- You probably want math.ceil rather than round for the number of pages, so a final partial page is not dropped.
Posted on 2022-02-05 17:44:02
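To illustrate the last point above, a quick sketch of why ceil is the right choice for a page count (27 results per page, as in the code):

```python
import math

RESULTS_PER_PAGE = 27

def page_count(nb_results: int) -> int:
    # ceil counts the final partial page; round drops it whenever the
    # remainder is less than half a page.
    return math.ceil(nb_results / RESULTS_PER_PAGE)

print(page_count(27))   # 1
print(page_count(28))   # 2: one full page plus one leftover result
print(round(28 / 27))   # 1: round silently loses the last page
```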
One thing to note: for I/O-bound programs like web scraping (where the performance cost is waiting on responses from another machine/server/etc.), the more common approach is asynchronous programming. A good library for async HTTP requests is httpx (there are others). You will find its interface similar to requests, and it supports both async and sync use, so it should be easy to pick up. From there you will want to learn about asynchronous programming in Python; the httpx QuickStart and async pages, along with other good tutorials on general Python async programming, are easy to find via Google.
As you can see, this is the approach that other Python wrapper libraries favor for async work.
As a quick note on why async can be preferable to multiprocessing for this kind of work, see:
https://stackoverflow.com/questions/71000328