I'm trying to figure out how to make concurrent requests with multithreading while using the requests library. I want to get the links and the total number of pages from a POST request to each URL.
However, I'm iterating over a very large loop, so it takes a very long time. What I've tried doesn't seem to make the requests concurrent, and it produces no output.
Here is what I tried:
#smaller subset of my data
df = {'links': ['https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D687',
'https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D492',
'https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D499',
'https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D702',
'https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D6143'],
'make': [138.0,138.0,138.0,138.0,138.0],
'model': [687.0,492.0,499.0,702.0,6143.0],
'country_id': [6.0,6.0,6.0,6.0,6.0]}
import json  # needed for json.loads below
import requests
from bs4 import BeautifulSoup
from collections import defaultdict  # needed for defaultdict(list) below
from multiprocessing.pool import ThreadPool
import threading
import gc
def get_links(url):
    headers = {
        'authority': 'www.theparking.eu',
        'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="97", "Chromium";v="97"',
        'accept': '*/*',
        'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'x-requested-with': 'XMLHttpRequest',
        'sec-ch-ua-mobile': '?0',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36',
        'sec-ch-ua-platform': '"macOS"',
        'origin': 'https://www.theparking.eu',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'cors',
        'sec-fetch-dest': 'empty',
        'referer': 'https://www.theparking.eu/used-cars/used-cars/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }
    formal_data = defaultdict(list)
    for id_ in df['country_id']:
        for make in df['make']:
            for model in df['model']:
                data = {
                    'ajax': '{"tab_id":"t0","cur_page":1,"cur_trie":"distance","query":"","critere":{"id_pays":[%s],"id_marque":[%s], "id_modele":[%s]},"sliders":{"prix":{"id":"#range_prix","face":"prix","max_counter":983615,"min":"1","max":"400000"},"km":{"id":"#range_km","face":"km","max_counter":1071165,"min":"1","max":"500000"},"millesime":{"id":"#range_millesime","face":"millesime","max_counter":1163610,"min":"1900","max":"2022"}},"req_num":1,"nb_results":"11795660","current_location_distance":-1,"logged_in":false}' % (round(id_), round(make), round(model)),
                    'tabs': '["t0"]'
                }
                response = requests.post(url, headers=headers, data=data)
                test = json.loads(response.text)
                pages = round(int(test['context']['nb_results'])/27)
                if pages != 0:
                    formal_data['total_pages'].append(pages)
                    formal_data['links'].append(url)
                    print(f'You are on this link:{url}')
    return formal_data
threadLocal = threading.local()
with ThreadPool(8) as pool:
    urls = df['links']
    pool.map(get_links, urls)
    # must be done before terminate is explicitly or implicitly called on the pool:
    del threadLocal
    gc.collect()
Posted on 2022-02-05 22:41:51
Note that a more modern way to use requests asynchronously is via other libraries, such as requests-threads.
With your approach, you connect to the various URLs in parallel, but to each URL sequentially: each thread runs the full nested loop for its single URL. So you may not be taking full advantage of multithreading; in fact, for a single URL in df['links'] you get the same result as with a single thread. The simplest way to avoid this is itertools.product, which turns what would otherwise be nested loops into a single iterator.
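To see what product does here, a minimal sketch with toy stand-in values (not the real links or IDs) shows how the nested loops become one flat stream of work items:

```python
from itertools import product

# product() yields every combination of its inputs as one flat iterator,
# so each (url, id, make, model) tuple becomes an independent work item
# that pool.map can hand to any idle thread.
urls = ['url_a', 'url_b']   # toy stand-ins for df['links']
country_ids = [6.0]
makes = [138.0]
models = [687.0, 492.0]

combos = list(product(urls, country_ids, makes, models))
print(len(combos))  # 4: 2 urls x 1 id x 1 make x 2 models
print(combos[0])    # ('url_a', 6.0, 138.0, 687.0)
```

Because every combination is a separate item, eight threads can each be working on a different (url, id, make, model) tuple at once, instead of one thread grinding through all combinations for one URL.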
import requests
from concurrent.futures import ThreadPoolExecutor as ThreadPool
from itertools import product
# ... snipped df definition ...
def get_links(packed_pars):
    url, id_, make, model = packed_pars
    headers = {
        'authority': 'www.theparking.eu',
        'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="97","Chromium";v="97"',
        'accept': '*/*',
        'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'x-requested-with': 'XMLHttpRequest',
        'sec-ch-ua-mobile': '?0',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36',
        'sec-ch-ua-platform': '"macOS"',
        'origin': 'https://www.theparking.eu',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'cors',
        'sec-fetch-dest': 'empty',
        'referer': 'https://www.theparking.eu/used-cars/used-cars/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }
    data = {
        'ajax': '{"tab_id":"t0","cur_page":1,"cur_trie":"distance","query":"","critere":{"id_pays":[%s],"id_marque":[%s], "id_modele":[%s]},"sliders":{"prix":{"id":"#range_prix","face":"prix","max_counter":983615,"min":"1","max":"400000"},"km":{"id":"#range_km","face":"km","max_counter":1071165,"min":"1","max":"500000"},"millesime":{"id":"#range_millesime","face":"millesime","max_counter":1163610,"min":"1900","max":"2022"}},"req_num":1,"nb_results":"11795660","current_location_distance":-1,"logged_in":false}' % (round(id_), round(make), round(model)),
        'tabs': '["t0"]'
    }
    response = requests.post(url, headers=headers, data=data)
    test = response.json()
    pages = round(int(test['context']['nb_results'])/27)
    if pages != 0:
        print(f'You are on this link:{url}, with {pages} pages')
    else:
        print("no pages")
    return url, pages
with ThreadPool(8) as pool:
    rv = pool.map(get_links, product(df['links'], df['country_id'], df['make'],
                                     df['model']))

# This converts rv to the dict of the original post:
formal_data = dict()
filtered_list = [(url, pages) for url, pages in rv if pages]
if filtered_list:
    formal_data['links'], formal_data['total_pages'] = zip(*filtered_list)
else:  # Protect against empty answers
    formal_data['links'], formal_data['total_pages'] = [], []
As for why your code produces no output: with the data given in the question, test['context']['nb_results'] is 0 every time, so the if branch never runs. Even with the full dataset, many queries may well return zero items.
Some other comments:
- multiprocessing.pool.ThreadPool: you should switch to concurrent.futures.ThreadPoolExecutor, the more modern interface.
- threadLocal: it can be removed; I don't see what you would use it for.
- You import threading but never use it.
- The requests response has a json method that parses the text directly, so there is no need to import json in this example.
- You probably want math.ceil rather than round for the number of pages, so a final partial page is not dropped.
Posted on 2022-02-05 17:44:02
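To illustrate the last point above, a quick sketch of why ceil is the right choice for a page count (27 results per page, as in the code):

```python
import math

RESULTS_PER_PAGE = 27

def page_count(nb_results: int) -> int:
    # ceil counts the final partial page; round drops it whenever the
    # remainder is less than half a page.
    return math.ceil(nb_results / RESULTS_PER_PAGE)

print(page_count(27))   # 1
print(page_count(28))   # 2: one full page plus one leftover result
print(round(28 / 27))   # 1: round silently loses the last page
```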
One thing to note: for I/O-bound programs like web scraping (where the performance cost is waiting on responses from another machine/server/etc.), the more common approach is asynchronous programming. A good library for async HTTP requests is httpx (there are others). You will find its interface similar to requests, and it supports both async and sync use, so it should be easy to pick up. From there you will want to learn about asynchronous programming in Python; the httpx QuickStart and async pages, along with other good tutorials on general Python async programming, are easy to find via Google.
As you can see, this is the approach that other Python wrapper libraries favor for async work.
As a quick note on why async can be preferable to multiprocessing for this kind of work, see:
https://stackoverflow.com/questions/71000328