This is my first async/aiohttp web scraper. I've been trying to get my head around Python's asyncio/aiohttp libraries lately, and I'm not sure I fully understand them yet, so I'm hoping for some constructive review comments here.
I'm scraping Spoonflower, which exposes some public APIs for design data as well as pricing data for each fabric type. My challenge is to get the design name, the creator's screen name, and the price of each design per fabric type. The design name and creator name come from this endpoint:

https://pythias.spoonflower.com/search/v1/designs?lang=en&page_offset=0&sort=bestSelling&product=Fabric&forSale=true&showMatureContent=false&page_locale=en
and the additional pricing data for each fabric type comes from this endpoint:
https://api-gateway.spoonflower.com/alpenrose/pricing/fabrics/FABRIC_' + fab_type + '?quantity=1&shipping_country=PK&currency=EUR&measurement_system=METRIC&design_id=' + str(item['designId']) + '&page_locale=en
There are 84 items per page and 24 fabric types. I first get all the fabric type names and store them in a list, so I can loop through it, change the URL dynamically, extract the designName and screenName from the design page, and finally pull the pricing data.
Here is my code:
import asyncio
import aiohttp
import json
import requests
from bs4 import BeautifulSoup
from collections import OrderedDict
item_endpoint = 'https://pythias.spoonflower.com/search/v1/designs?lang=en&page_offset=0&sort=bestSelling&product=Fabric&forSale=true&showMatureContent=false&page_locale=en'
def get_fabric_names():
    res = requests.get('https://www.spoonflower.com/spoonflower_fabrics')
    soup = BeautifulSoup(res.text, 'lxml')
    fabrics = [fabric.find('h2').text.strip() for fabric in soup.find_all('div', {'class': 'product_detail medium_text'})]
    fabric = [("_".join(fab.upper().replace(u"\u2122", '').split())) for fab in fabrics]
    for index in range(len(fabric)):
        if 'COTTON_LAWN_(BETA)' in fabric[index]:
            fabric[index] = 'COTTON_LAWN_APPAREL'
        elif 'COTTON_POPLIN' in fabric[index]:
            fabric[index] = 'COTTON_POPLIN_BRAVA'
        elif 'ORGANIC_COTTON_KNIT' in fabric[index]:
            fabric[index] = 'ORGANIC_COTTON_KNIT_PRIMA'
        elif 'PERFORMANCE_PIQUÉ' in fabric[index]:
            fabric[index] = 'PERFORMANCE_PIQUE'
        elif 'CYPRESS_COTTON' in fabric[index]:
            fabric[index] = 'CYPRESS_COTTON_BRAVA'
    return fabric
async def fetch_design_endpoint(session, design_url):
    async with session.get(design_url) as response:
        extracting_endpoint = await response.text()
        _json_object = json.loads(extracting_endpoint)
        return _json_object['page_results']
async def fetch_pricing_data(session, pricing_endpoint):
    async with session.get(pricing_endpoint) as response:
        data_endpoint = await response.text()
        _json_object = json.loads(data_endpoint)
        items_dict = OrderedDict()
        for item in await fetch_design_endpoint(session, item_endpoint):
            designName = item['name']
            screenName = item['user']['screenName']
            fabric_name = _json_object['data']['fabric_code']
            try:
                test_swatch_meter = _json_object['data']['pricing']['TEST_SWATCH_METER']['price']
            except:
                test_swatch_meter = 'N/A'
            try:
                fat_quarter_meter = _json_object['data']['pricing']['FAT_QUARTER_METER']['price']
            except:
                fat_quarter_meter = 'N/A'
            try:
                meter = _json_object['data']['pricing']['METER']['price']
            except:
                meter = 'N/A'
            #print(designName, screenName, fabric_name, test_swatch_meter, fat_quarter_meter, meter)
            if (designName, screenName) not in items_dict.keys():
                items_dict[(designName, screenName)] = {}
            itemCount = len(items_dict[(designName, screenName)].values()) / 4
            return items_dict[(designName, screenName)].update({'fabric_name_%02d' % itemCount: fabric_name,
                                                                'test_swatch_meter_%02d' % itemCount: test_swatch_meter,
                                                                'fat_quarter_meter_%02d' % itemCount: fat_quarter_meter,
                                                                'meter_%02d' % itemCount: meter})
async def main():
    tasks = []
    async with aiohttp.ClientSession() as session:
        fabric_type = get_fabric_names()
        design_page = await fetch_design_endpoint(session, item_endpoint)
        for item in design_page:
            for fab_type in fabric_type[0:-3]:
                pricing_url = 'https://api-gateway.spoonflower.com/alpenrose/pricing/fabrics/FABRIC_' + fab_type + '?quantity=1&shipping_country=PK&currency=EUR&measurement_system=METRIC&design_id=' + str(item['designId']) + '&page_locale=en'
                print(pricing_url)
                await fetch_pricing_data(session, pricing_url)
                tasks.append(asyncio.create_task(
                    fetch_pricing_data(session, pricing_url)
                    )
                )
        content = await asyncio.gather(*tasks)
        return content
results = asyncio.run(main())
print(results)

Any thoughts and suggestions on making this scraper more Pythonic and smarter are welcome.
Posted on 2021-06-20 06:35:19
Something not covered in the other answers:
These two lines in the main function mean that every page is requested and parsed twice. They also mean your code isn't really asynchronous, since you walk through all the items and await each one individually inside the loop. You should be able to remove the first one:

await fetch_pricing_data(session, pricing_url)
tasks.append(asyncio.create_task(fetch_pricing_data(session, pricing_url)))

Also, you could spawn async tasks in this function instead of the current plain for loop:

for item in await fetch_design_endpoint(session, item_endpoint):

Once you hoist that function's return value out of the loop, this could speed up your code considerably.
However, your call to fetch_design_endpoint doesn't actually depend on anything specific inside fetch_pricing_data. That's either a bug (did you mean to vary the URL?), or the result could be retrieved just once.
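Putting those points together: fetch the design listing once, build all the pricing tasks up front, and gather them. The sketch below simulates the requests with asyncio.sleep as a stand-in for session.get (the real endpoints need a network); build_pricing_url is a hypothetical helper mirroring the question's string concatenation, and the design IDs and fabric names are made up.

```python
import asyncio

def build_pricing_url(design_id, fab_type):
    # Hypothetical helper mirroring the question's URL concatenation.
    return ('https://api-gateway.spoonflower.com/alpenrose/pricing/fabrics/FABRIC_'
            + fab_type + '?quantity=1&shipping_country=PK&currency=EUR'
            + '&measurement_system=METRIC&design_id=' + str(design_id)
            + '&page_locale=en')

async def fetch_pricing_data(pricing_url):
    # Stand-in for `async with session.get(pricing_url)`: simulate latency only.
    await asyncio.sleep(0.01)
    return {'url': pricing_url}

async def main():
    design_ids = [101, 102]                                # stand-in for the one design-page fetch
    fabric_types = ['PETAL_SIGNATURE_COTTON', 'SATIN']     # made-up fabric list
    # Create every task first, then await them together so the requests
    # overlap, instead of awaiting each one inside the loop.
    tasks = [asyncio.create_task(fetch_pricing_data(build_pricing_url(d, f)))
             for d in design_ids
             for f in fabric_types]
    return await asyncio.gather(*tasks)

results = asyncio.run(main())
print(len(results))  # one result per (design, fabric) pair
```

Because all four coroutines sleep concurrently, the whole batch finishes in roughly one request's latency rather than four.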
Posted on 2021-06-19 17:36:07
I'm not very familiar with async operations, but I have a couple of small observations about your program.

In modern Python, indexing into a list with an int is a code smell. Instead of

for index in range(len(fabric)):

you can write

for fabric_type in fabric:

Another observation: you are using bare except blocks rather than explicitly catching KeyError. If you catch it explicitly, you save the reader the mental effort of going back to figure out why you expect an exception to be raised there.
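For instance, the three try/except blocks in fetch_pricing_data could each name KeyError. A small sketch (the nested dict shape mirrors the question's pricing JSON; the price value is made up and price_for is a hypothetical helper):

```python
# Shape mirrors the question's pricing payload; the value is invented.
payload = {'data': {'pricing': {'METER': {'price': 28.0}}}}

def price_for(payload, size):
    # Catching KeyError explicitly documents the one failure we expect:
    # this fabric size may simply be absent from the pricing dict.
    try:
        return payload['data']['pricing'][size]['price']
    except KeyError:
        return 'N/A'

print(price_for(payload, 'METER'))              # 28.0
print(price_for(payload, 'TEST_SWATCH_METER'))  # N/A
```

A bare except would also swallow unrelated errors such as a TypeError from a malformed payload, hiding real bugs; naming KeyError keeps those visible.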
https://codereview.stackexchange.com/questions/263221