This is my first async/aiohttp web scraper. I've been trying to get my head around Python's asyncio/aiohttp libraries lately, and I'm not sure I fully understand them yet, so I'm hoping for some constructive review comments here.
I'm scraping Spoonflower, which exposes some public APIs for design data as well as pricing data for each fabric type. My challenge is to get the design name, the creator's screen name, and the price of each design per fabric type. The design name and creator name come from this endpoint:

https://pythias.spoonflower.com/search/v1/designs?lang=en&page_offset=0&sort=bestSelling&product=Fabric&forSale=true&showMatureContent=false&page_locale=en
and the additional pricing data for each fabric type comes from this endpoint:
https://api-gateway.spoonflower.com/alpenrose/pricing/fabrics/FABRIC_' + fab_type + '?quantity=1&shipping_country=PK&currency=EUR&measurement_system=METRIC&design_id=' + str(item['designId']) + '&page_locale=en
There are 84 items per page and 24 fabric types. I first get all the fabric type names and store them in a list, so I can loop through it, change the URL dynamically, extract the designName and screenName from the design page, and finally pull the pricing data.
Here is my code:
import asyncio
import aiohttp
import json
import requests
from bs4 import BeautifulSoup
from collections import OrderedDict
item_endpoint = 'https://pythias.spoonflower.com/search/v1/designs?lang=en&page_offset=0&sort=bestSelling&product=Fabric&forSale=true&showMatureContent=false&page_locale=en'
def get_fabric_names():
    res = requests.get('https://www.spoonflower.com/spoonflower_fabrics')
    soup = BeautifulSoup(res.text, 'lxml')
    fabrics = [fabric.find('h2').text.strip() for fabric in soup.find_all('div', {'class': 'product_detail medium_text'})]
    fabric = [("_".join(fab.upper().replace(u"\u2122", '').split())) for fab in fabrics]
    for index in range(len(fabric)):
        if 'COTTON_LAWN_(BETA)' in fabric[index]:
            fabric[index] = 'COTTON_LAWN_APPAREL'
        elif 'COTTON_POPLIN' in fabric[index]:
            fabric[index] = 'COTTON_POPLIN_BRAVA'
        elif 'ORGANIC_COTTON_KNIT' in fabric[index]:
            fabric[index] = 'ORGANIC_COTTON_KNIT_PRIMA'
        elif 'PERFORMANCE_PIQUÉ' in fabric[index]:
            fabric[index] = 'PERFORMANCE_PIQUE'
        elif 'CYPRESS_COTTON' in fabric[index]:
            fabric[index] = 'CYPRESS_COTTON_BRAVA'
    return fabric
async def fetch_design_endpoint(session, design_url):
    async with session.get(design_url) as response:
        extracting_endpoint = await response.text()
        _json_object = json.loads(extracting_endpoint)
        return _json_object['page_results']
async def fetch_pricing_data(session, pricing_endpoint):
    async with session.get(pricing_endpoint) as response:
        data_endpoint = await response.text()
        _json_object = json.loads(data_endpoint)
        items_dict = OrderedDict()
        for item in await fetch_design_endpoint(session, item_endpoint):
            designName = item['name']
            screenName = item['user']['screenName']
            fabric_name = _json_object['data']['fabric_code']
            try:
                test_swatch_meter = _json_object['data']['pricing']['TEST_SWATCH_METER']['price']
            except:
                test_swatch_meter = 'N/A'
            try:
                fat_quarter_meter = _json_object['data']['pricing']['FAT_QUARTER_METER']['price']
            except:
                fat_quarter_meter = 'N/A'
            try:
                meter = _json_object['data']['pricing']['METER']['price']
            except:
                meter = 'N/A'
            #print(designName, screenName, fabric_name, test_swatch_meter, fat_quarter_meter, meter)
            if (designName, screenName) not in items_dict.keys():
                items_dict[(designName, screenName)] = {}
            itemCount = len(items_dict[(designName, screenName)].values()) / 4
            return items_dict[(designName, screenName)].update({'fabric_name_%02d' % itemCount: fabric_name,
                                                                'test_swatch_meter_%02d' % itemCount: test_swatch_meter,
                                                                'fat_quarter_meter_%02d' % itemCount: fat_quarter_meter,
                                                                'meter_%02d' % itemCount: meter})
async def main():
    tasks = []
    async with aiohttp.ClientSession() as session:
        fabric_type = get_fabric_names()
        design_page = await fetch_design_endpoint(session, item_endpoint)
        for item in design_page:
            for fab_type in fabric_type[0:-3]:
                pricing_url = 'https://api-gateway.spoonflower.com/alpenrose/pricing/fabrics/FABRIC_' + fab_type + '?quantity=1&shipping_country=PK&currency=EUR&measurement_system=METRIC&design_id=' + str(item['designId']) + '&page_locale=en'
                print(pricing_url)
                await fetch_pricing_data(session, pricing_url)
                tasks.append(asyncio.create_task(
                    fetch_pricing_data(session, pricing_url)
                    )
                )
        content = await asyncio.gather(*tasks)
        return content
results = asyncio.run(main())
print(results)

Any thoughts and suggestions on making this scraper more Pythonic and smarter are welcome.
Posted on 2021-06-20 06:35:19
Something not covered in the other answers:
These two lines in the main function mean that every page is requested and parsed twice. They also mean your code isn't really asynchronous, since you walk through all the items and await each one individually inside the loop. You should be able to remove the first one:

await fetch_pricing_data(session, pricing_url)
tasks.append(asyncio.create_task(fetch_pricing_data(session, pricing_url)))

Also, you could spawn async tasks in this function instead of the current plain for loop:

for item in await fetch_design_endpoint(session, item_endpoint):

Once you hoist that function's return value out of the loop, this could speed up your code considerably.
However, your call to fetch_design_endpoint doesn't actually depend on anything specific inside fetch_pricing_data. That's either a bug (did you mean to vary the URL?), or the result could be retrieved just once.
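Putting those points together: fetch the design listing once, build all the pricing tasks up front, and gather them. The sketch below simulates the requests with asyncio.sleep as a stand-in for session.get (the real endpoints need a network); build_pricing_url is a hypothetical helper mirroring the question's string concatenation, and the design IDs and fabric names are made up.

```python
import asyncio

def build_pricing_url(design_id, fab_type):
    # Hypothetical helper mirroring the question's URL concatenation.
    return ('https://api-gateway.spoonflower.com/alpenrose/pricing/fabrics/FABRIC_'
            + fab_type + '?quantity=1&shipping_country=PK&currency=EUR'
            + '&measurement_system=METRIC&design_id=' + str(design_id)
            + '&page_locale=en')

async def fetch_pricing_data(pricing_url):
    # Stand-in for `async with session.get(pricing_url)`: simulate latency only.
    await asyncio.sleep(0.01)
    return {'url': pricing_url}

async def main():
    design_ids = [101, 102]                                # stand-in for the one design-page fetch
    fabric_types = ['PETAL_SIGNATURE_COTTON', 'SATIN']     # made-up fabric list
    # Create every task first, then await them together so the requests
    # overlap, instead of awaiting each one inside the loop.
    tasks = [asyncio.create_task(fetch_pricing_data(build_pricing_url(d, f)))
             for d in design_ids
             for f in fabric_types]
    return await asyncio.gather(*tasks)

results = asyncio.run(main())
print(len(results))  # one result per (design, fabric) pair
```

Because all four coroutines sleep concurrently, the whole batch finishes in roughly one request's latency rather than four.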
Posted on 2021-06-19 17:36:07
I'm not very familiar with async operations, but I have a couple of small observations about your program.

In modern Python, indexing into a list with an int is a code smell. Instead of

for index in range(len(fabric)):

you can write

for fabric_type in fabric:

Another observation: you are using bare except blocks rather than explicitly catching KeyError. If you catch it explicitly, you save the reader the mental effort of going back to figure out why you expect an exception to be raised there.
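For instance, the three try/except blocks in fetch_pricing_data could each name KeyError. A small sketch (the nested dict shape mirrors the question's pricing JSON; the price value is made up and price_for is a hypothetical helper):

```python
# Shape mirrors the question's pricing payload; the value is invented.
payload = {'data': {'pricing': {'METER': {'price': 28.0}}}}

def price_for(payload, size):
    # Catching KeyError explicitly documents the one failure we expect:
    # this fabric size may simply be absent from the pricing dict.
    try:
        return payload['data']['pricing'][size]['price']
    except KeyError:
        return 'N/A'

print(price_for(payload, 'METER'))              # 28.0
print(price_for(payload, 'TEST_SWATCH_METER'))  # N/A
```

A bare except would also swallow unrelated errors such as a TypeError from a malformed payload, hiding real bugs; naming KeyError keeps those visible.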
https://codereview.stackexchange.com/questions/263221