
Why is my simple Python web scraper running so slowly?

Stack Overflow user
Asked on 2020-07-31 18:35:21
2 answers · 171 views · 0 followers · Score 0

I am trying to scrape about 34,000 pages. I timed the requests and found that each page takes more than 5 seconds on average. Since I am fetching the data directly from the API, I only used the requests package. Is there any way to speed up my scraper? Or, if that is not possible, how can I deploy the scraper to a server?

Here is some of my code:

# Scraping sellers on shopee.co.id via the public API with requests
# Crawl one seller -> Crawl all sellers in the list
# Sample URL: https://shopee.co.id/shop/38281755/search
# Sample API: https://shopee.co.id/api/v2/shop/get?shopid=38281755
import pandas as pd
import requests
import json
from datetime import datetime
import time

PATH_1 = '/Users/lixiangyi/FirstIntern/temp/seller_list.csv'
shop_list = pd.read_csv(PATH_1)
shop_ids = shop_list['shop'].tolist()
# print(seller_list)

# Downloading all APIs of shopee sellers:
api_links = []  # APIs of shops
item_links = []  # Links to click into
for shop_id in shop_ids:
    api_links.append('https://shopee.co.id/api/v2/shop/get?shopid=' + str(shop_id))
    item_links.append(
        f'https://shopee.co.id/api/v2/search_items/?by=pop&limit=10&match_id={shop_id}&newest=0&order=desc&page_type=shop&version=2'
    )
# print(api_links)


shop_names = []
shopid_list = []
founded_time = []
descriptions = []
i = 1

for api_link in api_links[0:100]:
    start_time = time.time()
    shop_info = requests.get(api_link)
    shopid_list.append(shop_info.text)
    print(i)
    i += 1
    end_time = time.time()
    print(end_time - start_time)

2 Answers

Stack Overflow user

Answered on 2020-07-31 19:37:24

You should try using threads or the aiohttp package to retrieve multiple URLs in parallel. Using threads:

Update

Since all of your requests go to the same website, it is more efficient to retrieve them with a requests.Session object. However, no matter how you retrieve these URLs, issuing too many requests to the same site from the same IP address in a short period of time may be interpreted as a denial-of-service attack.

import requests
from concurrent.futures import ThreadPoolExecutor
from functools import partial
import time

api_links = [] # this will have been filled in
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}

shopid_list = []

def retrieve_url(session, url):
    shop_info = session.get(url)
    return shop_info.text


NUM_THREADS = 75 # experiment with this value
with ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    with requests.Session() as session:
        session.headers = headers
        # session will be the first argument to retrieve_url:
        worker = partial(retrieve_url, session)
        start_time = time.time()
        for result in executor.map(worker, api_links):
            shopid_list.append(result)
        end_time = time.time()
        print(end_time - start_time)
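To address the denial-of-service concern mentioned above, the thread pool can be combined with a simple rate limiter so that no more than a fixed number of requests are started per second. The sketch below is illustrative: `RateLimiter` is a hypothetical helper (not part of requests or concurrent.futures), and `fetch` returns its argument as a stand-in for the real `session.get(url).text` call.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class RateLimiter:
    """Allow at most `rate` acquisitions per second across all threads."""
    def __init__(self, rate):
        self.interval = 1.0 / rate
        self.lock = threading.Lock()
        self.next_time = 0.0  # monotonic timestamp of the next free slot

    def acquire(self):
        # Reserve the next slot under the lock, then sleep outside it
        with self.lock:
            now = time.monotonic()
            wait = self.next_time - now
            self.next_time = max(now, self.next_time) + self.interval
        if wait > 0:
            time.sleep(wait)

limiter = RateLimiter(rate=20)  # cap at roughly 20 requests per second

def fetch(url):
    limiter.acquire()
    return url  # placeholder for session.get(url).text

urls = [f"u{i}" for i in range(40)]  # dummy URLs for the demo
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(fetch, urls))
```

Even with 8 workers, the 40 dummy fetches take about two seconds because the limiter spaces them out; in real use the rate should be tuned to what the target site tolerates.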
Score: 2

Stack Overflow user

Answered on 2020-07-31 18:41:16

Use Python's urllib:

import urllib.request
response = urllib.request.urlopen(some_url)  # some_url: the URL string to fetch
print(response.read())
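Note that on its own `urllib.request` is still one blocking call per URL, so switching from requests to urllib will not make the scraper faster; it too would need to be combined with threads to gain speed. A minimal sketch (with `urls` left empty as a placeholder for the real API links):

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Each call blocks until the full response body arrives
    with urllib.request.urlopen(url) as resp:
        return resp.read()

urls = []  # fill with the API links to retrieve
with ThreadPoolExecutor(max_workers=16) as executor:
    pages = list(executor.map(fetch, urls))
```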
Score: 0
The original page content was provided by Stack Overflow; translation was supported by Tencent Cloud's IT-domain engine.
Original link: https://stackoverflow.com/questions/63190244
