How can I download multiple links at the same time? My script below works, but it only downloads one file at a time and is very slow. I can't figure out how to add multithreading to it.
Python script:
from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re

print("downloading and parsing Bibles...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
    url = link.get('href')
    name = urlparse.urlparse(url).path.split('/')[-1]
    dirname = urlparse.urlparse(url).path.split('.')[-1]
    f = urllib2.urlopen(url)
    s = f.read()
    if not os.path.isdir(dirname):
        os.mkdir(dirname)
    soup = BeautifulSoup(s)
    articleTag = soup.html.body.article
    converted = str(articleTag)
    full_path = os.path.join(dirname, name)
    open(full_path, 'w').write(converted)
    print(name)
The file named links.html:
<a href="http://www.youversion.com/bible/gen.1.nmv-fas">http://www.youversion.com/bible/gen.1.nmv-fas</a>
<a href="http://www.youversion.com/bible/gen.2.nmv-fas">http://www.youversion.com/bible/gen.2.nmv-fas</a>
<a href="http://www.youversion.com/bible/gen.3.nmv-fas">http://www.youversion.com/bible/gen.3.nmv-fas</a>
<a href="http://www.youversion.com/bible/gen.4.nmv-fas">http://www.youversion.com/bible/gen.4.nmv-fas</a>
Posted on 2012-05-09 04:14:03
This looks to me like a producer-consumer problem -- see Wikipedia.
You could use something like this:
from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import urllib2
import os
import Queue, thread

# create a Queue.Queue here
queue = Queue.Queue()

print("downloading and parsing Bibles...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
    url = link.get('href')
    queue.put(url)  # produce

def worker():
    while not queue.empty():
        url = queue.get()  # consume
        name = urlparse.urlparse(url).path.split('/')[-1]
        dirname = urlparse.urlparse(url).path.split('.')[-1]
        f = urllib2.urlopen(url)
        s = f.read()
        if not os.path.isdir(dirname):
            os.mkdir(dirname)
        soup = BeautifulSoup(s)
        articleTag = soup.html.body.article
        converted = str(articleTag)
        full_path = os.path.join(dirname, name)
        open(full_path, 'wb').write(converted)
        print(name)
        queue.task_done()

for _ in range(4):
    thread.start_new(worker, ())  # start 4 threads
queue.join()  # wait until every queued URL has been processed
Posted on 2012-05-09 04:27:04
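For reference, the same producer-consumer pattern written with the Python 3 `queue` and `threading` modules, which replaced `Queue` and `thread`. This is a minimal sketch: `fetch` here is a hypothetical stand-in for the real download-and-parse step, and `None` sentinels tell each worker when to stop.

```python
import queue
import threading

def fetch(url):
    # Hypothetical stand-in for the real work; in practice this would call
    # urllib.request.urlopen(url).read() and parse/save the response.
    return "content of %s" % url

def worker(q, results, lock):
    while True:
        url = q.get()        # consume
        if url is None:      # sentinel: no more work for this thread
            q.task_done()
            break
        data = fetch(url)
        with lock:           # dict writes guarded for clarity
            results[url] = data
        q.task_done()

def download_all(urls, num_threads=4):
    q = queue.Queue()
    results = {}
    lock = threading.Lock()
    threads = [threading.Thread(target=worker, args=(q, results, lock))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for url in urls:
        q.put(url)           # produce
    for _ in threads:
        q.put(None)          # one sentinel per worker
    q.join()                 # wait until every item has been processed
    for t in threads:
        t.join()
    return results

urls = ["http://example.com/%d" % i for i in range(8)]
print(sorted(download_all(urls)))
```

The sentinel-plus-`join` shape avoids the race inherent in polling `queue.empty()` from multiple threads.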
I use multiprocessing for parallelization -- for some reason I prefer it to threading:
from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re
import multiprocessing

print("downloading and parsing Bibles...")

def download_stuff(url):
    name = urlparse.urlparse(url).path.split('/')[-1]
    dirname = urlparse.urlparse(url).path.split('.')[-1]
    f = urllib2.urlopen(url)
    s = f.read()
    if not os.path.isdir(dirname):
        os.mkdir(dirname)
    soup = BeautifulSoup(s)
    articleTag = soup.html.body.article
    converted = str(articleTag)
    full_path = os.path.join(dirname, name)
    open(full_path, 'w').write(converted)
    print(name)

root = html.parse(open('links.html'))
# map over plain URL strings: lxml elements cannot be pickled across processes
links = [link.get('href') for link in root.findall('//a')]
pool = multiprocessing.Pool(processes=5)  # use 5 processes to download the data
output = pool.map(download_stuff, links)  # output is a list of [None, None, ...] since download_stuff doesn't return anything
Posted on 2017-12-30 19:30:34
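Since downloading is I/O-bound rather than CPU-bound, the same `Pool` API is also available backed by threads via `multiprocessing.dummy`, which avoids the pickling restriction entirely. A minimal sketch, where `fetch` is a hypothetical stand-in for the real download function:

```python
from multiprocessing.dummy import Pool  # same API as multiprocessing.Pool, but uses threads

def fetch(url):
    # Hypothetical stand-in for urllib2.urlopen(url).read();
    # here it just returns the length of the URL string.
    return len(url)

urls = ["http://example.com/a", "http://example.com/bb"]
pool = Pool(5)                 # 5 worker threads
sizes = pool.map(fetch, urls)  # results come back in input order
pool.close()
pool.join()
print(sizes)
```

Threads share memory, so this variant can also return parsed objects (like BeautifulSoup trees) that processes could not send back.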
In 2017 there are other options now, such as asyncio and ThreadPoolExecutor.
Here is a ThreadPoolExecutor example (part of Python's concurrent.futures module):
from concurrent.futures import ThreadPoolExecutor

def download(url, filename):
    ... your download function ...
    pass

with ThreadPoolExecutor(max_workers=12) as executor:
    future = executor.submit(download, url, filename)
    print(future.result())
The submit() function submits the task to a queue (the queue management is already done for you).
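To fan out over many URLs rather than a single `submit()` call, `executor.map` applies the function across an iterable and yields results in input order. A short sketch, with a hypothetical `fetch` standing in for the real download function:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Hypothetical stand-in for the real download function.
    return url.upper()

urls = ["http://example.com/a", "http://example.com/b"]
with ThreadPoolExecutor(max_workers=12) as executor:
    results = list(executor.map(fetch, urls))  # results keep input order
print(results)
```

The `with` block waits for all submitted work to finish before the executor shuts down.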
Python version 3.5 and above:
    if max_workers is None or not given, it will default to the number of processors on the machine, multiplied by 5.
You can set max_workers to several times the number of CPU cores in practice; depending on context-switching overhead, run some tests to see how far you can raise it.
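The answer mentions asyncio without an example; one way to use it with a blocking download function is `asyncio.to_thread` (Python 3.9+), which runs each call on the default thread pool and lets `asyncio.gather` collect them concurrently. A minimal sketch, with `fetch` as a hypothetical stand-in for the real download:

```python
import asyncio

def fetch(url):
    # Blocking, hypothetical stand-in; in real code this would be
    # urllib.request.urlopen(url).read().
    return "data:" + url

async def main(urls):
    # Schedule every blocking fetch on the default thread pool,
    # then await them all concurrently.
    tasks = [asyncio.to_thread(fetch, u) for u in urls]
    return await asyncio.gather(*tasks)

urls = ["http://example.com/1", "http://example.com/2"]
results = asyncio.run(main(urls))
print(results)
```

With an async-native HTTP client the thread-pool hop would be unnecessary, but this shape works with unmodified blocking code.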
For more information, see: https://docs.python.org/3/library/concurrent.futures.html
https://stackoverflow.com/questions/10505654