How can I download multiple links at the same time? My script below works, but it only downloads one file at a time and is very slow. I can't figure out how to add multithreading to it.
Python script:
from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re

print("downloading and parsing Bibles...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
    url = link.get('href')
    name = urlparse.urlparse(url).path.split('/')[-1]
    dirname = urlparse.urlparse(url).path.split('.')[-1]
    f = urllib2.urlopen(url)
    s = f.read()
    if not os.path.isdir(dirname):
        os.mkdir(dirname)
    soup = BeautifulSoup(s)
    articleTag = soup.html.body.article
    converted = str(articleTag)
    full_path = os.path.join(dirname, name)
    open(full_path, 'w').write(converted)
    print(name)
The file named links.html:
<a href="http://www.youversion.com/bible/gen.1.nmv-fas">http://www.youversion.com/bible/gen.1.nmv-fas</a>
<a href="http://www.youversion.com/bible/gen.2.nmv-fas">http://www.youversion.com/bible/gen.2.nmv-fas</a>
<a href="http://www.youversion.com/bible/gen.3.nmv-fas">http://www.youversion.com/bible/gen.3.nmv-fas</a>
<a href="http://www.youversion.com/bible/gen.4.nmv-fas">http://www.youversion.com/bible/gen.4.nmv-fas</a>
Posted on 2012-05-09 04:14:03
This looks to me like a producer-consumer problem -- see Wikipedia.
You could use something like this:
from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import urllib2
import os
import Queue, thread

# create a Queue.Queue here
queue = Queue.Queue()

print("downloading and parsing Bibles...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
    url = link.get('href')
    queue.put(url)  # produce

def worker():
    while not queue.empty():
        url = queue.get()  # consume
        name = urlparse.urlparse(url).path.split('/')[-1]
        dirname = urlparse.urlparse(url).path.split('.')[-1]
        f = urllib2.urlopen(url)
        s = f.read()
        if not os.path.isdir(dirname):
            os.mkdir(dirname)
        soup = BeautifulSoup(s)
        articleTag = soup.html.body.article
        converted = str(articleTag)
        full_path = os.path.join(dirname, name)
        open(full_path, 'wb').write(converted)
        print(name)
        queue.task_done()

for _ in range(4):
    thread.start_new(worker, ())  # start 4 threads
queue.join()  # wait until every queued URL has been processed
Posted on 2012-05-09 04:27:04
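For reference, the same producer-consumer pattern written with the Python 3 `queue` and `threading` modules, which replaced `Queue` and `thread`. This is a minimal sketch: `fetch` here is a hypothetical stand-in for the real download-and-parse step, and `None` sentinels tell each worker when to stop.

```python
import queue
import threading

def fetch(url):
    # Hypothetical stand-in for the real work; in practice this would call
    # urllib.request.urlopen(url).read() and parse/save the response.
    return "content of %s" % url

def worker(q, results, lock):
    while True:
        url = q.get()        # consume
        if url is None:      # sentinel: no more work for this thread
            q.task_done()
            break
        data = fetch(url)
        with lock:           # dict writes guarded for clarity
            results[url] = data
        q.task_done()

def download_all(urls, num_threads=4):
    q = queue.Queue()
    results = {}
    lock = threading.Lock()
    threads = [threading.Thread(target=worker, args=(q, results, lock))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for url in urls:
        q.put(url)           # produce
    for _ in threads:
        q.put(None)          # one sentinel per worker
    q.join()                 # wait until every item has been processed
    for t in threads:
        t.join()
    return results

urls = ["http://example.com/%d" % i for i in range(8)]
print(sorted(download_all(urls)))
```

The sentinel-plus-`join` shape avoids the race inherent in polling `queue.empty()` from multiple threads.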
I use multiprocessing for parallelization -- for some reason I prefer it to threading:
from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re
import multiprocessing

print("downloading and parsing Bibles...")

def download_stuff(url):
    name = urlparse.urlparse(url).path.split('/')[-1]
    dirname = urlparse.urlparse(url).path.split('.')[-1]
    f = urllib2.urlopen(url)
    s = f.read()
    if not os.path.isdir(dirname):
        os.mkdir(dirname)
    soup = BeautifulSoup(s)
    articleTag = soup.html.body.article
    converted = str(articleTag)
    full_path = os.path.join(dirname, name)
    open(full_path, 'w').write(converted)
    print(name)

root = html.parse(open('links.html'))
# map over plain URL strings: lxml elements cannot be pickled across processes
links = [link.get('href') for link in root.findall('//a')]
pool = multiprocessing.Pool(processes=5)  # use 5 processes to download the data
output = pool.map(download_stuff, links)  # output is a list of [None, None, ...] since download_stuff doesn't return anything
Posted on 2017-12-30 19:30:34
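Since downloading is I/O-bound rather than CPU-bound, the same `Pool` API is also available backed by threads via `multiprocessing.dummy`, which avoids the pickling restriction entirely. A minimal sketch, where `fetch` is a hypothetical stand-in for the real download function:

```python
from multiprocessing.dummy import Pool  # same API as multiprocessing.Pool, but uses threads

def fetch(url):
    # Hypothetical stand-in for urllib2.urlopen(url).read();
    # here it just returns the length of the URL string.
    return len(url)

urls = ["http://example.com/a", "http://example.com/bb"]
pool = Pool(5)                 # 5 worker threads
sizes = pool.map(fetch, urls)  # results come back in input order
pool.close()
pool.join()
print(sizes)
```

Threads share memory, so this variant can also return parsed objects (like BeautifulSoup trees) that processes could not send back.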
In 2017 there are other options now, such as asyncio and ThreadPoolExecutor.
Here is a ThreadPoolExecutor example (part of Python's concurrent.futures module):
from concurrent.futures import ThreadPoolExecutor

def download(url, filename):
    ... your download function ...
    pass

with ThreadPoolExecutor(max_workers=12) as executor:
    future = executor.submit(download, url, filename)
    print(future.result())
The submit() function submits the task to a queue (the queue management is already done for you).
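To fan out over many URLs rather than a single `submit()` call, `executor.map` applies the function across an iterable and yields results in input order. A short sketch, with a hypothetical `fetch` standing in for the real download function:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Hypothetical stand-in for the real download function.
    return url.upper()

urls = ["http://example.com/a", "http://example.com/b"]
with ThreadPoolExecutor(max_workers=12) as executor:
    results = list(executor.map(fetch, urls))  # results keep input order
print(results)
```

The `with` block waits for all submitted work to finish before the executor shuts down.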
Python version 3.5 and above:
    if max_workers is None or not given, it will default to the number of processors on the machine, multiplied by 5.
You can set max_workers to several times the number of CPU cores in practice; depending on context-switching overhead, run some tests to see how far you can raise it.
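The answer mentions asyncio without an example; one way to use it with a blocking download function is `asyncio.to_thread` (Python 3.9+), which runs each call on the default thread pool and lets `asyncio.gather` collect them concurrently. A minimal sketch, with `fetch` as a hypothetical stand-in for the real download:

```python
import asyncio

def fetch(url):
    # Blocking, hypothetical stand-in; in real code this would be
    # urllib.request.urlopen(url).read().
    return "data:" + url

async def main(urls):
    # Schedule every blocking fetch on the default thread pool,
    # then await them all concurrently.
    tasks = [asyncio.to_thread(fetch, u) for u in urls]
    return await asyncio.gather(*tasks)

urls = ["http://example.com/1", "http://example.com/2"]
results = asyncio.run(main(urls))
print(results)
```

With an async-native HTTP client the thread-pool hop would be unnecessary, but this shape works with unmodified blocking code.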
For more information, see: https://docs.python.org/3/library/concurrent.futures.html
https://stackoverflow.com/questions/10505654