Q: How do I parallelize with joblib?

Stack Overflow user

Asked 2019-05-02 16:13:28

Answers: 1 | Views: 591 | Followers: 0 | Votes: 2

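So I am using scipy to filter large images, and I would like to parallelize the different filterings applied to a single image. For the parallelization I wanted to use joblib. However, I am puzzled by two results:

With the multiprocessing backend, the task is much slower (1.5x).
With the threading backend, the task is faster (about 25% faster).

I am surprised by both results, because I was sure that the convolution was CPU-bound.

Here is the code I used in a jupyter notebook to compute the runtimes: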

from joblib import Parallel, delayed
import numpy as np
from scipy.signal import fftconvolve

im_size = (512, 512)
filter_size = tuple(s-1 for s in im_size)
n_filters = 3
image = np.random.rand(*im_size)
filters = [np.random.rand(*filter_size) for i in range(n_filters)]

%%timeit
s = np.sum(
    Parallel(n_jobs=n_filters, backend='multiprocessing')(
        delayed(fftconvolve)(image, f) for f in filters
    )
)

283 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
s = np.sum(
    Parallel(n_jobs=n_filters, backend='threading')(
        delayed(fftconvolve)(image, f) for f in filters
    )
)

142 ms ± 15.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
s = np.sum([fftconvolve(image, f) for f in filters])

198 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

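I have also tried different things, such as putting the image in a memmap or reducing the number of pre-dispatched jobs, but nothing fundamentally changed the results.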
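For reference, the two variations mentioned above could look roughly like the sketch below. It is only an illustration under assumptions: the helper name filter_from_memmap, the file name image.joblib, the small array sizes, and the use of np.dot as a cheap stand-in for fftconvolve are all hypothetical and are not taken from the original notebook.

```python
import os
import tempfile

import numpy as np
from joblib import Parallel, delayed, dump, load


def filter_from_memmap(image, kernels, n_jobs=2):
    """Apply each kernel to a memory-mapped copy of `image` in parallel."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "image.joblib")  # hypothetical file name
        dump(image, path)
        # Re-open the dumped array as a read-only np.memmap so that worker
        # processes can map the file instead of receiving a pickled copy.
        image_mm = load(path, mmap_mode="r")
        is_memmap = isinstance(image_mm, np.memmap)
        # pre_dispatch limits how many tasks are queued ahead of the workers.
        results = Parallel(n_jobs=n_jobs, pre_dispatch=n_jobs)(
            delayed(np.dot)(image_mm, k) for k in kernels
        )
        # Copy the results and drop the memmap before the directory is removed.
        results = [np.array(r) for r in results]
        del image_mm
    return is_memmap, results


rng = np.random.default_rng(0)
image = rng.random((64, 64))
kernels = [rng.random(64) for _ in range(3)]
is_memmap, results = filter_from_memmap(image, kernels)
print(is_memmap, [r.shape for r in results])
```

Neither option removes the cost of sending the results back from the workers, which may be one reason why such changes have little effect on the measured runtime.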

Why does multiprocessing not speed up the computation when multithreading does?

convolution

joblib

python

scipy

multiprocessing

1 Answer

Stack Overflow user

Accepted answer

Posted 2019-05-07 14:51:53

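The problem with benchmarking parallel processing is that you have to properly account for the overheads incurred by your code in order to draw the right conclusions. There are three sources of overhead when using parallel processing: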

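Spawning threads or processes: this has to be done each time you call Parallel, unless you rely on a managed Parallel object (used in a with context) or use the loky backend. The original answer linked to more information here.
Importing modules in the new interpreters: for backends that rely on fresh processes (when the start method is not fork), all modules need to be re-imported. This causes overhead.
Communication between processes: when using processes (so not with backend=threading), the arrays have to be communicated to each worker. This communication can slow down the computation, especially for short tasks with large inputs such as fftconvolve.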
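The first overhead can be made visible with a small sketch like the one below, which reuses one managed Parallel object for several calls and times each call separately. The helper name timed_calls and the use of np.sum as a trivial task are assumptions made for illustration. The threading backend is chosen here only so that the snippet runs quickly anywhere; passing backend='multiprocessing' or backend='loky' is where the first call would be expected to absorb the process start-up cost.

```python
from time import perf_counter

import numpy as np
from joblib import Parallel, delayed


def timed_calls(backend, n_calls=3, n_jobs=2):
    """Return the last results and the wall time of each call made through
    one managed Parallel object (workers are kept alive between calls)."""
    arrays = [np.full(4, i, dtype=float) for i in range(3)]
    times = []
    with Parallel(n_jobs=n_jobs, backend=backend) as parallel:
        for _ in range(n_calls):
            t0 = perf_counter()
            totals = parallel(delayed(np.sum)(a) for a in arrays)
            times.append(perf_counter() - t0)
    return totals, times


totals, times = timed_calls("threading")
for i, t in enumerate(times):
    print(f"call {i}: {t * 1e3:.3f} ms")
```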

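If your goal is to call this function many times, you should modify your benchmark so that it actually excludes the cost of spawning the workers for the Parallel object, either by using a managed Parallel object or by relying on the worker reuse of backend=loky. You should also avoid the overhead due to loading the modules: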

from joblib import Parallel, delayed
import numpy as np
from scipy.signal import fftconvolve

from time import time, sleep


def start_processes(im, filter, mode=None, delay=0):
    sleep(delay)
    return im if im is not None else 0


def time_parallel(name, parallel, image, filters, n_rep=50):
        print(80*"=" + "\n" + name + "\n" + 80*"=")

        # Time to start the pool of workers and initialize the processes
        # With this first call, the processes/threads are actually started
        # and further calls will not incur this overhead anymore
        t0 = time()
        np.sum(parallel(
            delayed(start_processes)(image, f, mode='valid') for f in filters)
        )
        print(f"Pool init overhead: {(time() - t0) / 1e-3:.3f}ms")

        # Time the overhead due to loading of the scipy module
        # With this call, the scipy.signal module is loaded in the child
        # processes. This import can take up to 200ms for fresh interpreter.
        # This overhead is only present for the `loky` backend. For the
        # `multiprocessing` backend, as the processes are started with `fork`,
        # they already have a loaded scipy module. For the `threading` backend
        # and the iterative run, there is no need to re-import the module so
        # this overhead is non-existent
        t0 = time()
        np.sum(parallel(
            delayed(fftconvolve)(image, f, mode='valid') for f in filters)
        )
        print(f"Library load overhead: {(time() - t0) / 1e-3:.3f}ms")

        # Average the runtime over multiple runs, once the external overheads
        # have been taken into account.
        times = []
        for _ in range(n_rep):
            t0 = time()
            np.sum(parallel(
                delayed(fftconvolve)(image, f, mode='valid') for f in filters
            ))
            times.append(time() - t0)
        print(f"Runtime without init overhead: {np.mean(times) / 1e-3:.3f}ms,"
              f" (+-{np.std(times) / 1e-3:.3f}ms)\n")


# Setup the problem size
im_size = (512, 512)
filter_size = tuple(5 for s in im_size)
n_filters = 3
n_jobs = 3
n_rep = 50

# Generate random data
image = np.random.rand(*im_size)
filters = np.random.rand(n_filters, *filter_size)


# Time the `backend='multiprocessing'`
with Parallel(n_jobs=n_jobs, backend='multiprocessing') as parallel:
    time_parallel("Multiprocessing", parallel, image, filters, n_rep=n_rep)
sleep(.5)

# Time the `backend='threading'`
with Parallel(n_jobs=n_jobs, backend='threading') as parallel:
    time_parallel("Threading", parallel, image, filters, n_rep=n_rep)

sleep(.5)


# Time the `backend='loky'`.
# For this backend, there is no need to rely on a managed `Parallel` object
# as loky reuses the previously created pool by default. We will thus mimic
# the creation of a new `Parallel` object for each repetition
def parallel_loky(it):
    # Return the results so that `np.sum` in `time_parallel` receives them
    return Parallel(n_jobs=n_jobs)(it)


time_parallel("Loky", parallel_loky, image, filters, n_rep=n_rep)
sleep(.5)


# Time the iterative run.
# We rely on the SequentialBackend of joblib which is used whenever `n_jobs=1`
# to allow using the same function. This should not change the computation
# much.
def parallel_iterative(it):
    # Return the results so that `np.sum` in `time_parallel` receives them
    return Parallel(n_jobs=1)(it)


time_parallel("Iterative", parallel_iterative, image, filters, n_rep=n_rep)

$ python main.py 
================================================================================
Multiprocessing
================================================================================
Pool init overhead: 12.112ms
Library load overhead: 96.520ms
Runtime without init overhead: 77.548ms (+-16.119ms)

================================================================================
Threading
================================================================================
Pool init overhead: 11.887ms
Library load overhead: 76.858ms
Runtime without init overhead: 31.931ms (+-3.569ms)

================================================================================
Loky
================================================================================
Pool init overhead: 502.369ms
Library load overhead: 245.368ms
Runtime without init overhead: 44.808ms (+-4.074ms)

================================================================================
Iterative
================================================================================
Pool init overhead: 1.048ms
Library load overhead: 92.595ms
Runtime without init overhead: 47.749ms (+-4.081ms)

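With this benchmark, you can see that once the loky backend has been started, using it is actually faster than both the multiprocessing backend and the iterative run. But if you do not call it many times, the start-up overhead is too large.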
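To put a rough number on "too large", the timings printed above can be plugged into a back-of-the-envelope calculation. This is only a sketch: it treats the two one-off measurements as pure start-up cost (the second one also includes one actual fftconvolve run), and the variable names are made up for illustration.

```python
import math

# Timings in milliseconds, copied from the benchmark output above.
loky = {"pool_init": 502.369, "library_load": 245.368, "per_call": 44.808}
iterative = {"pool_init": 1.048, "library_load": 92.595, "per_call": 47.749}


def one_off(timings):
    """One-off cost paid before the steady-state loop starts."""
    return timings["pool_init"] + timings["library_load"]


extra_startup = one_off(loky) - one_off(iterative)
saving_per_call = iterative["per_call"] - loky["per_call"]
break_even_calls = math.ceil(extra_startup / saving_per_call)

print(f"Extra start-up cost of loky: {extra_startup:.1f} ms")
print(f"Saving per call: {saving_per_call:.3f} ms")
print(f"Calls needed to break even: {break_even_calls}")
```

With these numbers, loky only pays off against the iterative run after a couple of hundred calls. The per-call difference is also smaller than the reported standard deviations, so the exact count should not be over-interpreted.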

Votes: 2

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/55956447
