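I am trying to write a hybrid OpenMP/MPI program, so I want to understand how the number of OpenMP threads relates to the number of MPI processes. To that end, I wrote a small test program: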
#include <iostream>
#include <mpi.h>
#include <thread>
#include <sstream>
#include <omp.h>
int main(int args, char *argv[]) {
    int rank, nprocs, thread_id, nthreads, cxx_procs;
    MPI_Init(&args, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
#pragma omp parallel private(thread_id, nthreads, cxx_procs)
    {
        thread_id = omp_get_thread_num();
        nthreads = omp_get_num_threads();
        cxx_procs = std::thread::hardware_concurrency();
        std::stringstream omp_stream;
        omp_stream << "I'm thread " << thread_id
                   << " out of " << nthreads
                   << " on MPI process nr. " << rank
                   << " out of " << nprocs
                   << ", while hardware_concurrency reports " << cxx_procs
                   << " processors\n";
        std::cout << omp_stream.str();
    }
    MPI_Finalize();
    return 0;
}
It is compiled using
mpicxx -fopenmp -std=c++17 -o omp_mpi source/main.cpp -lgomp
with gcc-9.3.1 and OpenMPI 3. Now, when I execute it with ./omp_mpi on an i7-6700 (4c/8t), I get the following output:
I'm thread 1 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
i.e. as expected.
When executing it with mpirun -n 1 omp_mpi I would expect the same, but instead I get
I'm thread 0 out of 2 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 1 out of 2 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
Where are the other threads? When executing it on two MPI processes instead, I get
I'm thread 0 out of 2 on MPI process nr. 1 out of 2, while hardware_concurrency reports 8 processors
I'm thread 1 out of 2 on MPI process nr. 1 out of 2, while hardware_concurrency reports 8 processors
I'm thread 0 out of 2 on MPI process nr. 0 out of 2, while hardware_concurrency reports 8 processors
I'm thread 1 out of 2 on MPI process nr. 0 out of 2, while hardware_concurrency reports 8 processors
i.e. still only two OpenMP threads per process, but when executing it on four MPI processes, I get
I'm thread 1 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 1 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 1 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 1 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
Now I suddenly get 8 OpenMP threads per MPI process. Where does this change come from?
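Answered 2021-02-19 11:08:22
You are observing an interaction between a peculiarity of Open MPI and the GNU OpenMP runtime libgomp.
First, the number of threads in OpenMP is controlled by the nthreads-var ICV (internal control variable), and the way to set it is either to call omp_set_num_threads() or to set OMP_NUM_THREADS in the environment. When OMP_NUM_THREADS is not set and omp_set_num_threads() is never called, the runtime is free to choose whatever it deems a reasonable default. In the case of libgomp, the manual says:
OMP_NUM_THREADS: Specifies the default number of threads to use in parallel regions. The value of this variable shall be a comma-separated list of positive integers; the value specifies the number of threads to use for the corresponding nested level. Specifying more than one item in the list will automatically enable nesting by default. If undefined, one thread per CPU is used.
What it fails to mention is that it uses various heuristics to determine the right number of CPUs. On Linux, the process affinity mask is used for that (if you like reading code, the Linux-specific implementation is in the libgomp sources). If you bind the process to a single logical CPU, you get only one thread: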
$ taskset -c 0 ./omp_mpi
I'm thread 0 out of 1 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors
If you bind it to several logical CPUs, their count is used:
$ taskset -c 0,2,5 ./ompi_mpi
I'm thread 0 out of 3 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors
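I'm thread 1 out of 3 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors
This libgomp-specific behaviour interacts with another behaviour that is specific to Open MPI. Back in 2013, Open MPI changed its default binding policy. The reasons were a mix of technical and political ones, and you can read more on Jeff Squyres' blog (Jeff is a core Open MPI developer).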
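The counting that the taskset runs above demonstrate can be reproduced with a few lines of Linux-specific code. This is only a minimal sketch of the idea, not libgomp's actual implementation; the function names count_cpus and cpus_available_to_process are made up for illustration, and it assumes glibc's sched_getaffinity and the CPU_* macros are available.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for the CPU_* macros; g++ usually defines it already
#endif
#include <sched.h>   // sched_getaffinity, cpu_set_t, CPU_ZERO, CPU_COUNT (Linux/glibc)

// Hypothetical helper: number of logical CPUs enabled in an affinity mask.
int count_cpus(const cpu_set_t& mask) {
    return CPU_COUNT(&mask);
}

// Hypothetical helper: number of logical CPUs the calling process is allowed
// to run on, or -1 if the affinity mask cannot be queried.
int cpus_available_to_process() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    // pid 0 means "the calling thread"
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        return -1;
    }
    return count_cpus(mask);
}
```

Printing cpus_available_to_process() next to std::thread::hardware_concurrency() in the test program would make the effect of taskset or of an MPI launcher's binding visible directly, independently of what the OpenMP runtime decides.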
The moral of the story is:
Always set the number of OpenMP threads and the MPI binding policy explicitly.
$ mpiexec -n 2 --map-by node:PE=3 --bind-to core -x OMP_NUM_THREADS=3 ./ompi_mpi
I'm thread 0 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 0 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
Note that I have hyperthreading enabled, so --bind-to core and --bind-to hwthread produce different results when OMP_NUM_THREADS is not set explicitly:
mpiexec -n 2 --map-by node:PE=3 --bind-to core ./ompi_mpi
I'm thread 0 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 3 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 5 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 0 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 5 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 4 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 3 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 4 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
vs
mpiexec -n 2 --map-by node:PE=3 --bind-to hwthread ./ompi_mpi
I'm thread 0 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 0 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
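I'm thread 1 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
--map-by node:PE=3 gives each MPI rank three processing elements (PEs) per node. When binding to core, a PE is a core. When binding to hardware threads, a PE is a hardware thread, and one should use --map-by node:PE=#cores*#threads, i.e. --map-by node:PE=6 in my case.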
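The PE arithmetic can be written down as a tiny function. This is just a sketch of the reasoning above, under the assumption that the OpenMP runtime uses one thread per logical CPU in the affinity mask when OMP_NUM_THREADS is unset; BindTarget and logical_cpus_per_rank are illustrative names, not part of any MPI or OpenMP API.

```cpp
// What a processing element (PE) stands for under the chosen binding.
enum class BindTarget { Core, HwThread };

// Logical CPUs that end up in one rank's affinity mask for
// --map-by node:PE=<pe> combined with --bind-to core or --bind-to hwthread.
// hw_threads_per_core is 2 on a machine with hyperthreading enabled.
int logical_cpus_per_rank(int pe, int hw_threads_per_core, BindTarget target) {
    if (target == BindTarget::Core) {
        // Each PE is a whole core, so all of its hardware threads are included.
        return pe * hw_threads_per_core;
    }
    // Each PE is a single hardware thread.
    return pe;
}
```

With PE=3 and two hardware threads per core this gives 6 for --bind-to core and 3 for --bind-to hwthread, which matches the thread counts in the two runs above.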
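Whether the OpenMP runtime respects the affinity mask set by MPI, whether it maps its own thread affinity onto it, and what to do if it does not, is a completely different story.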
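If you prefer to enforce the "always set the thread count explicitly" rule inside the program rather than on the command line, one option is to resolve the count yourself and hand it to omp_set_num_threads(). The helper below is a hypothetical sketch: first_level_threads is an invented name, and it only looks at the first item of the comma-separated list described in the libgomp manual, falling back to a value you choose when the variable is unset or malformed.

```cpp
#include <climits>  // INT_MAX
#include <cstdlib>  // std::strtol

// Hypothetical helper: return the thread count for the outermost nesting level
// from an OMP_NUM_THREADS-style string ("4" or "4,2"), or `fallback` if the
// string is null, not a positive integer, or followed by unexpected characters.
int first_level_threads(const char* value, int fallback) {
    if (value == nullptr) {
        return fallback;
    }
    char* end = nullptr;
    const long n = std::strtol(value, &end, 10);
    if (end == value || n <= 0 || n > INT_MAX) {
        return fallback;  // no digits parsed, or value out of range
    }
    if (*end != '\0' && *end != ',') {
        return fallback;  // trailing garbage after the first number
    }
    return static_cast<int>(n);
}
```

A call such as omp_set_num_threads(first_level_threads(std::getenv("OMP_NUM_THREADS"), 3)) placed before the first parallel region would then pin the count to 3 whenever the environment does not say otherwise, regardless of the binding chosen by the launcher.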
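Answered 2021-02-18 17:36:15
The man page of mpirun explains:
If you are simply looking for how to run an MPI application, you probably want to use a command line of the following form:
% mpirun -np X <program>
This will run X copies of <program> in your current run-time environment (...)
Please note that mpirun automatically binds processes as of the start of the v1.8 series. Three binding patterns are used in the absence of any further directives:
Bind to core: when the number of processes is <= 2
Bind to socket: when the number of processes is > 2
Bind to none: when oversubscribed
If your application uses threads, then you probably want to ensure that you are either not bound at all (by specifying --bind-to none), or bound to multiple cores using an appropriate binding level or specific number of processing elements per application process.
Now, if you specify 1 or 2 MPI processes, mpirun defaults to --bind-to core, which results in 2 threads per MPI process. If, however, you specify 4 MPI processes, mpirun defaults to --bind-to socket and you get 8 threads per process, because your machine has a single socket. I tested it on a laptop (1s/2c/4t) and on a workstation (2 sockets, 12 cores per socket, 2 threads per core), and the program (without the np argument) behaves as described above: on the workstation there are 24 MPI processes with 24 OpenMP threads each.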
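The rule of thumb above can be turned into a small prediction function. This is a simplified model rather than Open MPI's actual logic: it assumes that "oversubscribed" means more processes than physical cores, that the OpenMP runtime starts one thread per logical CPU in the affinity mask, and that OMP_NUM_THREADS is unset. Topology and predicted_default_threads are names invented for this sketch.

```cpp
// Hypothetical description of a single node.
struct Topology {
    int sockets;
    int cores_per_socket;
    int threads_per_core;
};

// Predicted OpenMP threads per MPI process under mpirun's default binding
// (core for <= 2 processes, socket for > 2, none when oversubscribed).
int predicted_default_threads(const Topology& t, int nprocs) {
    const int total_cores = t.sockets * t.cores_per_socket;
    if (nprocs > total_cores) {
        // Assumed oversubscription: no binding, every logical CPU is visible.
        return total_cores * t.threads_per_core;
    }
    if (nprocs <= 2) {
        // Bound to one core: only that core's hardware threads are visible.
        return t.threads_per_core;
    }
    // Bound to one socket: all hardware threads of that socket are visible.
    return t.cores_per_socket * t.threads_per_core;
}
```

For the i7-6700 from the question (Topology{1, 4, 2}) this yields 2, 2 and 8 threads for 1, 2 and 4 processes, and for the workstation (Topology{2, 12, 2}) it yields 24 threads for 24 processes, in line with the observations.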
https://stackoverflow.com/questions/66262096