
MPI + InfiniBand: too many connections

Stack Overflow user
Asked 2014-10-26 18:19:47
2 answers, 1.8K views, 0 followers, 2 votes

I am running an MPI application on a cluster, using 4 nodes with 64 cores each. The application performs an all-to-all communication pattern.

Running the application as follows works fine:

$:mpirun -npernode 36 ./Application

Adding one more process per node makes the application crash:

$:mpirun -npernode 37 ./Application

--------------------------------------------------------------------------
A process failed to create a queue pair. This usually means either
the device has run out of queue pairs (too many connections) or
there are insufficient resources available to allocate a queue pair
(out of memory). The latter can happen if either 1) insufficient
memory is available, or 2) no more physical memory can be registered
with the device.

For more information on memory registration see the Open MPI FAQs at:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

Local host:             laser045
Local device:           qib0
Queue pair type:        Reliable connected (RC)
--------------------------------------------------------------------------
[laser045:15359] *** An error occurred in MPI_Issend
[laser045:15359] *** on communicator MPI_COMM_WORLD
[laser045:15359] *** MPI_ERR_OTHER: known error not in list
[laser045:15359] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
[laser040:49950] [[53382,0],0]->[[53382,1],30] mca_oob_tcp_msg_send_handler: writev failed: Connection reset by peer (104) [sd = 163]
[laser040:49950] [[53382,0],0]->[[53382,1],21] mca_oob_tcp_msg_send_handler: writev failed: Connection reset by peer (104) [sd = 154]
--------------------------------------------------------------------------
mpirun has exited due to process rank 128 with PID 15358 on
node laser045 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[laser040:49950] 4 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / ibv_create_qp failed
[laser040:49950] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[laser040:49950] 4 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal

Edit: added some source code for the all-to-all communication pattern:

// Send data to all other ranks
for(unsigned i = 0; i < (unsigned)size; ++i){
    if((unsigned)rank == i){
        continue;
    }

    MPI_Request request;
    MPI_Issend(&data, dataSize, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &request);
    requests.push_back(request);
}

// Recv data from all other ranks
for(unsigned i = 0; i < (unsigned)size; ++i){
    if((unsigned)rank == i){
       continue;
    }

    MPI_Status status;
    MPI_Recv(&recvData, recvDataSize, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &status);
}

// Finish communication operations
for(MPI_Request &r: requests){
    MPI_Status status;
    MPI_Wait(&r, &status);
}

Is there anything I can do as a cluster user, or any advice I can give the cluster admin?


2 Answers

Stack Overflow user

Accepted answer

Answered 2014-10-27 13:06:14

This error is related to the buffer sizes of the MPI message queues, as discussed here:

http://www.open-mpi.org/faq/?category=openfabrics#ib-xrc

The following environment setting solved my problem:

$ export OMPI_MCA_btl_openib_receive_queues="P,128,256,192,128:S,65536,256,192,128"

Votes: 2

Stack Overflow user

Answered 2014-10-31 14:49:09

The mca_oob_tcp_msg_send_handler error lines may indicate that the node hosting the receiving rank died (it ran out of memory or received a SIGSEGV):

http://www.open-mpi.org/faq/?category=tcp#tcp-connection-errors

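The OOB (out-of-band) framework in Open MPI is used for control messages, not for your application's messages. Application messages typically go through Byte Transfer Layers (BTLs) such as self, sm, vader, openib (InfiniBand), and so on.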

The output of "ompi_info -a" is useful in that regard.

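Finally, the question does not say that the InfiniBand hardware vendor is Mellanox, so the XRC option may not work (Intel hardware, for example, does not support this option).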

Votes: 3
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/26576329
