当我在集群上的512进程上运行这段简单的代码时,我的MPI代码会死锁。我离记忆的极限很远。如果我将进程的数量增加到2048 (对于这个问题来说太多了),代码就会再次运行。死锁发生在包含MPI_File_write_all的行中。
有什么建议吗?
int count = imax*jmax*kmax;
// CREATE THE SUBARRAY
MPI_Datatype subarray;
int totsize [3] = {kmax, jtot, itot};
int subsize [3] = {kmax, jmax, imax};
int substart[3] = {0, mpicoordy*jmax, mpicoordx*imax};
MPI_Type_create_subarray(3, totsize, subsize, substart, MPI_ORDER_C, MPI_DOUBLE, &subarray);
MPI_Type_commit(&subarray);
// SET THE VALUE OF THE GRID EQUAL TO THE PROCESS ID FOR CHECKING
if(mpiid == 0) std::printf("Setting the value of the array\n");
for(int i=0; i<count; i++)
u[i] = (double)mpiid;
// WRITE THE FULL GRID USING MPI-IO
if(mpiid == 0) std::printf("Write the full array to disk\n");
char filename[] = "u.dump";
MPI_File fh;
if(MPI_File_open(commxy, filename, MPI_MODE_CREATE | MPI_MODE_WRONLY | MPI_MODE_EXCL, MPI_INFO_NULL, &fh))
return 1;
// select noncontiguous part of 3d array to store the selected data
MPI_Offset fileoff = 0; // the offset within the file (header size)
char name[] = "native";
if(MPI_File_set_view(fh, fileoff, MPI_DOUBLE, subarray, name, MPI_INFO_NULL))
return 1;
if(MPI_File_write_all(fh, u, count, MPI_DOUBLE, MPI_STATUS_IGNORE))
return 1;
if(MPI_File_close(&fh))
return 1;发布于 2012-09-15 19:06:30
您的代码在快速检查时看起来很正确。我建议您让MPI库告诉您出了什么问题:为什么不至少显示错误呢?以下是一些可能有用的代码:
static void handle_error(int errcode, char *str)
{
char msg[MPI_MAX_ERROR_STRING];
int resultlen;
MPI_Error_string(errcode, msg, &resultlen);
fprintf(stderr, "%s: %s\n", str, msg);
MPI_Abort(MPI_COMM_WORLD, 1);
}MPI_SUCCESS保证为0吗?我宁愿看到
errcode = MPI_File_routine();
if (errcode != MPI_SUCCESS) handle_error(errcode, "MPI_File_open(1)");
把它放进去,如果你在做一些棘手的事情,比如设置一个带有偏移的文件视图,而这些偏移不是单调的、不递减的,那么错误字符串可能会提示出问题所在。
https://stackoverflow.com/questions/12438727
复制相似问题