We are running OGE 2011.11 on an SGI UV2000 (SMP) with 256 hyperthreaded cores (128 physical cores). When we run an OpenMP job directly on the system, it runs fine. Here is the job:
#include <iostream>
#include <cstring>
#include <cstdlib>
#include <cmath>
#include <new>
#include <omp.h>
using namespace std;

int main( int argc, char* argv[] ) {
#ifdef _OPENMP
    // Show how many threads we have available
    int max_t = omp_get_max_threads();
    cout << "OpenMP using up to " << max_t << " threads" << endl;
#else
    cout << "!!!ERROR!!! Program not compiled for OpenMP" << endl;
    return -1;
#endif
    const long N = 115166;
    const long bytesRequested = N * N * sizeof(double);
    cout << "Allocating " << bytesRequested << " bytes for matrix" << endl;
    // Plain new throws std::bad_alloc on failure; the nothrow form
    // returns NULL instead, so the check below is meaningful
    double* S = new (nothrow) double[ N * N ];
    if( NULL == S ) {
        cout << "!!!ERROR!!! Failed to allocate " << bytesRequested << " bytes" << endl;
        return -1;
    }
    cout << "Entering main loop" << endl;
    // Fill the upper triangle of the matrix in parallel
#pragma omp parallel for schedule(static)
    for ( long i = 0; i < N - 1; i++ ) {
        for ( long j = i + 1; j < N; j++ ) {
#ifdef _OPENMP
            // Report the team size once, from the first iteration only
            if( 0 == i && 1 == j ) {
                int nThreads = omp_get_num_threads();
                cout << "OpenMP loop using " << nThreads << " threads" << endl;
            }
#endif
            S[ i * N + j ] = sqrt( static_cast<double>( i + j ) );
        }
    }
    cout << "Loop completed" << endl;
    delete[] S;  // array form, to match new[]
    return 0;
}
Here is the output when it is executed:
$ ./loop-test
OpenMP using up to 256 threads
Allocating 106105660448 bytes for matrix
Entering main loop
OpenMP loop using 256 threads
Loop completed
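However, when I submit it to the queue using the following parallel environment (the one I have been using so far), the load on the CPUs skyrockets (far above 256), the system becomes completely unresponsive, and it has to be power cycled. Here is my PE configuration: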
$ qconf -sp threaded
pe_name            threaded
slots              10000
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary TRUE
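I have changed control_slaves, job_is_first_task, and slots (reducing it to below 140; with anything above 140 I get the runaway load condition described above). I have even tried different parallel environments that I created. I also reduced the number of slots in the queue to 140, but the load still runs away and locks up the machine. I have tried many iterations; here is my qsub script: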
#!/bin/sh
#$ -cwd
#$ -q sgi-test
## email on a - abort, b - begin, e - end
#$ -m abe
#$ -M <email address>
#source ~/.bash_profile
## for this job, specifying the threaded environment w a "-" ensures the max number of processors is used
#$ -pe threaded -
echo "slots = $NSLOTS"
export OMP_NUM_THREADS=$NSLOTS
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
echo "Running on host=$HOSTNAME"
## memory resource request per thread, max 24 for 32 threads
#$ -l h_vmem=4G
##$ -V
##this environment variable setting is needed only for OpenMP-parallelized applications
## finally! -- run your process
<path>/OMPtest
Finally, since an unlimited number of processors/slots always crashes the machine, I specified:
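#$ -pe threaded 139
Any value greater than 139 crashes the machine, yet nothing shows up in mcelog or /var/log/messages. Any insight into what might be happening would be greatly appreciated!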
Posted on 2016-11-29 05:20:46
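I solved it myself. I added the "-V" option to the script to export my environment variables to OGE/SGE, since the job ran fine in my own environment outside the scheduler. With that option it ran every time without crashing. I could track down the variable that caused the problem by process of elimination/trial and error, but I have a lot of variables. In short, "-V" fixes a lot of problems, especially if your job runs fine outside of OGE/SGE.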
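The process of elimination mentioned above can be shortened by comparing the two environments directly. The following is only a minimal sketch: the function names `dump_env` and `env_only_in_first` and the file names in the usage note are hypothetical, and it assumes no environment variable holds a multi-line value (such values would be split across lines and compared incorrectly).

```shell
#!/bin/sh
# Hypothetical helpers for comparing the interactive environment with the
# environment a batch job sees when submitted without -V.

# Write the current environment, sorted bytewise, to the file named in $1.
dump_env() {
    env | LC_ALL=C sort > "$1"
}

# Print the lines that appear only in the first dump ($1) and not in the
# second ($2). Both files must have been sorted in the C locale, which
# dump_env guarantees.
env_only_in_first() {
    LC_ALL=C comm -23 "$1" "$2"
}
```

One possible use: call `dump_env login.env` in the shell where the job runs fine, call `dump_env job.env` from inside a job script submitted without "-V", then run `env_only_in_first login.env job.env`. The output lists variables that are missing or have different values inside the job, which narrows the candidates to test one at a time.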
https://stackoverflow.com/questions/39666681