Q: Runaway CPU load on an SGI machine under SGE

Stack Overflow user

Asked on 2016-09-24 01:45:48

Answers: 1 · Views: 41 · Followers: 0 · Votes: 1

We are running OGE 2011.11 on an SGI UV2000 (SMP) with 256 hyperthreaded cores (128 physical cores). When we run an OpenMP job directly on the system, it runs fine. This is the job:

#include <cmath>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <new>
#include <omp.h>

using namespace std;

int main (
        int argc,
        char* argv[] ) {

#if _OPENMP
    // Show how many threads we have available
    int max_t = omp_get_max_threads();
    cout << "OpenMP using up to " << max_t << " threads" << endl;
#else
    cout << "!!!ERROR!!! Program not compiled for OpenMP" << endl;
    return -1;
#endif

    const long N = 115166;
    const long bytesRequested = N * N * sizeof(double);

    cout << "Allocating " << bytesRequested << " bytes for matrix" << endl;

    // Plain new throws std::bad_alloc instead of returning NULL, so use the
    // nothrow form to make the check below meaningful
    double* S = new (nothrow) double[ N * N ];

    if( NULL == S ) {
        cout << "!!!ERROR!!! Failed to allocate " << bytesRequested << " bytes" << endl;
        return -1;
    }

    cout << "Entering main loop" << endl;

    // Fill the strict upper triangle; rows are split statically across threads
#pragma omp parallel for schedule(static)
    for ( long i = 0; i < N - 1; i++ ) {
        for ( long j = i + 1; j < N; j++ ) {
#if _OPENMP
            // Report the team size once, from the thread that handles i == 0
            if( 0 == i && 1 == j ) {
                int nThreads=omp_get_num_threads();
                cout << "OpenMP loop using " << nThreads << " threads" << endl;
            }
#endif

            S[ i * N + j ] = sqrt( static_cast<double>( i + j ) );
        }
    }

    cout << "Loop completed" << endl;
    delete[] S;  // array form, matching new[]
    return 0;
}

Here is the output when it runs:

$ ./OMPtest
OpenMP using up to 256 threads
Allocating 106105660448 bytes for matrix
Entering main loop
OpenMP loop using 256 threads
Loop completed

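However, when I submit it to a queue with the following parallel environment (and others I have tried so far), the load on the CPUs skyrockets (far above 256), the system becomes completely unresponsive, and it has to be power-cycled. Here is my PE configuration: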

$ qconf -sp threaded
pe_name            threaded
slots              10000
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary TRUE

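I have varied control_slaves, job_is_first_task, and slots (reducing it to below 140; with anything above 140 I get the runaway load condition described above), and I have even used different parallel environments that I created. I also reduced the number of slots in the queue to 140, but the load still runs away and locks up the machine. I have tried many iterations of it, but here is my qsub script: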

#!/bin/sh
#$ -cwd
#$ -q sgi-test
## email on a - abort, b - begin, e - end
#$ -m abe
#$ -M <email address>
#source ~/.bash_profile
## for this job, specifying the threaded environment with a "-" ensures the max number of processors is used
#$ -pe threaded -
echo "slots = $NSLOTS"
export OMP_NUM_THREADS=$NSLOTS
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
echo "Running on host=$HOSTNAME"
## memory resource request per thread, max 24 for 32 threads
#$ -l h_vmem=4G
##$ -V
##this environment variable setting is needed only for OpenMP-parallelized applications
## finally! -- run your process
<path>/OMPtest

In the end, since an unlimited number of processors/slots always crashed the machine, I specified:

    #$ -pe threaded 139

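Any value greater than 139 crashes the machine, yet there is no output in mcelog or /var/log/messages. Any insight into what might be happening would be greatly appreciated!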

linux

multithreading

c++

1 Answer

Stack Overflow user

Answered on 2016-11-29 05:20:46

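I solved it myself. I added the -V option to the script to pass my environment variables through to OGE/SGE, because the job ran fine in my own environment outside the scheduler. With -V it ran every time without crashing. I could track down the variable that causes the problem by a process of elimination (trial and error), but I have a lot of variables. In short, -V fixes a lot of problems, especially if your job runs fine outside OGE/SGE.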

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/39666681
