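On our cluster, when I submit a job that requests more than 40 nodes (640 cores), $LSB_HOSTS is empty, so the job stops. I use this variable to generate a nodelist file, which I pass to mpirun on the command line as follows: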
#BSUB -q cpu
#BSUB -J gromacs
#BSUB -o job.out
#BSUB -e job.err
#BSUB -n 640
#####################################################
#####################################################
INPUT=test184000atoms_verlet.tpr
echo ""
echo "----------------------- INITIALIZATIONS -----------------------------"
echo ""
source /lustre/utility/intel/composer_xe_2013.3.163/bin/compilervars.sh intel64
source /lustre/utility/intel/mkl/bin/intel64/mklvars_intel64.sh
source /lustre/utility/intel/impi/4.1.1.036/bin64/mpivars.sh
MPIRUN=/path/to/intel/impi/4.1.1.036/intel64/bin/mpirun
EXE=mdrun_mpi
if test ! -x `which $EXE` ; then
echo
echo "ERROR: `which $EXE` not existent or not executable"
echo "Aborting"
exit 1
fi
CURDIR=$PWD
cd $CURDIR
rm -f nodelist >& /dev/null
touch nodelist
for host in `echo $LSB_HOSTS`
do
echo $host >> nodelist
sleep 2
done
NP=`cat nodelist |wc -l`
NN=`cat nodelist |sort |uniq|tee nodes |wc -l`
echo
echo "Executable : `which $EXE`"
echo "Working directory is $CURDIR"
echo "Running on host `hostname`"
echo "Directory is `pwd`"
echo "This job runs on $NN nodes"
echo "This job has allocated $NP core(s)"
echo
ulimit -aH
echo
ls -al
echo ""
echo "----------------------- RUN -----------------------------"
echo ""
date '+RUN STARTED ON %m/%d/%y AT %H:%M:%S'
$MPIRUN -np $NN -machinefile nodes $EXE -v -deffnm $INPUT >& $EXE.log
date '+RUN ENDED ON %m/%d/%y AT %H:%M:%S'
echo ""
echo "----------------------- DONE ----------------------------"
echo ""
ls -al
Any hints?
Can you see anything wrong with this script?
Thanks,
Eric.
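Posted on 2013-09-16 02:12:45
I finally found the solution.
The problem is that, for some reason, the variable LSB_HOSTS is sometimes not set. Fortunately, there is another one: LSB_MCPU_HOSTS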
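To illustrate the difference between the two variables: LSB_HOSTS lists one host name per allocated slot, while LSB_MCPU_HOSTS uses the compact form "host1 ncpus1 host2 ncpus2 ...". The following is only a sketch of a fallback, assuming bash; the function names expand_mcpu_hosts and slot_hosts are made up for this example and are not part of LSF.

```shell
#!/bin/bash
# Expand an LSB_MCPU_HOSTS-style string ("host1 n1 host2 n2 ...") into one
# host name per slot, which is the layout LSB_HOSTS normally has.
# (Hypothetical helper, not an LSF command.)
expand_mcpu_hosts() {
    local host slots i
    # Intentionally unquoted: split the string into words.
    set -- $1
    while [ "$#" -ge 2 ]; do
        host=$1
        slots=$2
        shift 2
        i=0
        while [ "$i" -lt "$slots" ]; do
            echo "$host"
            i=$((i + 1))
        done
    done
}

# Print one host per slot, preferring LSB_HOSTS and falling back to
# LSB_MCPU_HOSTS when LSB_HOSTS is empty or unset.
slot_hosts() {
    if [ -n "${LSB_HOSTS:-}" ]; then
        # Intentionally unquoted: one word (host) per output line.
        printf '%s\n' $LSB_HOSTS
    elif [ -n "${LSB_MCPU_HOSTS:-}" ]; then
        expand_mcpu_hosts "$LSB_MCPU_HOSTS"
    else
        echo "ERROR: neither LSB_HOSTS nor LSB_MCPU_HOSTS is set" >&2
        return 1
    fi
}
```

Inside a job script this could be used as `slot_hosts > nodelist || exit 1`, so that an empty host list aborts the job with a clear message instead of producing an empty machine file.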
For those interested, here is how I use it:
CURDIR=$PWD
cd $CURDIR
rm -f nodelist nodes n >& /dev/null
touch nodelist
touch nodes
NP=0
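# LSB_MCPU_HOSTS holds "host1 ncpus1 host2 ncpus2 ..."; turn it into one host:ncpus entry per line.
# Note: the second sed splits before ":n", so it assumes every host name starts with "n".
for host in `echo $LSB_MCPU_HOSTS | sed -e 's/ /:/g'| sed 's/:n/\nn/g'`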
do
echo $host >> nodelist
echo $host | cut -d ":" -f1 >> nodes
nn=`echo $host | cut -d ":" -f2`
NP=`echo $NP+$nn | bc`
done
NN=`cat nodelist | wc -l`
echo
echo "Executable : `which $EXE`"
echo "Working directory is $CURDIR"
echo "Running on host `hostname`"
echo "Directory is `pwd`"
echo "This job runs on $NN nodes"
echo "This job has allocated $NP core(s)"
echo
Thanks for your help.
Eric.
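The loop in the answer above splits the host list on ":n", which only works when every host name begins with "n". A possible variant that reads the "host ncpus" pairs directly, and so does not depend on the naming scheme or on bc, might look like the sketch below. It assumes bash; the function name write_nodes_file is invented for this example.

```shell
#!/bin/bash
# Read "host ncpus" pairs from an LSB_MCPU_HOSTS-style string, write one
# host name per line to the file given as the second argument, and print
# the total number of slots. (Hypothetical helper, not part of LSF.)
write_nodes_file() {
    local mcpu_hosts=$1 nodes_file=$2 total=0
    # Start from an empty file.
    : > "$nodes_file"
    # Intentionally unquoted: split the string into words.
    set -- $mcpu_hosts
    while [ "$#" -ge 2 ]; do
        echo "$1" >> "$nodes_file"   # host name
        total=$((total + $2))        # slots on that host
        shift 2
    done
    echo "$total"
}
```

In the job script, `NP=$(write_nodes_file "$LSB_MCPU_HOSTS" nodes)` followed by `NN=$(wc -l < nodes)` would then give the same NP and NN values that the loop above computes.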
https://stackoverflow.com/questions/18755582