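In my new job I manage a cluster that uses TORQUE as the resource manager and Maui as the scheduler.
I keep running into the same problem: jobs from one particular user are always sent to the debug queue. Here is the list of active queues on the system: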
Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
debug              --      --    00:20:00   --    0   0 12   E R
intel              --      --       --      --    0   0 --   E R
medium             --      --    72:00:00   --    0   0 12   E R
bighuge            --      --       --      --    0   0 --   E R
long               --      --       --      --    0   0 12   E R
                                               ----- -----
                                                   0     0

The walltime requested by the jobs this user submits is on the order of hours, so I don't understand why they are sent to the debug queue.
In addition, here is the tracejob output:
04/08/2016 15:46:48 S enqueuing into intel, state 1 hop 1
04/08/2016 15:46:48 S dequeuing from intel, state QUEUED
04/08/2016 15:46:48 S enqueuing into debug, state 1 hop 1
04/08/2016 15:46:48 S Job Queued at request of dawn@cm01, owner = dawn@cm01, job name = run01_submit.script, queue = debug
04/08/2016 15:46:49 S Job Run at request of root@cm01
04/08/2016 15:46:49 S child reported success for job after 0 seconds (dest=n20), rc=0
04/08/2016 15:46:49 S preparing to send 'b' mail for job 15631.cm01 to dawn@cm01 (---)
04/08/2016 15:46:49 S Not sending email: User does not want mail of this type.
04/08/2016 15:46:49 S obit received - updating final job usage info
04/08/2016 15:46:49 S job exit status 1 handled
04/08/2016 15:46:49 S preparing to send 'e' mail for job 15631.cm01 to dawn@cm01 (Exit_status=1
04/08/2016 15:46:49 S Not sending email: User does not want mail of this type.
04/08/2016 15:46:49 S Exit_status=1 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:00
04/08/2016 15:46:49 S on_job_exit task assigned to job
04/08/2016 15:46:49 S req_jobobit completed
04/08/2016 15:46:49 S JOB_SUBSTATE_EXITING
04/08/2016 15:46:49 S JOB_SUBSTATE_STAGEOUT
04/08/2016 15:46:49 S about to copy stdout/stderr/stageout files
04/08/2016 15:46:49 S JOB_SUBSTATE_STAGEOUT
04/08/2016 15:46:49 S JOB_SUBSTATE_STAGEDEL
04/08/2016 15:46:49 S JOB_SUBSTATE_EXITED
04/08/2016 15:46:49 S JOB_SUBSTATE_COMPLETE
04/08/2016 15:50:54 S Request invalid for state of job COMPLETE
04/08/2016 15:51:00 S Request invalid for state of job COMPLETE
04/08/2016 15:51:49 S dequeuing from debug, state COMPLETE

My current workaround is to change the queue assigned to the job by hand with the qalter command.
Any ideas?
Answered 2016-05-21 10:10:54
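Because the job jumps from the intel queue to debug immediately, I suspect that automatic routing has been configured in qmgr or in Maui. If the intel queue is set up as a routing queue, that would explain it.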
Run qmgr -c "print queue intel" to check.
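As a rough sketch of what to look for in that output: on TORQUE, a routing queue normally shows queue_type = Route together with one or more route_destinations lines. The helper below only filters the printed configuration for those two attributes. The name queue_routing_info is made up for this example and is not part of TORQUE, and the exact output may differ between versions.

```shell
#!/bin/sh
# Reads `qmgr -c "print queue NAME"` output on stdin and reports whether
# the queue is a routing queue and, if so, where it forwards jobs.
# queue_routing_info is a hypothetical helper name, not a TORQUE command.
queue_routing_info() {
    awk '
        # e.g. "set queue intel queue_type = Route"
        $1 == "set" && $2 == "queue" && $4 == "queue_type" { type = $6 }
        # e.g. "set queue intel route_destinations = debug"
        # (additional destinations are printed with "+=" instead of "=")
        $1 == "set" && $2 == "queue" && $4 == "route_destinations" {
            dests = (dests == "" ? $6 : dests "," $6)
        }
        END {
            if (tolower(type) == "route")
                print "routing queue -> " (dests == "" ? "(no destinations set)" : dests)
            else
                print "not a routing queue (queue_type=" (type == "" ? "unset" : type) ")"
        }
    '
}
```

It would be used as qmgr -c "print queue intel" | queue_routing_info. If debug shows up among the destinations, the hop seen in tracejob comes from the queue definition itself rather than from Maui.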
If it is not a routing queue, you could raise the log level to get a better view of what is happening in the pbs_server logs.
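One way to do that, assuming a stock TORQUE install, is the server's log_level attribute (0 to 7, higher is more verbose). The log path below assumes the default /var/spool/torque home directory, which may well be different on your cluster:

```shell
# Raise pbs_server logging verbosity (run as a TORQUE manager, e.g. root).
qmgr -c "set server log_level = 7"

# Resubmit a test job, then follow today's server log.
# The path is an assumption: adjust it if TORQUE_HOME is not /var/spool/torque.
tail -f /var/spool/torque/server_logs/$(date +%Y%m%d)
```

Remember to lower log_level again afterwards, since the higher levels make the server log grow quickly.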
When I create a routing queue, I get the same kind of tracejob output on submitting a job:
05/20/2016 20:04:05.439 S enqueuing into route, state 1 hop 1
05/20/2016 20:04:05.440 S dequeuing from route, state QUEUED
05/20/2016 20:04:05.440 S enqueuing into test, state 1 hop 1
05/20/2016 20:04:05.737 S Job Run at request of root@testserver
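To compare traces like these across several jobs, a small filter can reduce each one to the sequence of queues the job passed through. This is only a sketch that keys on the "enqueuing into" lines shown above; queue_hops is a made-up helper name, and other TORQUE versions may word these lines differently.

```shell
#!/bin/sh
# Reads tracejob output on stdin and prints the queues a job was
# enqueued into, in order. queue_hops is a hypothetical helper name.
queue_hops() {
    awk '
        /enqueuing into / {
            for (i = 1; i < NF; i++)
                if ($i == "into") {
                    q = $(i + 1)
                    sub(/,$/, "", q)   # drop the trailing comma after the queue name
                    path = (path == "" ? q : path " -> " q)
                }
        }
        END { if (path != "") print path }
    '
}
```

For example, tracejob 15631 | queue_hops run against the trace in the question would be expected to show the job entering intel first and debug second.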
Otherwise, check the Maui configuration and logs for clues.
https://stackoverflow.com/questions/36510690