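I am trying to submit my Condor job, but it keeps giving me an error that says:
ERROR: Can't find address of local schedd
I am a beginner and I am not quite sure what this means.
Also, when I type condor_q, I get the following error message: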
Error: Can't find address for schedd (name)
Extra Info: You probably saw this error because the condor_schedd is not running on the machine you are trying to query. If the condor_schedd is not running, the Condor system will not be able to find an address and port to connect to and satisfy this request. Please make sure the Condor daemons are running and try again.
Extra Info: If the condor_schedd is running on the machine you are trying to query and you still see the error, the most likely cause is that you have setup a personal Condor, you have not defined SCHEDD_NAME in your condor_config file, and something is wrong with your SCHEDD_ADDRESS_FILE setting. You must define either or both of those settings in your config file, or you must use the -name option to condor_q. Please see the Condor manual for details on SCHEDD_NAME and SCHEDD_ADDRESS_FILE.
Interestingly, condor_status works fine (I can see the status of all the clusters).
I did some research, and it said that I need to use a public directory to be able to access it. Is there a specific directory to use for Condor submit/queue?
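Posted on 2015-06-10 03:50:12
Check whether the Condor scheduler (condor_schedd) is running. You can use $ ps aux | grep condor to see all the condor* processes on the machine.
If the schedd is not running, you need to add it to the daemon list (the DAEMON_LIST setting) in the configuration of the central manager machine. That line contains a list such as MASTER, STARTD, NEGOTIATOR ...
By the way: condor_status works because the collector daemon is indeed running.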
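The check described above can be sketched in shell. This is only a rough illustration: daemon_list_has_schedd is a hypothetical helper name, not part of HTCondor, and the sketch assumes the condor_config_val tool is on your PATH. It reads the DAEMON_LIST value and reports whether SCHEDD appears in it.

```shell
#!/bin/sh
# Sketch only: daemon_list_has_schedd is a hypothetical helper, not an HTCondor command.

# Return 0 if the DAEMON_LIST value passed as $1 contains the entry SCHEDD.
# Entries may be separated by commas and/or whitespace.
daemon_list_has_schedd() {
    printf '%s\n' "$1" | tr ', \t' '\n\n\n' | grep -qix 'SCHEDD'
}

# Query the real configuration only when the HTCondor tools are installed.
if command -v condor_config_val >/dev/null 2>&1; then
    list=$(condor_config_val DAEMON_LIST 2>/dev/null)
    if daemon_list_has_schedd "$list"; then
        echo "SCHEDD is in DAEMON_LIST: $list"
    else
        echo "SCHEDD is missing from DAEMON_LIST: $list"
    fi
fi
```

If SCHEDD turns out to be missing, that would be consistent with condor_status working (it talks to the collector) while condor_q and condor_submit fail.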
Posted on 2018-10-28 15:51:52
This may be related to a permissions error. I ran into the same error, and running the following commands resolved it.
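sudo mkdir -p /var/run/condor # If it does not exist
sudo mkdir -p /var/lock/condor # If it does not exist
# Recreate the spool and execute directories from scratch.
# Warning: this deletes everything under /var/lib/condor, including any queued jobs.
sudo rm -rf /var/lib/condor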
sudo mkdir -p /var/lib/condor/spool/local_univ_execute
sudo mkdir -p /var/lib/condor/execute
sudo chown -R condor: /var/lib/condor
sudo chmod 1777 /var/lib/condor/spool/local_univ_execute
sudo chmod 1777 /var/lib/condor/execute
sudo mkdir -p /var/log/condor/
sudo chown -R condor: /var/log/condor
sudo chmod 1777 /var/log/condor
# Stop all the Condor daemons you have running, then start Condor again.
sudo service condor stop
sudo killall condor
sudo killall condor_procd
sudo service condor start # Condor should run as a system service.
$ ps auxwwww | grep condor # You should see all processes run under condor.
condor 7656 0.0 0.2 47508 4644 ? Ss 08:43 0:00 /usr/sbin/condor_master -pidfile /var/run/condor/condor.pid
root 7699 0.2 0.1 24384 3920 ? S 08:43 0:00 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 126
condor 7700 0.0 0.2 47004 5436 ? Ss 08:43 0:00 condor_shared_port -f
condor 7701 0.1 0.3 57252 6620 ? Ss 08:43 0:00 condor_collector -f
condor 7704 0.1 0.3 48352 6816 ? Ss 08:43 0:00 condor_startd -f
condor 7705 0.0 0.3 58052 7188 ? Ss 08:43 0:00 condor_schedd -f
condor 7706 0.0 0.2 47500 5880 ? Ss 08:43 0:00 condor_negotiator -f
$ condor_q # Check whether condor_q works now.
-- Schedd: condor@ebloc : <127.0.0.1:9618?... @ 10/26/18 08:46:06
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTAL JOB_IDS
0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Posted on 2020-10-22 22:51:13
For me, the problem was that you cannot submit batch jobs from inside an interactive job. Make sure you are on the head node.
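A quick way to check this before submitting is to compare the current hostname against the name of the submit host. The sketch below is only an illustration: on_submit_host and SUBMIT_HOST_PATTERN are hypothetical names, and the pattern assumes the head node is named like the one shown in this answer, so adjust it for your own cluster.

```shell
#!/bin/sh
# Sketch only: on_submit_host and SUBMIT_HOST_PATTERN are hypothetical names.

# Return 0 if hostname $1 matches the shell glob pattern $2.
on_submit_host() {
    case "$1" in
        $2) return 0 ;;
        *)  return 1 ;;
    esac
}

# Assumption: the head (submit) node is named like the one in this answer.
SUBMIT_HOST_PATTERN='vision-sched.*'

if on_submit_host "$(hostname)" "$SUBMIT_HOST_PATTERN"; then
    echo "On a submit host."
else
    echo "Not on a submit host ($(hostname)); log in to the head node first."
fi
```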
My head node:
(automl-meta-learning) miranda9~/automl-meta-learning $ hostname
vision-sched.cs.illinois.edu
Compute node:
(automl-meta-learning) miranda9~/automl-meta-learning $ hostname
vision-19.cs.illinois.edu
https://stackoverflow.com/questions/30722791