I'm trying to install Slurm on a cluster running Ubuntu 16.04.
I'm using Intel MPI, installed on the head node under /opt/intel/impi_5.01.
According to the Slurm instructions, a variable pointing to libpmi.so has to be exported: https://slurm.schedmd.com/mpi_guide.html#intel_mpi
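The Intel MPI section of that guide asks for `I_MPI_PMI_LIBRARY` to point at Slurm's `libpmi.so`. Below is a minimal sketch of doing that, assuming a Debian/Ubuntu library layout (the three search directories are guesses, not taken from the guide, and `find_libpmi` is a hypothetical helper). It is meant to be sourced so the export reaches the current shell. Running `dpkg -S` on the path it finds shows which package owns the file, i.e. whether it really comes from Slurm:

```shell
#!/bin/bash
# Sketch: locate a PMI library and export I_MPI_PMI_LIBRARY for Intel MPI.
# Usage: source this file. The default search directories are assumptions.

# find_libpmi NAME [DIR...]: print the first DIR/NAME that exists.
find_libpmi() {
    local name d
    name=$1
    shift
    if [ "$#" -eq 0 ]; then
        # Assumed default locations on a Debian/Ubuntu x86_64 system
        set -- /usr/lib/x86_64-linux-gnu /usr/lib /usr/local/lib
    fi
    for d in "$@"; do
        if [ -e "$d/$name" ]; then
            printf '%s\n' "$d/$name"
            return 0
        fi
    done
    return 1
}

if lib=$(find_libpmi libpmi.so); then
    export I_MPI_PMI_LIBRARY="$lib"
    echo "I_MPI_PMI_LIBRARY=$I_MPI_PMI_LIBRARY"
    # Show which package installed the file (expected: a Slurm PMI package)
    dpkg -S "$lib" 2>/dev/null || echo "dpkg does not know $lib" >&2
else
    echo "libpmi.so not found in the default directories" >&2
fi
```

Passing directories explicitly (for example `find_libpmi libpmi.so /usr/local/slurm/lib`, a hypothetical path) searches those instead of the defaults.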
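However, I installed Slurm from the Ubuntu package slurm-llnl:
sudo apt-get install slurm-llnl
so I don't know where libpmi.so is. I searched and found the file below; is this the one I'm looking for?
/usr/lib/x86_64-linux-gnu/libpmi.so
Anyway, I exported the variable and ran
srun -p old -N3 -n24 hostname
It returned: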
rolly@head:~$ srun -p old -N3 -n24 hostname
node02
node02
node02
node02
node02
node02
node02
node02
node01
node01
head
head
node01
head
head
head
node01
node01
head
node01
head
head
node01
node01
That seems to work.
But when I run the actual job,
srun -p old -N3 -n24 ~/QE530-CPU/espresso-5.3.0/bin/pw.x
it produces these errors:
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
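2. an mpd is running but was started without a "console" (-n option)
The same message is repeated for the other tasks on node01 and node02. I think the error occurs because mpiexec is being run with Intel MPI, so mpirun should be used instead.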
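The messages are printed by an `mpiexec` that expects an MPD daemon, so some `mpiexec` is evidently being started on the compute nodes. Below is a small diagnostic sketch, not a fix, that shows which `mpiexec`/`mpirun` are first on `PATH` and whether `srun --mpi=list` reports a `pmi2` plugin. The helper names are hypothetical:

```shell
#!/bin/bash
# Sketch: report which MPI launchers are visible and whether srun offers pmi2.

# Print where each launcher resolves on the current PATH.
report_launchers() {
    local name path
    for name in mpiexec mpirun mpiexec.hydra; do
        if path=$(command -v "$name"); then
            echo "$name -> $path"
        else
            echo "$name -> (not on PATH)"
        fi
    done
}

# Read plugin listing on stdin; succeed if the given plugin name appears as a word.
has_mpi_plugin() {
    grep -qw -- "$1"
}

report_launchers
if command -v srun >/dev/null 2>&1; then
    # Some Slurm versions print this list on stderr, hence 2>&1
    if srun --mpi=list 2>&1 | has_mpi_plugin pmi2; then
        echo "srun supports --mpi=pmi2"
    else
        echo "pmi2 not listed by srun --mpi=list"
    fi
fi
```

Running the script through `srun -p old -N1 -n1 bash <script>` shows the `PATH` a task actually sees, which can differ from the login shell.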
How can I fix this?
Thanks!
Posted on 2017-01-15 15:41:27
I found a solution.
1) sudo apt-get install mpich
2) srun --mpi=pmi2
3) Load the MKL and other Intel environment variables correctly.
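Steps 2 and 3 can be combined into a batch script. This is a minimal sketch under stated assumptions: the locations of `mklvars.sh` and `mpivars.sh` are hypothetical (use the ones from your own Intel installation), the partition, node, and task counts are copied from the question, and step 1 (installing mpich) is a one-time package install, so it is left out. Depending on the Slurm and Intel MPI versions, the Slurm MPI guide may also require `I_MPI_PMI_LIBRARY` to point at Slurm's PMI2 library; that is not shown here.

```shell
#!/bin/bash
#SBATCH --partition=old
#SBATCH --nodes=3
#SBATCH --ntasks=24

# Hypothetical paths; override via the environment or edit to match your install.
MKLVARS=${MKLVARS:-/opt/intel/mkl/bin/mklvars.sh}
MPIVARS=${MPIVARS:-/opt/intel/impi_5.01/bin64/mpivars.sh}
PW_X=${PW_X:-$HOME/QE530-CPU/espresso-5.3.0/bin/pw.x}

# source_if_present SCRIPT [ARGS...]: source SCRIPT if readable, else warn and fail.
source_if_present() {
    local script=$1
    shift
    if [ -r "$script" ]; then
        . "$script" "$@"
    else
        echo "warning: $script not found, skipping" >&2
        return 1
    fi
}

# Step 3: load the MKL (intel64) and Intel MPI environments.
load_intel_env() {
    local rc=0
    source_if_present "$MKLVARS" intel64 || rc=1
    source_if_present "$MPIVARS" || rc=1
    return "$rc"
}

# Step 2: let srun start the ranks through Slurm's PMI2 plugin.
run_pw() {
    srun --mpi=pmi2 "$PW_X" "$@"
}

# Launch only inside a Slurm allocation (sbatch sets SLURM_JOB_ID).
if [ -n "${SLURM_JOB_ID:-}" ]; then
    load_intel_env
    run_pw
fi
```

Submitted with `sbatch job.sh`, the launch happens inside the allocation; outside one, the file only defines the helpers, so it can be sourced without starting anything.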
I hope this helps anyone with a similar problem.
https://askubuntu.com/questions/872091