I am unable to run batch jobs with my installation of OpenMPI and SLURM. I am
not sure whether this is an OpenMPI issue or a SLURM issue, but here is what is
happening on my little cluster (3 nodes: one login node and 2 backend nodes,
each with 2 dual-core CPUs). If I run
salloc -n 8 mpirun -np 8 myprog
then both backend nodes get allocated (with their total of 8 cores) and myprog
runs. However, if I run
sbatch -n 8 zrun.sh
where zrun.sh contains
mpirun -np 8 myprog
then again both backend nodes get allocated, but the job does not run. In top I
see one mpirun and two srun processes on the first backend node, but they just
seem to be sitting there. On the other backend node I see no mpirun, srun, or
anything else that might have been started as a result of the batch job.
Is this the correct way to initiate SLURM batch jobs with OpenMPI?
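In case it helps, this is essentially what I would expect a fuller version of
zrun.sh to look like, with the usual #SBATCH directives added (myprog stands in
for my actual program):

```shell
#!/bin/bash
#SBATCH -n 8          # request 8 tasks, matching the salloc -n 8 case
#SBATCH -N 2          # spread the tasks over both backend nodes
#SBATCH -o zrun.out   # collect stdout/stderr in a file

# mpirun should pick up the allocation from the SLURM environment;
# with -np 8 it should start one process per allocated core.
mpirun -np 8 myprog
```

Submitted with sbatch zrun.sh, this hangs in exactly the same way as the
one-line script above.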
I also see the following error in the SLURM log on the second backend node:
May 26 16:15:21 localhost slurmd: launch task 82.0 request from 1001.1001_at_127.0.0.1 (port 21721)
May 26 16:15:21 localhost slurmstepd: jobacct NONE plugin loaded
May 26 16:15:21 localhost slurmstepd: error: connect io: Connection
May 26 16:15:21 localhost slurmd[node21]: error: IO setup failed:
May 26 16:15:21 localhost slurmd[node21]: error: job_manager exiting abnormally, rc = 4020
May 26 16:15:21 localhost slurmd[node21]: done with job
The job number assigned by SLURM at submission was 82.
What am I doing incorrectly? Is it possible that something in my environment
(PATH, LD_LIBRARY_PATH, or the SLURM configuration) is not set up correctly?