Subject: Re: [OMPI users] trouble using openmpi under slurm
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-07-06 14:38:34


It would really help if you told us what version of OMPI you are using, and what version of SLURM.
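(If it helps, and assuming reasonably standard installs, something like

  mpirun --version
  srun --version

should print the Open MPI and SLURM releases, respectively; the exact flags may differ on older builds.)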

On Jul 6, 2010, at 12:16 PM, David Roundy wrote:

> Hi all,
>
> I'm running into trouble running an Open MPI job under SLURM. I
> imagine the trouble may be in my SLURM configuration, but since the
> error itself involves mpirun crashing, I thought I'd best ask here
> first. The error message I get is:
>
> --------------------------------------------------------------------------
> All nodes which are allocated for this job are already filled.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
>
> This shows up when I run my MPI job with the following script:
>
> #!/bin/sh
> set -ev
> hostname
> mpirun pw.x < pw.in > pw.out 2> errors_pw
> (end of submit.sh)
>
> if I submit using
>
> sbatch -c 2 submit.sh
>
> If I use "-N 2" instead of "-c 2", the job runs fine, but runs on two
> separate nodes, rather than two separate cores on a single node (which
> makes it extremely slow). I know that the problem is related somehow
> to the environment variables that are passed to Open MPI by SLURM,
> since I can fix the crash by changing my script to read:
>
> #!/bin/sh
> set -ev
> hostname
> # clear SLURM environment variables
> for i in `env | awk -F= '/SLURM/ {print $1}' | grep SLURM`; do
> echo unsetting $i
> unset $i
> done
> mpirun -np 2 pw.x < pw.in > pw.out 2> errors_pw
>
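One thing that might be worth trying, assuming what you actually want is two MPI ranks on a single node: ask SLURM for two tasks rather than two CPUs per task when submitting, e.g.

  sbatch -N 1 -n 2 submit.sh

With "-c 2", SLURM allocates a single task with two CPUs, which Open MPI may read as only one available slot; "-n 2" (or "--ntasks=2") advertises two slots on the allocation instead. Whether that matches your cluster's configuration is only a guess on my part.
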
> So you can see that I just clear all the environment variables and
> then specify the number of processors to use manually. I suppose I
> could use a bisection approach to figure out which environment
> variable is triggering this crash, and then could either edit my
> script to just modify that variable, or could figure out how to make
> SLURM pass things differently. But I thought that before entering
> upon this laborious process, it'd be worth asking on the list to see
> if anyone has a suggestion as to what might be going wrong. I'll be
> happy to provide my SLURM config (or anything else that seems useful)
> if you think that would be helpful!
> --
> David Roundy
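
Since you mention bisecting the environment: a minimal (untested) sketch of the even simpler linear scan, assuming the same pw.x invocation as in submit.sh, would be something like

  #!/bin/sh
  # Try the job with each SLURM_* variable removed one at a time.
  for v in `env | awk -F= '/^SLURM/ {print $1}'`; do
    echo "trying with $v unset"
    if ( unset $v; mpirun pw.x < pw.in > pw.out 2> errors_pw ); then
      echo "$v looks like the trigger"
    fi
  done

Each iteration runs mpirun in a subshell with exactly one SLURM variable unset, so a run that succeeds points at the variable that provokes the crash.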