
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] trouble using openmpi under slurm
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-07-06 15:31:40


Thanks - that helps.

As you note, the issue is that OMPI doesn't support the core-level allocation options of slurm - never has, probably never will. What I found interesting, though, was that your envars don't indicate anywhere that this is what you requested. I don't see anything there that would cause the daemon to crash.

So I'm left to guess that this is an issue where slurm doesn't like something OMPI does because it violates that core-level option. Can you add --display-devel-map to your mpirun command? It would be interesting to see where it thinks the daemon should go.
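For example, the mpirun line in your submit.sh would become something like the following (everything else unchanged; --display-devel-map just makes mpirun print its planned process map before launching, so the job itself is unaffected):

```shell
#!/bin/sh
set -ev
hostname
# --display-devel-map prints a detailed map of where mpirun intends to
# place the daemons and application processes before it launches them
mpirun --display-devel-map pw.x < pw.in > pw.out 2> errors_pw
```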

Just to check - the envars you sent in your other note came from the sbatch -c 2 run, yes?
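As an aside: rather than clearing the whole SLURM environment as your modified script does, you could unset only variables whose names start with SLURM, and then bisect from there. A minimal, self-contained sketch (the two exported variables are just illustrative stand-ins for what slurm would set):

```shell
#!/bin/sh
# Simulate a couple of SLURM-provided envars (illustrative values only)
export SLURM_CPUS_PER_TASK=2
export SLURM_NNODES=1

# Unset every variable whose name starts with SLURM
for v in $(env | awk -F= '/^SLURM/ {print $1}'); do
    echo "unsetting $v"
    unset "$v"
done

# Verify nothing SLURM-related remains
env | grep '^SLURM' || echo "no SLURM variables remain"
```

To bisect, you would replace the loop with an explicit list of the variable names and comment out half of them at a time until the crash reappears.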

On Jul 6, 2010, at 12:42 PM, David Roundy wrote:

> Ah yes,
>
> It's the versions of each that are packaged in debian testing, which
> are openmpi 1.4.1 and slurm 2.1.9.
>
> David
>
> On Tue, Jul 6, 2010 at 11:38 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>> It would really help if you told us what version of OMPI you are using, and what version of SLURM.
>>
>>
>> On Jul 6, 2010, at 12:16 PM, David Roundy wrote:
>>
>>> Hi all,
>>>
>>> I'm running into trouble running an openmpi job under slurm. I
>>> imagine the trouble may be in my slurm configuration, but since the
>>> error itself involves mpirun crashing, I thought I'd best ask here
>>> first. The error message I get is:
>>>
>>> --------------------------------------------------------------------------
>>> All nodes which are allocated for this job are already filled.
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>>> launch so we are aborting.
>>>
>>> There may be more information reported by the environment (see above).
>>>
>>> This may be because the daemon was unable to find all the needed shared
>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>> location of the shared libraries on the remote nodes and this will
>>> automatically be forwarded to the remote nodes.
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpirun noticed that the job aborted, but has no info as to the process
>>> that caused that situation.
>>> --------------------------------------------------------------------------
>>> mpirun: clean termination accomplished
>>>
>>> This shows up when I run my MPI job with the following script:
>>>
>>> #!/bin/sh
>>> set -ev
>>> hostname
>>> mpirun pw.x < pw.in > pw.out 2> errors_pw
>>> (end of submit.sh)
>>>
>>> if I submit using
>>>
>>> sbatch -c 2 submit.sh
>>>
>>> If I use "-N 2" instead of "-c 2", the job runs fine, but runs on two
>>> separate nodes, rather than two separate cores on a single node (which
>>> makes it extremely slow). I know that the problem is related somehow
>>> to the environment variables that are passed to openmpi by slurm,
>>> since I can fix the crash by changing my script to read:
>>>
>>> #!/bin/sh
>>> set -ev
>>> hostname
>>> # clear SLURM environment variables
>>> for i in `env | awk -F= '/SLURM/ {print $1}' | grep SLURM`; do
>>> echo unsetting $i
>>> unset $i
>>> done
>>> mpirun -np 2 pw.x < pw.in > pw.out 2> errors_pw
>>>
>>> So you can see that I just clear all the environment variables and
>>> then specify the number of processors to use manually. I suppose I
>>> could use a bisection approach to figure out which environment
>>> variable is triggering this crash, and then could either edit my
>>> script to just modify that variable, or could figure out how to make
>>> slurm pass things differently. But I thought that before entering
>>> upon this laborious process, it'd be worth asking on the list to see
>>> if anyone has a suggestion as to what might be going wrong? I'll be
>>> happy to provide my slurm config (or anything else that seems useful)
>>> if you think that would be helpful!
>>> --
>>> David Roundy
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>>
>
>
>
> --
> David Roundy
>