Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] SLURM environment variables at runtime
From: Henderson, Brent (brent.henderson_at_[hidden])
Date: 2011-02-24 14:59:30


> -----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
> Behalf Of Jeff Squyres
> Sent: Thursday, February 24, 2011 10:20 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] SLURM environment variables at runtime
>
> On Feb 24, 2011, at 11:15 AM, Henderson, Brent wrote:
>
> > Note that the parent of the sleep processes is orted and that orted
> > was started by slurmstepd. Unless orted is updating the slurm
> > variables for the children (which is doubtful) then they will not
> > contain the specific settings that I see when I run srun directly.
>
> I'm not sure what you mean by that statement. The orted passes its
> environment to its children; so whatever the slurm stepd set in the
> environment for the orted, the children should be getting.
>

While you are correct that the environment is inherited by the children, sometimes that does not make sense. Take SLURM_PROCID, for example. If slurmstepd starts the orted and sets its SLURM_PROCID, then the sleep processes (the children of orted) get exactly the same value as orted. That is misleading at best. For example:

[brent_at_node2 mpi]$ mpirun -np 4 --bynode sleep 300

Then looking at the remote node:

[brent_at_node1 mpi]$ ps -fu brent
UID PID PPID C STIME TTY TIME CMD
brent 2853 2850 0 13:23 ? 00:00:00 /mnt/node1/home/brent/bin/openmpi143/bin/orted -mca
brent 2856 2853 0 13:23 ? 00:00:00 sleep 300
brent 2857 2853 0 13:23 ? 00:00:00 sleep 300
(snip)

And the SLURM_PROCID from each process:

[brent_at_node1 mpi]$ perl -p -e 's/\0/\n/g' /proc/2853/environ | egrep ^SLURM_ | grep PROCID
SLURM_PROCID=0
[brent_at_node1 mpi]$ perl -p -e 's/\0/\n/g' /proc/2856/environ | egrep ^SLURM_ | grep PROCID
SLURM_PROCID=0
[brent_at_node1 mpi]$ perl -p -e 's/\0/\n/g' /proc/2857/environ | egrep ^SLURM_ | grep PROCID
SLURM_PROCID=0
[brent_at_node1 mpi]$

They really can't all be SLURM_PROCID=0 - that is supposed to be unique within the job, right? It appears that SLURM_PROCID is inherited from the orted parent, which makes a fair amount of sense given how things are launched. If I use HP-MPI, slurmstepd starts each of the sleep processes itself and sets SLURM_PROCID uniquely when launching each one. This is the crux of my issue.
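
The inheritance described above is ordinary process behaviour and can be reproduced without SLURM or Open MPI. What follows is only a minimal Python sketch: the function `launch_children` is a hypothetical stand-in for orted, and the child command just prints the value it sees. Every child reports the launcher's SLURM_PROCID because nothing rewrites it in between.

```python
import os
import subprocess
import sys


def launch_children(count, launcher_procid="0"):
    """Start `count` children the way a launcher daemon would.

    Hypothetical stand-in for orted: the launcher's own environment
    (including its SLURM_PROCID) is handed to each child unchanged.
    """
    launcher_env = dict(os.environ, SLURM_PROCID=launcher_procid)
    report = "import os; print(os.environ.get('SLURM_PROCID'))"
    seen = []
    for _ in range(count):
        result = subprocess.run(
            [sys.executable, "-c", report],
            env=launcher_env,      # inherited as-is, no per-child fix-up
            capture_output=True,
            text=True,
            check=True,
        )
        seen.append(result.stdout.strip())
    return seen


if __name__ == "__main__":
    # Each child sees the same value the launcher was given.
    print(launch_children(2))
```

A launcher that started each task itself could set a distinct value per child before spawning it, which is the difference from the orted case.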

I did find that there are OMPI_* variables that I can map back to what I think the SLURM variables should be:

[brent_at_node1 mpi]$ perl -p -e 's/\0/\n/g' /proc/2853/environ | egrep ^OMPI | grep WORLD
[brent_at_node1 mpi]$ perl -p -e 's/\0/\n/g' /proc/2856/environ | egrep ^OMPI | grep WORLD
OMPI_COMM_WORLD_SIZE=4
OMPI_COMM_WORLD_LOCAL_SIZE=2
OMPI_COMM_WORLD_RANK=1
OMPI_COMM_WORLD_LOCAL_RANK=0
[brent_at_node1 mpi]$ perl -p -e 's/\0/\n/g' /proc/2857/environ | egrep ^OMPI | grep WORLD
OMPI_COMM_WORLD_SIZE=4
OMPI_COMM_WORLD_LOCAL_SIZE=2
OMPI_COMM_WORLD_RANK=3
OMPI_COMM_WORLD_LOCAL_RANK=1
[brent_at_node1 mpi]$

So, I think if I combine some OMPI_* things with SLURM_* things, I should be OK for what I need.
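
One way to do that combination, sketched in Python. The correspondence used here (OMPI_COMM_WORLD_RANK for SLURM_PROCID, OMPI_COMM_WORLD_LOCAL_RANK for SLURM_LOCALID, OMPI_COMM_WORLD_SIZE for SLURM_NTASKS) is an assumption about which variables line up, not something confirmed in this thread, and the helper names are made up for illustration.

```python
import os

# Assumed mapping from per-process Open MPI variables to the SLURM
# variables they would stand in for (an assumption, not a documented rule).
OMPI_TO_SLURM = {
    "OMPI_COMM_WORLD_RANK": "SLURM_PROCID",
    "OMPI_COMM_WORLD_LOCAL_RANK": "SLURM_LOCALID",
    "OMPI_COMM_WORLD_SIZE": "SLURM_NTASKS",
}


def parse_environ(raw):
    """Parse NUL-separated NAME=VALUE bytes, as found in /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\0"):
        if not entry:
            continue
        name, _, value = entry.decode(errors="replace").partition("=")
        env[name] = value
    return env


def effective_slurm_env(env):
    """Return the SLURM_* variables from `env`, with the per-process ones
    replaced by the matching OMPI_* values when those are present."""
    merged = {k: v for k, v in env.items() if k.startswith("SLURM_")}
    for ompi_name, slurm_name in OMPI_TO_SLURM.items():
        if ompi_name in env:
            merged[slurm_name] = env[ompi_name]
    return merged


if __name__ == "__main__":
    # Inside an MPI process, apply the mapping to the current environment.
    for name, value in sorted(effective_slurm_env(dict(os.environ)).items()):
        print(f"{name}={value}")
```

Variables that only the job as a whole defines pass through untouched; only the per-process values are overridden.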

Now to answer the other question: why are some variables missing? It appears that when the orted processes are launched via srun, but only one per node, the launch uses a subset of the main allocation, and thus some of the environment variables differ (or are missing entirely) compared to launching the processes directly with srun on the full allocation. This also makes sense to me at some level, so I'm at peace with it now. :)
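
To see exactly which variables differ between the two launch paths, the two environments can be compared directly. This is a small Python sketch with a hypothetical helper name; it assumes both environments have already been captured as dictionaries (for example from /proc/<pid>/environ on each side).

```python
def diff_environ(direct, via_orted, prefix="SLURM_"):
    """Compare two environments, restricted to names starting with `prefix`.

    Returns three sorted lists: names only in `direct`, names only in
    `via_orted`, and names present in both with different values.
    """
    names = {n for n in (*direct, *via_orted) if n.startswith(prefix)}
    missing = sorted(n for n in names if n in direct and n not in via_orted)
    extra = sorted(n for n in names if n in via_orted and n not in direct)
    changed = sorted(
        n for n in names
        if n in direct and n in via_orted and direct[n] != via_orted[n]
    )
    return missing, extra, changed
```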

> Clearly, something is different here -- maybe we do have a bug -- but
> as you stated below, why does it work for me? Is SLURM 2.2.x the
> difference? I don't know.
>
I'm tempted to try the older version of slurm as this might be the cause of the missing environment variables, but that is an experiment for another day. I'll see if I can make do with what I see currently.

> > Now, the question still is, why does this work for Jeff? :) Is
> > there a way to get orted out of the way so the sleep processes are
> > launched directly by srun?
>
> Yes; see Ralph's prior mail about direct srun support in Open MPI
> 1.5.x. You lose some functionality / features that way, though.
>
Maybe that will be an answer, but I'll see if I can make things work with 1.4.3 for now.

Last thing before I go: please let me apologize for not being clear about what I disagreed with Ralph on in my last note. He nailed the orted launching process and spelled it out very clearly, but I don't believe that HP-MPI is doing anything special to copy or fix up the SLURM environment variables. Hopefully that was clear from the body of that message.

I think we are done here as I think I can make something work with the various environment variables now. Many thanks to Jeff and Ralph for their suggestions and insight on this issue!

Brent