Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] SLURM environment variables at runtime
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-02-24 11:05:17


I would talk to the slurm folks about it - I don't know anything about the
internals of HP-MPI, but I do know the relevant OMPI internals. OMPI doesn't
do anything with respect to the envars. We just use "srun -hostlist <fff>"
to launch the daemons. Each daemon subsequently gets a message telling it
what local procs to run, and then fork/exec's those procs. The environment
set for those procs is a copy of that given to the daemon, including any and
all slurm values.

So whatever slurm sets, your procs get.
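
A rough way to see the inheritance described above without a cluster: the sketch below uses a subshell as a stand-in for the daemon and a child `sh` as a stand-in for a local proc. The function name `child_sees` and the variable `SLURM_DEMO_VAR` are made up for illustration and are not part of OMPI or slurm.

```shell
# child_sees NAME VALUE
# Put NAME=VALUE into the environment of a subshell (the stand-in "daemon"),
# then start a separate child process and print what that child sees.
child_sees() {
    (
        export "$1=$2"
        # The child receives a copy of the subshell's environment.
        sh -c 'printenv "$0"' "$1"
    )
}

# Example: child_sees SLURM_DEMO_VAR node1
```

The child is handed nothing explicitly; it only sees what was already in the environment of the process that started it, which is the point made above.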

My guess is that HP-MPI is doing something with the envars to create the
difference.

As for running OMPI procs directly from srun: the slurm folks put out a FAQ
(or its equivalent) on it, I believe. I don't recall the details (even
though I wrote the integration...). If you search our user and/or devel
mailing lists, though, you'll see threads discussing it. Look for "slurmd"
in the text - that's the ORTE integration module for that feature.

On Thu, Feb 24, 2011 at 8:55 AM, Henderson, Brent <brent.henderson_at_[hidden]> wrote:

> I'm running OpenMPI v1.4.3 and slurm v2.2.1. I built both with the default
> configuration except setting the prefix. The tests were run on the exact
> same nodes (I only have two).
>
> When I run the test you outline below, I am still missing a bunch of env
> variables with OpenMPI. I ran the extra test of using HP-MPI and they are
> all present as with the srun invocation. I don't know if this is my slurm
> setup or not, but I find this really weird. If anyone knows the magic to
> make the fix that Ralph is referring to, I'd appreciate a pointer.
>
> My guess was that there is a subtle way that the launch differs between the
> two products. But, since it works for Jeff, maybe there really is a slurm
> option that I need to compile in or set to make this work the way I want.
> It is not as simple as HP-MPI moving the environment variables itself as
> some of the numbers will change per process created on the remote nodes.
>
> Thanks,
>
> Brent
>
> [brent_at_node2 mpi]$ salloc -N 2
> salloc: Granted job allocation 29
> [brent_at_node2 mpi]$ srun env | egrep ^SLURM_ | head
> SLURM_NODELIST=node[1-2]
> SLURM_NNODES=2
> SLURM_JOBID=29
> SLURM_TASKS_PER_NODE=1(x2)
> SLURM_JOB_ID=29
> SLURM_NODELIST=node[1-2]
> SLURM_NNODES=2
> SLURM_JOBID=29
> SLURM_TASKS_PER_NODE=1(x2)
> SLURM_JOB_ID=29
> [brent_at_node2 mpi]$ srun env | egrep ^SLURM_ | wc -l
> 66
> [brent_at_node2 mpi]$ srun env | egrep ^SLURM_ | sort > srun.out
> [brent_at_node2 mpi]$ which mpirun
> ~/bin/openmpi143/bin/mpirun
> [brent_at_node2 mpi]$ mpirun -np 2 --bynode env | egrep ^SLURM_ | head
> SLURM_NODELIST=node[1-2]
> SLURM_NNODES=2
> SLURM_JOBID=29
> SLURM_TASKS_PER_NODE=8(x2)
> SLURM_JOB_ID=29
> SLURM_SUBMIT_DIR=/mnt/node1/home/brent/src/mpi
> SLURM_JOB_NODELIST=node[1-2]
> SLURM_JOB_CPUS_PER_NODE=8(x2)
> SLURM_JOB_NUM_NODES=2
> SLURM_NODELIST=node[1-2]
> [brent_at_node2 mpi]$ which mpirun
> ~/bin/openmpi143/bin/mpirun
> [brent_at_node2 mpi]$ mpirun -np 2 --bynode env | egrep ^SLURM_ | wc -l
> 42 <-- note, not 66 as above!
> [brent_at_node2 mpi]$ mpirun -np 2 --bynode env | egrep ^SLURM_ | sort > mpirun.out
> [brent_at_node2 mpi]$ diff srun.out mpirun.out
> 2d1
> < SLURM_CHECKPOINT_IMAGE_DIR=/mnt/node1/home/brent/src/mpi
> 4,5d2
> < SLURM_CPUS_ON_NODE=8
> < SLURM_CPUS_PER_TASK=1
> 8d4
> < SLURM_DISTRIBUTION=cyclic
> 10d5
> < SLURM_GTIDS=1
> 22,23d16
> < SLURM_LAUNCH_NODE_IPADDR=10.0.205.134
> < SLURM_LOCALID=0
> 25c18
> < SLURM_NNODES=2
> ---
> > SLURM_NNODES=1
> 28d20
> < SLURM_NODEID=1
> 31,35c23,24
> < SLURM_NPROCS=2
> < SLURM_NPROCS=2
> < SLURM_NTASKS=2
> < SLURM_NTASKS=2
> < SLURM_PRIO_PROCESS=0
> ---
> > SLURM_NPROCS=1
> > SLURM_NTASKS=1
> 38d26
> < SLURM_PROCID=1
> 40,56c28,35
> < SLURM_SRUN_COMM_HOST=10.0.205.134
> < SLURM_SRUN_COMM_PORT=43247
> < SLURM_SRUN_COMM_PORT=43247
> < SLURM_STEP_ID=2
> < SLURM_STEP_ID=2
> < SLURM_STEPID=2
> < SLURM_STEPID=2
> < SLURM_STEP_LAUNCHER_PORT=43247
> < SLURM_STEP_LAUNCHER_PORT=43247
> < SLURM_STEP_NODELIST=node[1-2]
> < SLURM_STEP_NODELIST=node[1-2]
> < SLURM_STEP_NUM_NODES=2
> < SLURM_STEP_NUM_NODES=2
> < SLURM_STEP_NUM_TASKS=2
> < SLURM_STEP_NUM_TASKS=2
> < SLURM_STEP_TASKS_PER_NODE=1(x2)
> < SLURM_STEP_TASKS_PER_NODE=1(x2)
> ---
> > SLURM_SRUN_COMM_PORT=45154
> > SLURM_STEP_ID=5
> > SLURM_STEPID=5
> > SLURM_STEP_LAUNCHER_PORT=45154
> > SLURM_STEP_NODELIST=node1
> > SLURM_STEP_NUM_NODES=1
> > SLURM_STEP_NUM_TASKS=1
> > SLURM_STEP_TASKS_PER_NODE=1
> 59,62c38,40
> < SLURM_TASK_PID=1381
> < SLURM_TASK_PID=2288
> < SLURM_TASKS_PER_NODE=1(x2)
> < SLURM_TASKS_PER_NODE=1(x2)
> ---
> > SLURM_TASK_PID=1429
> > SLURM_TASKS_PER_NODE=1
> > SLURM_TASKS_PER_NODE=8(x2)
> 64,65d41
> < SLURM_TOPOLOGY_ADDR=node2
> < SLURM_TOPOLOGY_ADDR_PATTERN=node
> [brent_at_node2 mpi]$
> [brent_at_node2 mpi]$
> [brent_at_node2 mpi]$
> [brent_at_node2 mpi]$
> [brent_at_node2 mpi]$ /opt/hpmpi/bin/mpirun -srun -n 2 -N 2 env | egrep ^SLURM_ | sort > hpmpi.out
> [brent_at_node2 mpi]$ diff srun.out hpmpi.out
> 20a21,22
> > SLURM_KILL_BAD_EXIT=1
> > SLURM_KILL_BAD_EXIT=1
> 41,48c43,50
> < SLURM_SRUN_COMM_PORT=43247
> < SLURM_SRUN_COMM_PORT=43247
> < SLURM_STEP_ID=2
> < SLURM_STEP_ID=2
> < SLURM_STEPID=2
> < SLURM_STEPID=2
> < SLURM_STEP_LAUNCHER_PORT=43247
> < SLURM_STEP_LAUNCHER_PORT=43247
> ---
> > SLURM_SRUN_COMM_PORT=33347
> > SLURM_SRUN_COMM_PORT=33347
> > SLURM_STEP_ID=8
> > SLURM_STEP_ID=8
> > SLURM_STEPID=8
> > SLURM_STEPID=8
> > SLURM_STEP_LAUNCHER_PORT=33347
> > SLURM_STEP_LAUNCHER_PORT=33347
> 59,60c61,62
> < SLURM_TASK_PID=1381
> < SLURM_TASK_PID=2288
> ---
> > SLURM_TASK_PID=1592
> > SLURM_TASK_PID=2590
> [brent_at_node2 mpi]$
> [brent_at_node2 mpi]$
> [brent_at_node2 mpi]$ grep SLURM_PROCID srun.out
> SLURM_PROCID=0
> SLURM_PROCID=1
> [brent_at_node2 mpi]$ grep SLURM_PROCID mpirun.out
> SLURM_PROCID=0
> [brent_at_node2 mpi]$ grep SLURM_PROCID hpmpi.out
> SLURM_PROCID=0
> SLURM_PROCID=1
> [brent_at_node2 mpi]$
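
The raw diffs above mix changed values with variables that are missing outright. A minimal sketch for listing only the variable names that appear in one capture and not the other, assuming files in the `NAME=value` form of `srun.out` and `mpirun.out`; the helper name `missing_names` is hypothetical.

```shell
# missing_names FILE_A FILE_B
# Print variable names (the text before the first '=') that occur in FILE_A
# but never in FILE_B. Differences in values are ignored.
missing_names() {
    names_a=$(mktemp)
    names_b=$(mktemp)
    cut -d= -f1 "$1" | sort -u > "$names_a"
    cut -d= -f1 "$2" | sort -u > "$names_b"
    # comm -23 keeps only the lines unique to the first (sorted) file.
    comm -23 "$names_a" "$names_b"
    rm -f "$names_a" "$names_b"
}

# Example: missing_names srun.out mpirun.out
```

Run against the captures above, this should list names such as SLURM_LOCALID and SLURM_NODEID, which the diff shows only on the srun side, and skip names like SLURM_NNODES that are present in both files with different values.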
>
>
> > -----Original Message-----
> > From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
> > Behalf Of Jeff Squyres
> > Sent: Thursday, February 24, 2011 9:31 AM
> > To: Open MPI Users
> > Subject: Re: [OMPI users] SLURM environment variables at runtime
> >
> > The weird thing is that when running his test, he saw different results
> > with HP MPI vs. Open MPI.
> >
> > What his test didn't say was whether those were the same exact nodes or
> > not. It would be good to repeat my experiment with the same exact
> > nodes (e.g., inside one SLURM salloc job, or use the -w param to
> > specify the same nodes for salloc for OMPI and srun for HP MPI).
> >
> >
> > On Feb 24, 2011, at 10:02 AM, Ralph Castain wrote:
> >
> > > Like I said, this isn't an OMPI problem. You have your slurm
> > > configured to pass certain envars to the remote nodes, and Brent
> > > doesn't. It truly is just that simple.
> > >
> > > I've seen this before with other slurm installations. Which envars
> > > get set on the backend is configurable, that's all.
> > >
> > > Has nothing to do with OMPI.
> > >
> > >
> > > On Thu, Feb 24, 2011 at 7:18 AM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> > > I'm afraid I don't see the problem. Let's get 4 nodes from slurm:
> > >
> > > $ salloc -N 4
> > >
> > > Now let's run env and see what SLURM_ env variables we see:
> > >
> > > $ srun env | egrep ^SLURM_ | head
> > > SLURM_JOB_ID=95523
> > > SLURM_JOB_NUM_NODES=4
> > > SLURM_JOB_NODELIST=svbu-mpi[001-004]
> > > SLURM_JOB_CPUS_PER_NODE=4(x4)
> > > SLURM_JOBID=95523
> > > SLURM_NNODES=4
> > > SLURM_NODELIST=svbu-mpi[001-004]
> > > SLURM_TASKS_PER_NODE=1(x4)
> > > SLURM_PRIO_PROCESS=0
> > > SLURM_UMASK=0002
> > > $ srun env | egrep ^SLURM_ | wc -l
> > > 144
> > >
> > > Good -- there's 144 of them. Let's save them to a file for
> > > comparison later.
> > >
> > > $ srun env | egrep ^SLURM_ | sort > srun.out
> > >
> > > Now let's repeat the process with mpirun. Note that mpirun defaults
> > > to running one process per core (vs. srun's default of running one per
> > > node). So let's tone mpirun down to use one process per node and look
> > > for the SLURM_ env variables.
> > >
> > > $ mpirun -np 4 --bynode env | egrep ^SLURM_ | head
> > > SLURM_JOB_ID=95523
> > > SLURM_JOB_NUM_NODES=4
> > > SLURM_JOB_NODELIST=svbu-mpi[001-004]
> > > SLURM_JOB_ID=95523
> > > SLURM_JOB_NUM_NODES=4
> > > SLURM_JOB_CPUS_PER_NODE=4(x4)
> > > SLURM_JOBID=95523
> > > SLURM_NNODES=4
> > > SLURM_NODELIST=svbu-mpi[001-004]
> > > SLURM_TASKS_PER_NODE=1(x4)
> > > $ mpirun -np 4 --bynode env | egrep ^SLURM_ | wc -l
> > > 144
> > >
> > > Good -- we also got 144. Save them to a file.
> > >
> > > $ mpirun -np 4 --bynode env | egrep ^SLURM_ | sort > mpirun.out
> > >
> > > Now let's compare what we got from srun and from mpirun:
> > >
> > > $ diff srun.out mpirun.out
> > > 93,108c93,108
> > > < SLURM_SRUN_COMM_PORT=33571
> > > < SLURM_SRUN_COMM_PORT=33571
> > > < SLURM_SRUN_COMM_PORT=33571
> > > < SLURM_SRUN_COMM_PORT=33571
> > > < SLURM_STEP_ID=15
> > > < SLURM_STEP_ID=15
> > > < SLURM_STEP_ID=15
> > > < SLURM_STEP_ID=15
> > > < SLURM_STEPID=15
> > > < SLURM_STEPID=15
> > > < SLURM_STEPID=15
> > > < SLURM_STEPID=15
> > > < SLURM_STEP_LAUNCHER_PORT=33571
> > > < SLURM_STEP_LAUNCHER_PORT=33571
> > > < SLURM_STEP_LAUNCHER_PORT=33571
> > > < SLURM_STEP_LAUNCHER_PORT=33571
> > > ---
> > > > SLURM_SRUN_COMM_PORT=54184
> > > > SLURM_SRUN_COMM_PORT=54184
> > > > SLURM_SRUN_COMM_PORT=54184
> > > > SLURM_SRUN_COMM_PORT=54184
> > > > SLURM_STEP_ID=18
> > > > SLURM_STEP_ID=18
> > > > SLURM_STEP_ID=18
> > > > SLURM_STEP_ID=18
> > > > SLURM_STEPID=18
> > > > SLURM_STEPID=18
> > > > SLURM_STEPID=18
> > > > SLURM_STEPID=18
> > > > SLURM_STEP_LAUNCHER_PORT=54184
> > > > SLURM_STEP_LAUNCHER_PORT=54184
> > > > SLURM_STEP_LAUNCHER_PORT=54184
> > > > SLURM_STEP_LAUNCHER_PORT=54184
> > > 125,128c125,128
> > > < SLURM_TASK_PID=3899
> > > < SLURM_TASK_PID=3907
> > > < SLURM_TASK_PID=3908
> > > < SLURM_TASK_PID=3997
> > > ---
> > > > SLURM_TASK_PID=3924
> > > > SLURM_TASK_PID=3933
> > > > SLURM_TASK_PID=3934
> > > > SLURM_TASK_PID=4039
> > > $
> > >
> > > They're identical except for per-step values (ports, PIDs, etc.) --
> > > these differences are expected.
> > >
> > > What version of OMPI are you running? What happens if you repeat
> > > this experiment?
> > >
> > > I would find it very strange if Open MPI's mpirun is filtering some
> > > SLURM env variables to some processes and not to all -- your output
> > > shows disparate output between the different processes. That's just
> > > plain weird.
> > >
> > >
> > >
> > > On Feb 23, 2011, at 12:05 PM, Henderson, Brent wrote:
> > >
> > > > SLURM seems to be doing this in the case of a regular srun:
> > > >
> > > > [brent_at_node1 mpi]$ srun -N 2 -n 4 env | egrep SLURM_NODEID\|SLURM_PROCID\|SLURM_LOCALID | sort
> > > > SLURM_LOCALID=0
> > > > SLURM_LOCALID=0
> > > > SLURM_LOCALID=1
> > > > SLURM_LOCALID=1
> > > > SLURM_NODEID=0
> > > > SLURM_NODEID=0
> > > > SLURM_NODEID=1
> > > > SLURM_NODEID=1
> > > > SLURM_PROCID=0
> > > > SLURM_PROCID=1
> > > > SLURM_PROCID=2
> > > > SLURM_PROCID=3
> > > > [brent_at_node1 mpi]$
> > > >
> > > > Since srun is not supported currently by OpenMPI, I have to use
> > > > salloc - right? In this case, it is up to OpenMPI to interpret the
> > > > SLURM environment variables it sees in the one process that is launched
> > > > and 'do the right thing' - whatever that means in this case. How does
> > > > OpenMPI start the processes on the remote nodes under the covers? (use
> > > > srun, generate a hostfile and launch as you would outside SLURM, ...)
> > > > This may be the difference between HP-MPI and OpenMPI.
> > > >
> > > > Thanks,
> > > >
> > > > Brent
> > > >
> > > >
> > > > From: users-bounces_at_[hidden] [mailto:users-bounces_at_open-mpi.org] On Behalf Of Ralph Castain
> > > > Sent: Wednesday, February 23, 2011 10:07 AM
> > > > To: Open MPI Users
> > > > Subject: Re: [OMPI users] SLURM environment variables at runtime
> > > >
> > > > Resource managers generally frown on the idea of any program
> > > > passing RM-managed envars from one node to another, and this is
> > > > certainly true of slurm. The reason is that the RM reserves those
> > > > values for its own use when managing remote nodes. For example, if you
> > > > got an allocation and then used mpirun to launch a job across only a
> > > > portion of that allocation, and then ran another mpirun instance in
> > > > parallel on the remainder of the nodes, the slurm envars for those two
> > > > mpirun instances -need- to be quite different. Having mpirun forward
> > > > the values it sees would cause the system to become very confused.
> > > >
> > > > We learned the hard way never to cross that line :-(
> > > >
> > > > You have two options:
> > > >
> > > > (a) you could get your sys admin to configure slurm correctly to
> > > > provide your desired envars on the remote nodes. This is the
> > > > recommended (by slurm and other RMs) way of getting what you requested.
> > > > It is a simple configuration option - if he needs help, he should
> > > > contact the slurm mailing list
> > > >
> > > > (b) you can ask mpirun to do so, at your own risk. Specify each
> > > > parameter with a "-x FOO" argument. See "man mpirun" for details. Keep
> > > > an eye out for aberrant behavior.
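
For option (b), a small sketch of building the repeated "-x FOO" arguments from a list of names. The helper `build_x_args` is hypothetical; only the `-x` option itself comes from the message above. Keep in mind that this forwards the values seen by mpirun on the launching node, so per-process values such as SLURM_PROCID would be the same for every rank.

```shell
# build_x_args NAME...
# Print "-x NAME" for each given variable name, separated by single spaces.
build_x_args() {
    args=""
    for name in "$@"; do
        # Add a separating space only when something is already collected.
        args="${args:+$args }-x $name"
    done
    printf '%s\n' "$args"
}

# Example (inside a salloc allocation; names chosen only for illustration):
#   mpirun $(build_x_args SLURM_NNODES SLURM_NPROCS) -np 4 ./printenv.openmpi
```

The command substitution is left unquoted on purpose so the shell splits the result back into separate arguments; that is safe here because environment variable names contain no whitespace.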
> > > >
> > > > Ralph
> > > >
> > > >
> > > > On Wed, Feb 23, 2011 at 8:38 AM, Henderson, Brent <brent.henderson_at_[hidden]> wrote:
> > > > Hi Everyone, I have an OpenMPI/SLURM specific question,
> > > >
> > > > I'm using MPI as a launcher for another application I'm working on
> > > > and it is dependent on the SLURM environment variables making their way
> > > > into the a.out's environment. This works as I need if I use
> > > > HP-MPI/PMPI, but when I use OpenMPI, it appears that not all are set as I
> > > > would like across all of the ranks.
> > > >
> > > > I have example output below from a simple a.out that just writes
> > > > out the environment that it sees to a file whose name is based on the
> > > > node name and rank number. Note that with OpenMPI, things like
> > > > SLURM_NNODES and SLURM_TASKS_PER_NODE are not set the same for ranks on
> > > > the different nodes and things like SLURM_LOCALID are just missing
> > > > entirely.
> > > >
> > > > So the question is, should the environment variables on the remote
> > > > nodes (from the perspective of where the job is launched) have the full
> > > > set of SLURM environment variables as seen on the launching node?
> > > >
> > > > Thanks,
> > > >
> > > > Brent Henderson
> > > >
> > > > [brent_at_node2 mpi]$ rm node*
> > > > [brent_at_node2 mpi]$ mkdir openmpi hpmpi
> > > > [brent_at_node2 mpi]$ salloc -N 2 -n 4 mpirun ./printenv.openmpi
> > > > salloc: Granted job allocation 23
> > > > Hello world! I'm 3 of 4 on node1
> > > > Hello world! I'm 2 of 4 on node1
> > > > Hello world! I'm 1 of 4 on node2
> > > > Hello world! I'm 0 of 4 on node2
> > > > salloc: Relinquishing job allocation 23
> > > > [brent_at_node2 mpi]$ mv node* openmpi/
> > > > [brent_at_node2 mpi]$ egrep 'NODEID|NNODES|LOCALID|NODELIST|NPROCS|PROCID|TASKS_PER' openmpi/node1.3.of.4
> > > > SLURM_JOB_NODELIST=node[1-2]
> > > > SLURM_NNODES=1
> > > > SLURM_NODELIST=node[1-2]
> > > > SLURM_TASKS_PER_NODE=1
> > > > SLURM_NPROCS=1
> > > > SLURM_STEP_NODELIST=node1
> > > > SLURM_STEP_TASKS_PER_NODE=1
> > > > SLURM_NODEID=0
> > > > SLURM_PROCID=0
> > > > SLURM_LOCALID=0
> > > > [brent_at_node2 mpi]$ egrep 'NODEID|NNODES|LOCALID|NODELIST|NPROCS|PROCID|TASKS_PER' openmpi/node2.1.of.4
> > > > SLURM_JOB_NODELIST=node[1-2]
> > > > SLURM_NNODES=2
> > > > SLURM_NODELIST=node[1-2]
> > > > SLURM_TASKS_PER_NODE=2(x2)
> > > > SLURM_NPROCS=4
> > > > [brent_at_node2 mpi]$
> > > >
> > > >
> > > > [brent_at_node2 mpi]$ /opt/hpmpi/bin/mpirun -srun -N 2 -n 4 ./printenv.hpmpi
> > > > Hello world! I'm 2 of 4 on node2
> > > > Hello world! I'm 3 of 4 on node2
> > > > Hello world! I'm 0 of 4 on node1
> > > > Hello world! I'm 1 of 4 on node1
> > > > [brent_at_node2 mpi]$ mv node* hpmpi/
> > > > [brent_at_node2 mpi]$ egrep 'NODEID|NNODES|LOCALID|NODELIST|NPROCS|PROCID|TASKS_PER' hpmpi/node1.1.of.4
> > > > SLURM_NODELIST=node[1-2]
> > > > SLURM_TASKS_PER_NODE=2(x2)
> > > > SLURM_STEP_NODELIST=node[1-2]
> > > > SLURM_STEP_TASKS_PER_NODE=2(x2)
> > > > SLURM_NNODES=2
> > > > SLURM_NPROCS=4
> > > > SLURM_NODEID=0
> > > > SLURM_PROCID=1
> > > > SLURM_LOCALID=1
> > > > [brent_at_node2 mpi]$ egrep 'NODEID|NNODES|LOCALID|NODELIST|NPROCS|PROCID|TASKS_PER' hpmpi/node2.3.of.4
> > > > SLURM_NODELIST=node[1-2]
> > > > SLURM_TASKS_PER_NODE=2(x2)
> > > > SLURM_STEP_NODELIST=node[1-2]
> > > > SLURM_STEP_TASKS_PER_NODE=2(x2)
> > > > SLURM_NNODES=2
> > > > SLURM_NPROCS=4
> > > > SLURM_NODEID=1
> > > > SLURM_PROCID=3
> > > > SLURM_LOCALID=1
> > > > [brent_at_node2 mpi]$
> > > >
> > > > _______________________________________________
> > > > users mailing list
> > > > users_at_[hidden]
> > > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> > >
> > >
> > > --
> > > Jeff Squyres
> > > jsquyres_at_[hidden]
> > > For corporate legal information go to:
> > > http://www.cisco.com/web/about/doing_business/legal/cri/
> > >
> > >
> >
> >
> > --
> > Jeff Squyres
> > jsquyres_at_[hidden]
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> >
>