Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] SLURM environment variables at runtime
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-02-24 10:41:06


On Thu, Feb 24, 2011 at 8:30 AM, Jeff Squyres <jsquyres_at_[hidden]> wrote:

> The weird thing is that when running his test, he saw different results
> with HP MPI vs. Open MPI.
>

It sounds quite likely that HP MPI is picking up and forwarding the envars
itself - that possibility was implied, but not clearly stated.

>
> What his test didn't say was whether those were the same exact nodes or
> not. It would be good to repeat my experiment with the same exact nodes
> (e.g., inside one SLURM salloc job, or use the -w param to specify the same
> nodes for salloc for OMPI and srun for HP MPI).
>

We should note that you -can- directly srun an OMPI job now. I believe that
capability was released in the 1.5 series. It takes a minimum slurm release
level plus a slurm configuration setting to do so.

>
>
> On Feb 24, 2011, at 10:02 AM, Ralph Castain wrote:
>
> > Like I said, this isn't an OMPI problem. You have your slurm configured
> > to pass certain envars to the remote nodes, and Brent doesn't. It truly is
> > just that simple.
> >
> > I've seen this before with other slurm installations. Which envars get
> > set on the backend is configurable, that's all.
> >
> > Has nothing to do with OMPI.
> >
> >
> > On Thu, Feb 24, 2011 at 7:18 AM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> > I'm afraid I don't see the problem. Let's get 4 nodes from slurm:
> >
> > $ salloc -N 4
> >
> > Now let's run env and see what SLURM_ env variables we see:
> >
> > $ srun env | egrep ^SLURM_ | head
> > SLURM_JOB_ID=95523
> > SLURM_JOB_NUM_NODES=4
> > SLURM_JOB_NODELIST=svbu-mpi[001-004]
> > SLURM_JOB_CPUS_PER_NODE=4(x4)
> > SLURM_JOBID=95523
> > SLURM_NNODES=4
> > SLURM_NODELIST=svbu-mpi[001-004]
> > SLURM_TASKS_PER_NODE=1(x4)
> > SLURM_PRIO_PROCESS=0
> > SLURM_UMASK=0002
> > $ srun env | egrep ^SLURM_ | wc -l
> > 144
> >
> > Good -- there's 144 of them. Let's save them to a file for comparison,
> > later.
> >
> > $ srun env | egrep ^SLURM_ | sort > srun.out
> >
> > Now let's repeat the process with mpirun. Note that mpirun defaults to
> > running one process per core (vs. srun's default of running one per node).
> > So let's tone mpirun down to use one process per node and look for the
> > SLURM_ env variables.
> >
> > $ mpirun -np 4 --bynode env | egrep ^SLURM_ | head
> > SLURM_JOB_ID=95523
> > SLURM_JOB_NUM_NODES=4
> > SLURM_JOB_NODELIST=svbu-mpi[001-004]
> > SLURM_JOB_ID=95523
> > SLURM_JOB_NUM_NODES=4
> > SLURM_JOB_CPUS_PER_NODE=4(x4)
> > SLURM_JOBID=95523
> > SLURM_NNODES=4
> > SLURM_NODELIST=svbu-mpi[001-004]
> > SLURM_TASKS_PER_NODE=1(x4)
> > $ mpirun -np 4 --bynode env | egrep ^SLURM_ | wc -l
> > 144
> >
> > Good -- we also got 144. Save them to a file.
> >
> > $ mpirun -np 4 --bynode env | egrep ^SLURM_ | sort > mpirun.out
> >
> > Now let's compare what we got from srun and from mpirun:
> >
> > $ diff srun.out mpirun.out
> > 93,108c93,108
> > < SLURM_SRUN_COMM_PORT=33571
> > < SLURM_SRUN_COMM_PORT=33571
> > < SLURM_SRUN_COMM_PORT=33571
> > < SLURM_SRUN_COMM_PORT=33571
> > < SLURM_STEP_ID=15
> > < SLURM_STEP_ID=15
> > < SLURM_STEP_ID=15
> > < SLURM_STEP_ID=15
> > < SLURM_STEPID=15
> > < SLURM_STEPID=15
> > < SLURM_STEPID=15
> > < SLURM_STEPID=15
> > < SLURM_STEP_LAUNCHER_PORT=33571
> > < SLURM_STEP_LAUNCHER_PORT=33571
> > < SLURM_STEP_LAUNCHER_PORT=33571
> > < SLURM_STEP_LAUNCHER_PORT=33571
> > ---
> > > SLURM_SRUN_COMM_PORT=54184
> > > SLURM_SRUN_COMM_PORT=54184
> > > SLURM_SRUN_COMM_PORT=54184
> > > SLURM_SRUN_COMM_PORT=54184
> > > SLURM_STEP_ID=18
> > > SLURM_STEP_ID=18
> > > SLURM_STEP_ID=18
> > > SLURM_STEP_ID=18
> > > SLURM_STEPID=18
> > > SLURM_STEPID=18
> > > SLURM_STEPID=18
> > > SLURM_STEPID=18
> > > SLURM_STEP_LAUNCHER_PORT=54184
> > > SLURM_STEP_LAUNCHER_PORT=54184
> > > SLURM_STEP_LAUNCHER_PORT=54184
> > > SLURM_STEP_LAUNCHER_PORT=54184
> > 125,128c125,128
> > < SLURM_TASK_PID=3899
> > < SLURM_TASK_PID=3907
> > < SLURM_TASK_PID=3908
> > < SLURM_TASK_PID=3997
> > ---
> > > SLURM_TASK_PID=3924
> > > SLURM_TASK_PID=3933
> > > SLURM_TASK_PID=3934
> > > SLURM_TASK_PID=4039
> > $
> >
> > They're identical except for per-step values (ports, PIDs, etc.) -- these
> > differences are expected.
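
The comparison above can be repeated without reading the diff by eye. What follows is only a sketch: `compare_slurm_env` is a hypothetical helper name, and the ignore list covers just the per-step variables that show up in the diff in this thread (other sites may need more entries).

```shell
#!/bin/sh
# Compare two SLURM_ environment dumps (lines of the form NAME=value) while
# ignoring variables that are expected to change from one job step to the next.
# The ignore list below is an assumption taken from the diff in this thread.
compare_slurm_env() {
    ignore='^SLURM_(SRUN_COMM_PORT|STEP_ID|STEPID|STEP_LAUNCHER_PORT|TASK_PID)='
    a=$(mktemp) || return 2
    b=$(mktemp) || { rm -f "$a"; return 2; }
    # Drop the per-step variables, then sort so that diff sees a stable order.
    grep -Ev "$ignore" "$1" | sort > "$a"
    grep -Ev "$ignore" "$2" | sort > "$b"
    status=0
    diff "$a" "$b" || status=$?
    rm -f "$a" "$b"
    # 0 means only per-step values differed; 1 means something else did.
    return "$status"
}
```

With the two files saved earlier, `compare_slurm_env srun.out mpirun.out` should print nothing and return 0 if only the per-step values differ.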
> >
> > What version of OMPI are you running? What happens if you repeat this
> > experiment?
> >
> > I would find it very strange if Open MPI's mpirun is filtering some SLURM
> > env variables to some processes and not to all -- your output shows
> > disparate output between the different processes. That's just plain weird.
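
One way to check for that kind of disparity is to list the SLURM_ variables that are missing from some per-rank dumps or carry different values across them. This is a rough sketch: `inconsistent_vars` is a hypothetical helper, and it assumes each file holds one process's environment as NAME=value lines. Per-task variables such as SLURM_PROCID will always be reported, since they differ by design.

```shell
#!/bin/sh
# Print the names of SLURM_ variables that are absent from at least one of
# the given dump files, or that have more than one distinct value across them.
inconsistent_vars() {
    awk -F= -v nfiles="$#" '
        /^SLURM_/ {
            name = $1
            value = substr($0, length(name) + 2)
            # Count how many files mention this variable.
            if (!((name, FILENAME) in seen)) {
                seen[name, FILENAME] = 1
                files[name]++
            }
            # Count how many distinct values it takes.
            if (!((name, value) in vals)) {
                vals[name, value] = 1
                distinct[name]++
            }
        }
        END {
            for (name in files)
                if (files[name] < nfiles || distinct[name] > 1)
                    print name
        }' "$@" | LC_ALL=C sort
}
```

For example, `inconsistent_vars openmpi/node*` would run it over the files produced by the test program quoted below.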
> >
> >
> >
> > On Feb 23, 2011, at 12:05 PM, Henderson, Brent wrote:
> >
> > > SLURM seems to be doing this in the case of a regular srun:
> > >
> > > [brent_at_node1 mpi]$ srun -N 2 -n 4 env | egrep SLURM_NODEID\|SLURM_PROCID\|SLURM_LOCALID | sort
> > > SLURM_LOCALID=0
> > > SLURM_LOCALID=0
> > > SLURM_LOCALID=1
> > > SLURM_LOCALID=1
> > > SLURM_NODEID=0
> > > SLURM_NODEID=0
> > > SLURM_NODEID=1
> > > SLURM_NODEID=1
> > > SLURM_PROCID=0
> > > SLURM_PROCID=1
> > > SLURM_PROCID=2
> > > SLURM_PROCID=3
> > > [brent_at_node1 mpi]$
> > >
> > > Since srun is not supported currently by OpenMPI, I have to use salloc
> > > – right? In this case, it is up to OpenMPI to interpret the SLURM
> > > environment variables it sees in the one process that is launched and ‘do
> > > the right thing’ – whatever that means in this case. How does OpenMPI start
> > > the processes on the remote nodes under the covers? (use srun, generate a
> > > hostfile and launch as you would outside SLURM, …) This may be the
> > > difference between HP-MPI and OpenMPI.
> > >
> > > Thanks,
> > >
> > > Brent
> > >
> > >
> > > From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf Of Ralph Castain
> > > Sent: Wednesday, February 23, 2011 10:07 AM
> > > To: Open MPI Users
> > > Subject: Re: [OMPI users] SLURM environment variables at runtime
> > >
> > > Resource managers generally frown on the idea of any program passing
> > > RM-managed envars from one node to another, and this is certainly true of
> > > slurm. The reason is that the RM reserves those values for its own use when
> > > managing remote nodes. For example, if you got an allocation and then used
> > > mpirun to launch a job across only a portion of that allocation, and then
> > > ran another mpirun instance in parallel on the remainder of the nodes, the
> > > slurm envars for those two mpirun instances -need- to be quite different.
> > > Having mpirun forward the values it sees would cause the system to become
> > > very confused.
> > >
> > > We learned the hard way never to cross that line :-(
> > >
> > > You have two options:
> > >
> > > (a) you could get your sys admin to configure slurm correctly to
> > > provide your desired envars on the remote nodes. This is the recommended (by
> > > slurm and other RMs) way of getting what you requested. It is a simple
> > > configuration option - if he needs help, he should contact the slurm mailing
> > > list.
> > >
> > > (b) you can ask mpirun to do so, at your own risk. Specify each
> > > parameter with a "-x FOO" argument. See "man mpirun" for details. Keep an
> > > eye out for aberrant behavior.
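
A minimal sketch of option (b), assuming you only want to forward a short, explicit list of variables. `x_args_for` is a hypothetical helper, not part of Open MPI; it only builds the "-x FOO" arguments for names that are actually exported in the shell where mpirun will run.

```shell
#!/bin/sh
# Emit "-x NAME" for each requested variable that is present in the current
# environment, so the result can be placed on an mpirun command line.
x_args_for() {
    for name in "$@"; do
        # Skip names that are not exported; there would be nothing to forward.
        if env | grep "^${name}=" > /dev/null; then
            printf '%s %s ' -x "$name"
        fi
    done
}
```

It might be used as `mpirun $(x_args_for SLURM_NNODES SLURM_TASKS_PER_NODE) ./printenv.openmpi`, leaving the substitution unquoted so that it splits into separate arguments. Keeping the list explicit rather than forwarding every SLURM_ variable stays closer to the caution above about values the RM reserves for itself.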
> > >
> > > Ralph
> > >
> > >
> > > On Wed, Feb 23, 2011 at 8:38 AM, Henderson, Brent <brent.henderson_at_[hidden]> wrote:
> > > Hi Everyone, I have an OpenMPI/SLURM specific question,
> > >
> > > I’m using MPI as a launcher for another application I’m working on and
> > > it is dependent on the SLURM environment variables making their way into the
> > > a.out’s environment. This works as I need if I use HP-MPI/PMPI, but when I
> > > use OpenMPI, it appears that not all are set as I would like across all of
> > > the ranks.
> > >
> > > I have example output below from a simple a.out that just writes out
> > > the environment that it sees to a file whose name is based on the node name
> > > and rank number. Note that with OpenMPI, things like SLURM_NNODES and
> > > SLURM_TASKS_PER_NODE are not set the same for ranks on the different nodes
> > > and things like SLURM_LOCALID are just missing entirely.
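
The test program itself is not shown in this thread. As a rough stand-in, the same kind of per-rank dump can be sketched in shell, assuming the processes are started by Open MPI's mpirun, which exports OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE to each of them. `dump_slurm_env` is a hypothetical name, and the file naming only imitates the node1.3.of.4 pattern used below.

```shell
#!/bin/sh
# Write the SLURM_ variables seen by this process to <node>.<rank>.of.<size>
# in the directory given as $1 (default: current directory), and print the
# name of the file that was written.
dump_slurm_env() {
    # Rank and size come from Open MPI; fall back to 0 of 1 when run by hand.
    rank=${OMPI_COMM_WORLD_RANK:-0}
    size=${OMPI_COMM_WORLD_SIZE:-1}
    node=$(uname -n | cut -d. -f1)
    out="${1:-.}/$node.$rank.of.$size"
    env | grep '^SLURM_' | sort > "$out"
    echo "$out"
}
```

Called from a small wrapper script that mpirun launches in place of the a.out, this leaves one file per rank that can then be inspected with egrep as in the session below.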
> > >
> > > So the question is, should the environment variables on the remote
> > > nodes (from the perspective of where the job is launched) have the full set
> > > of SLURM environment variables as seen on the launching node?
> > >
> > > Thanks,
> > >
> > > Brent Henderson
> > >
> > > [brent_at_node2 mpi]$ rm node*
> > > [brent_at_node2 mpi]$ mkdir openmpi hpmpi
> > > [brent_at_node2 mpi]$ salloc -N 2 -n 4 mpirun ./printenv.openmpi
> > > salloc: Granted job allocation 23
> > > Hello world! I'm 3 of 4 on node1
> > > Hello world! I'm 2 of 4 on node1
> > > Hello world! I'm 1 of 4 on node2
> > > Hello world! I'm 0 of 4 on node2
> > > salloc: Relinquishing job allocation 23
> > > [brent_at_node2 mpi]$ mv node* openmpi/
> > > [brent_at_node2 mpi]$ egrep 'NODEID|NNODES|LOCALID|NODELIST|NPROCS|PROCID|TASKS_PER' openmpi/node1.3.of.4
> > > SLURM_JOB_NODELIST=node[1-2]
> > > SLURM_NNODES=1
> > > SLURM_NODELIST=node[1-2]
> > > SLURM_TASKS_PER_NODE=1
> > > SLURM_NPROCS=1
> > > SLURM_STEP_NODELIST=node1
> > > SLURM_STEP_TASKS_PER_NODE=1
> > > SLURM_NODEID=0
> > > SLURM_PROCID=0
> > > SLURM_LOCALID=0
> > > [brent_at_node2 mpi]$ egrep 'NODEID|NNODES|LOCALID|NODELIST|NPROCS|PROCID|TASKS_PER' openmpi/node2.1.of.4
> > > SLURM_JOB_NODELIST=node[1-2]
> > > SLURM_NNODES=2
> > > SLURM_NODELIST=node[1-2]
> > > SLURM_TASKS_PER_NODE=2(x2)
> > > SLURM_NPROCS=4
> > > [brent_at_node2 mpi]$
> > >
> > >
> > > [brent_at_node2 mpi]$ /opt/hpmpi/bin/mpirun -srun -N 2 -n 4 ./printenv.hpmpi
> > > Hello world! I'm 2 of 4 on node2
> > > Hello world! I'm 3 of 4 on node2
> > > Hello world! I'm 0 of 4 on node1
> > > Hello world! I'm 1 of 4 on node1
> > > [brent_at_node2 mpi]$ mv node* hpmpi/
> > > [brent_at_node2 mpi]$ egrep 'NODEID|NNODES|LOCALID|NODELIST|NPROCS|PROCID|TASKS_PER' hpmpi/node1.1.of.4
> > > SLURM_NODELIST=node[1-2]
> > > SLURM_TASKS_PER_NODE=2(x2)
> > > SLURM_STEP_NODELIST=node[1-2]
> > > SLURM_STEP_TASKS_PER_NODE=2(x2)
> > > SLURM_NNODES=2
> > > SLURM_NPROCS=4
> > > SLURM_NODEID=0
> > > SLURM_PROCID=1
> > > SLURM_LOCALID=1
> > > [brent_at_node2 mpi]$ egrep 'NODEID|NNODES|LOCALID|NODELIST|NPROCS|PROCID|TASKS_PER' hpmpi/node2.3.of.4
> > > SLURM_NODELIST=node[1-2]
> > > SLURM_TASKS_PER_NODE=2(x2)
> > > SLURM_STEP_NODELIST=node[1-2]
> > > SLURM_STEP_TASKS_PER_NODE=2(x2)
> > > SLURM_NNODES=2
> > > SLURM_NPROCS=4
> > > SLURM_NODEID=1
> > > SLURM_PROCID=3
> > > SLURM_LOCALID=1
> > > [brent_at_node2 mpi]$
> > >
> > > _______________________________________________
> > > users mailing list
> > > users_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> > >
> >
> >
> > --
> > Jeff Squyres
> > jsquyres_at_[hidden]
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> >
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]