Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] SLURM environment variables at runtime
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2011-02-24 11:16:42


FWIW, I'm running Slurm 2.1.0 -- I haven't updated to 2.2.x yet.

Just to be sure, I re-ran my test with OMPI 1.4.3 (I was using the OMPI development SVN trunk before) and got the same results:

----
$ srun env | egrep ^SLURM_ | wc -l
144
$ mpirun -np 4 --bynode env | egrep ^SLURM_ | wc -l
144
----

I find it strange that "srun env ..." and HPMPI's "mpirun env..." return (effectively) the same results, but OMPI's "mpirun env ..." returns something different.

Perhaps SLURM changed something in 2.2.x...?  As Ralph mentioned, OMPI *shouldn't* be altering the environment w.r.t. SLURM variables that you get -- whatever SLURM sets, that's what you should get in an OMPI-launched process.

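Since "wc -l" counts lines summed over every process, it may be easier to compare the distinct variable *names* each launcher delivers. Here is a minimal sketch, assuming the env dumps were saved to files the way srun.out and mpirun.out are saved in the quoted transcripts; the function names slurm_var_names and missing_vars are made up for illustration and are not part of SLURM or Open MPI.

```shell
#!/bin/sh
# Hypothetical helper: list the distinct SLURM_ variable names found in an
# env dump, ignoring values and per-process duplicates.
slurm_var_names() {
    grep '^SLURM_' "$1" | cut -d= -f1 | sort -u
}

# Hypothetical helper: print the names that appear in the first dump but
# not in the second (e.g. srun.out vs. mpirun.out).
missing_vars() {
    a=$(mktemp) && b=$(mktemp) || return 1
    slurm_var_names "$1" > "$a"
    slurm_var_names "$2" > "$b"
    comm -23 "$a" "$b"    # lines unique to the first (sorted) list
    rm -f "$a" "$b"
}
```

Running "missing_vars srun.out mpirun.out" would then list only the variables that srun provides and mpirun does not, without the noise from values that legitimately differ per step (ports, PIDs, step IDs).
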
On Feb 24, 2011, at 10:55 AM, Henderson, Brent wrote:
> I'm running OpenMPI v1.4.3 and slurm v2.2.1.  I built both with the default configuration except setting the prefix.  The tests were run on the exact same nodes (I only have two).
> 
> When I run the test you outline below, I am still missing a bunch of env variables with OpenMPI.  I ran the extra test of using HP-MPI and they are all present as with the srun invocation.  I don't know if this is my slurm setup or not, but I find this really weird.  If anyone knows the magic to make the fix that Ralph is referring to, I'd appreciate a pointer.
> 
> My guess was that there is a subtle way that the launch differs between the two products.  But, since it works for Jeff, maybe there really is a slurm option that I need to compile in or set to make this work the way I want.  It is not as simple as HP-MPI moving the environment variables itself as some of the numbers will change per process created on the remote nodes.
> 
> Thanks,
> 
> Brent
> 
> [brent_at_node2 mpi]$ salloc -N 2
> salloc: Granted job allocation 29
> [brent_at_node2 mpi]$ srun env | egrep ^SLURM_ | head
> SLURM_NODELIST=node[1-2]
> SLURM_NNODES=2
> SLURM_JOBID=29
> SLURM_TASKS_PER_NODE=1(x2)
> SLURM_JOB_ID=29
> SLURM_NODELIST=node[1-2]
> SLURM_NNODES=2
> SLURM_JOBID=29
> SLURM_TASKS_PER_NODE=1(x2)
> SLURM_JOB_ID=29
> [brent_at_node2 mpi]$ srun env | egrep ^SLURM_ | wc -l
> 66
> [brent_at_node2 mpi]$ srun env | egrep ^SLURM_ | sort > srun.out
> [brent_at_node2 mpi]$ which mpirun
> ~/bin/openmpi143/bin/mpirun
> [brent_at_node2 mpi]$ mpirun -np 2 --bynode env | egrep ^SLURM_ | head
> SLURM_NODELIST=node[1-2]
> SLURM_NNODES=2
> SLURM_JOBID=29
> SLURM_TASKS_PER_NODE=8(x2)
> SLURM_JOB_ID=29
> SLURM_SUBMIT_DIR=/mnt/node1/home/brent/src/mpi
> SLURM_JOB_NODELIST=node[1-2]
> SLURM_JOB_CPUS_PER_NODE=8(x2)
> SLURM_JOB_NUM_NODES=2
> SLURM_NODELIST=node[1-2]
> [brent_at_node2 mpi]$ which mpirun
> ~/bin/openmpi143/bin/mpirun
> [brent_at_node2 mpi]$ mpirun -np 2 --bynode env | egrep ^SLURM_ | wc -l
> 42  <-- note, not 66 as above!
> [brent_at_node2 mpi]$ mpirun -np 2 --bynode env | egrep ^SLURM_ | sort > mpirun.out
> [brent_at_node2 mpi]$ diff srun.out mpirun.out
> 2d1
> < SLURM_CHECKPOINT_IMAGE_DIR=/mnt/node1/home/brent/src/mpi
> 4,5d2
> < SLURM_CPUS_ON_NODE=8
> < SLURM_CPUS_PER_TASK=1
> 8d4
> < SLURM_DISTRIBUTION=cyclic
> 10d5
> < SLURM_GTIDS=1
> 22,23d16
> < SLURM_LAUNCH_NODE_IPADDR=10.0.205.134
> < SLURM_LOCALID=0
> 25c18
> < SLURM_NNODES=2
> ---
> > SLURM_NNODES=1
> 28d20
> < SLURM_NODEID=1
> 31,35c23,24
> < SLURM_NPROCS=2
> < SLURM_NPROCS=2
> < SLURM_NTASKS=2
> < SLURM_NTASKS=2
> < SLURM_PRIO_PROCESS=0
> ---
> > SLURM_NPROCS=1
> > SLURM_NTASKS=1
> 38d26
> < SLURM_PROCID=1
> 40,56c28,35
> < SLURM_SRUN_COMM_HOST=10.0.205.134
> < SLURM_SRUN_COMM_PORT=43247
> < SLURM_SRUN_COMM_PORT=43247
> < SLURM_STEP_ID=2
> < SLURM_STEP_ID=2
> < SLURM_STEPID=2
> < SLURM_STEPID=2
> < SLURM_STEP_LAUNCHER_PORT=43247
> < SLURM_STEP_LAUNCHER_PORT=43247
> < SLURM_STEP_NODELIST=node[1-2]
> < SLURM_STEP_NODELIST=node[1-2]
> < SLURM_STEP_NUM_NODES=2
> < SLURM_STEP_NUM_NODES=2
> < SLURM_STEP_NUM_TASKS=2
> < SLURM_STEP_NUM_TASKS=2
> < SLURM_STEP_TASKS_PER_NODE=1(x2)
> < SLURM_STEP_TASKS_PER_NODE=1(x2)
> ---
> > SLURM_SRUN_COMM_PORT=45154
> > SLURM_STEP_ID=5
> > SLURM_STEPID=5
> > SLURM_STEP_LAUNCHER_PORT=45154
> > SLURM_STEP_NODELIST=node1
> > SLURM_STEP_NUM_NODES=1
> > SLURM_STEP_NUM_TASKS=1
> > SLURM_STEP_TASKS_PER_NODE=1
> 59,62c38,40
> < SLURM_TASK_PID=1381
> < SLURM_TASK_PID=2288
> < SLURM_TASKS_PER_NODE=1(x2)
> < SLURM_TASKS_PER_NODE=1(x2)
> ---
> > SLURM_TASK_PID=1429
> > SLURM_TASKS_PER_NODE=1
> > SLURM_TASKS_PER_NODE=8(x2)
> 64,65d41
> < SLURM_TOPOLOGY_ADDR=node2
> < SLURM_TOPOLOGY_ADDR_PATTERN=node
> [brent_at_node2 mpi]$
> [brent_at_node2 mpi]$
> [brent_at_node2 mpi]$
> [brent_at_node2 mpi]$
> [brent_at_node2 mpi]$ /opt/hpmpi/bin/mpirun -srun -n 2 -N 2 env | egrep ^SLURM_ | sort > hpmpi.out
> [brent_at_node2 mpi]$ diff srun.out hpmpi.out
> 20a21,22
> > SLURM_KILL_BAD_EXIT=1
> > SLURM_KILL_BAD_EXIT=1
> 41,48c43,50
> < SLURM_SRUN_COMM_PORT=43247
> < SLURM_SRUN_COMM_PORT=43247
> < SLURM_STEP_ID=2
> < SLURM_STEP_ID=2
> < SLURM_STEPID=2
> < SLURM_STEPID=2
> < SLURM_STEP_LAUNCHER_PORT=43247
> < SLURM_STEP_LAUNCHER_PORT=43247
> ---
> > SLURM_SRUN_COMM_PORT=33347
> > SLURM_SRUN_COMM_PORT=33347
> > SLURM_STEP_ID=8
> > SLURM_STEP_ID=8
> > SLURM_STEPID=8
> > SLURM_STEPID=8
> > SLURM_STEP_LAUNCHER_PORT=33347
> > SLURM_STEP_LAUNCHER_PORT=33347
> 59,60c61,62
> < SLURM_TASK_PID=1381
> < SLURM_TASK_PID=2288
> ---
> > SLURM_TASK_PID=1592
> > SLURM_TASK_PID=2590
> [brent_at_node2 mpi]$
> [brent_at_node2 mpi]$
> [brent_at_node2 mpi]$ grep SLURM_PROCID srun.out
> SLURM_PROCID=0
> SLURM_PROCID=1
> [brent_at_node2 mpi]$ grep SLURM_PROCID mpirun.out
> SLURM_PROCID=0
> [brent_at_node2 mpi]$ grep SLURM_PROCID hpmpi.out
> SLURM_PROCID=0
> SLURM_PROCID=1
> [brent_at_node2 mpi]$
> 
> 
>> -----Original Message-----
>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
>> Behalf Of Jeff Squyres
>> Sent: Thursday, February 24, 2011 9:31 AM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] SLURM environment variables at runtime
>> 
>> The weird thing is that when running his test, he saw different results
>> with HP MPI vs. Open MPI.
>> 
>> What his test didn't say was whether those were the same exact nodes or
>> not.  It would be good to repeat my experiment with the same exact
>> nodes (e.g., inside one SLURM salloc job, or use the -w param to
>> specify the same nodes for salloc for OMPI and srun for HP MPI).
>> 
>> 
>> On Feb 24, 2011, at 10:02 AM, Ralph Castain wrote:
>> 
>>> Like I said, this isn't an OMPI problem. You have your slurm configured to pass certain envars to the remote nodes, and Brent doesn't. It truly is just that simple.
>>> 
>>> I've seen this before with other slurm installations. Which envars get set on the backend is configurable, that's all.
>>> 
>>> Has nothing to do with OMPI.
>>> 
>>> 
>>> On Thu, Feb 24, 2011 at 7:18 AM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>>> I'm afraid I don't see the problem.  Let's get 4 nodes from slurm:
>>> 
>>> $ salloc -N 4
>>> 
>>> Now let's run env and see what SLURM_ env variables we see:
>>> 
>>> $ srun env | egrep ^SLURM_ | head
>>> SLURM_JOB_ID=95523
>>> SLURM_JOB_NUM_NODES=4
>>> SLURM_JOB_NODELIST=svbu-mpi[001-004]
>>> SLURM_JOB_CPUS_PER_NODE=4(x4)
>>> SLURM_JOBID=95523
>>> SLURM_NNODES=4
>>> SLURM_NODELIST=svbu-mpi[001-004]
>>> SLURM_TASKS_PER_NODE=1(x4)
>>> SLURM_PRIO_PROCESS=0
>>> SLURM_UMASK=0002
>>> $ srun env | egrep ^SLURM_ | wc -l
>>> 144
>>> 
>>> Good -- there's 144 of them.  Let's save them to a file for comparison, later.
>>> 
>>> $ srun env | egrep ^SLURM_ | sort > srun.out
>>> 
>>> Now let's repeat the process with mpirun.  Note that mpirun defaults to running one process per core (vs. srun's default of running one per node).  So let's tone mpirun down to use one process per node and look for the SLURM_ env variables.
>>> 
>>> $ mpirun -np 4 --bynode env | egrep ^SLURM_ | head
>>> SLURM_JOB_ID=95523
>>> SLURM_JOB_NUM_NODES=4
>>> SLURM_JOB_NODELIST=svbu-mpi[001-004]
>>> SLURM_JOB_ID=95523
>>> SLURM_JOB_NUM_NODES=4
>>> SLURM_JOB_CPUS_PER_NODE=4(x4)
>>> SLURM_JOBID=95523
>>> SLURM_NNODES=4
>>> SLURM_NODELIST=svbu-mpi[001-004]
>>> SLURM_TASKS_PER_NODE=1(x4)
>>> $ mpirun -np 4 --bynode env | egrep ^SLURM_ | wc -l
>>> 144
>>> 
>>> Good -- we also got 144.  Save them to a file.
>>> 
>>> $ mpirun -np 4 --bynode env | egrep ^SLURM_ | sort > mpirun.out
>>> 
>>> Now let's compare what we got from srun and from mpirun:
>>> 
>>> $ diff srun.out mpirun.out
>>> 93,108c93,108
>>> < SLURM_SRUN_COMM_PORT=33571
>>> < SLURM_SRUN_COMM_PORT=33571
>>> < SLURM_SRUN_COMM_PORT=33571
>>> < SLURM_SRUN_COMM_PORT=33571
>>> < SLURM_STEP_ID=15
>>> < SLURM_STEP_ID=15
>>> < SLURM_STEP_ID=15
>>> < SLURM_STEP_ID=15
>>> < SLURM_STEPID=15
>>> < SLURM_STEPID=15
>>> < SLURM_STEPID=15
>>> < SLURM_STEPID=15
>>> < SLURM_STEP_LAUNCHER_PORT=33571
>>> < SLURM_STEP_LAUNCHER_PORT=33571
>>> < SLURM_STEP_LAUNCHER_PORT=33571
>>> < SLURM_STEP_LAUNCHER_PORT=33571
>>> ---
>>> > SLURM_SRUN_COMM_PORT=54184
>>> > SLURM_SRUN_COMM_PORT=54184
>>> > SLURM_SRUN_COMM_PORT=54184
>>> > SLURM_SRUN_COMM_PORT=54184
>>> > SLURM_STEP_ID=18
>>> > SLURM_STEP_ID=18
>>> > SLURM_STEP_ID=18
>>> > SLURM_STEP_ID=18
>>> > SLURM_STEPID=18
>>> > SLURM_STEPID=18
>>> > SLURM_STEPID=18
>>> > SLURM_STEPID=18
>>> > SLURM_STEP_LAUNCHER_PORT=54184
>>> > SLURM_STEP_LAUNCHER_PORT=54184
>>> > SLURM_STEP_LAUNCHER_PORT=54184
>>> > SLURM_STEP_LAUNCHER_PORT=54184
>>> 125,128c125,128
>>> < SLURM_TASK_PID=3899
>>> < SLURM_TASK_PID=3907
>>> < SLURM_TASK_PID=3908
>>> < SLURM_TASK_PID=3997
>>> ---
>>> > SLURM_TASK_PID=3924
>>> > SLURM_TASK_PID=3933
>>> > SLURM_TASK_PID=3934
>>> > SLURM_TASK_PID=4039
>>> $
>>> 
>>> They're identical except for per-step values (ports, PIDs, etc.) -- these differences are expected.
>>> 
>>> What version of OMPI are you running?  What happens if you repeat this experiment?
>>> 
>>> I would find it very strange if Open MPI's mpirun is filtering some SLURM env variables to some processes and not to all -- your output shows disparate output between the different processes.  That's just plain weird.
>>> 
>>> 
>>> 
>>> On Feb 23, 2011, at 12:05 PM, Henderson, Brent wrote:
>>> 
>>>> SLURM seems to be doing this in the case of a regular srun:
>>>> 
>>>> [brent_at_node1 mpi]$ srun -N 2 -n 4 env | egrep SLURM_NODEID\|SLURM_PROCID\|SLURM_LOCALID | sort
>>>> SLURM_LOCALID=0
>>>> SLURM_LOCALID=0
>>>> SLURM_LOCALID=1
>>>> SLURM_LOCALID=1
>>>> SLURM_NODEID=0
>>>> SLURM_NODEID=0
>>>> SLURM_NODEID=1
>>>> SLURM_NODEID=1
>>>> SLURM_PROCID=0
>>>> SLURM_PROCID=1
>>>> SLURM_PROCID=2
>>>> SLURM_PROCID=3
>>>> [brent_at_node1 mpi]$
>>>> 
>>>> Since srun is not supported currently by OpenMPI, I have to use salloc - right?  In this case, it is up to OpenMPI to interpret the SLURM environment variables it sees in the one process that is launched and 'do the right thing' - whatever that means in this case.  How does OpenMPI start the processes on the remote nodes under the covers?  (use srun, generate a hostfile and launch as you would outside SLURM, ...)  This may be the difference between HP-MPI and OpenMPI.
>>>> 
>>>> Thanks,
>>>> 
>>>> Brent
>>>> 
>>>> 
>>>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_open-mpi.org] On Behalf Of Ralph Castain
>>>> Sent: Wednesday, February 23, 2011 10:07 AM
>>>> To: Open MPI Users
>>>> Subject: Re: [OMPI users] SLURM environment variables at runtime
>>>> 
>>>> Resource managers generally frown on the idea of any program passing RM-managed envars from one node to another, and this is certainly true of slurm. The reason is that the RM reserves those values for its own use when managing remote nodes. For example, if you got an allocation and then used mpirun to launch a job across only a portion of that allocation, and then ran another mpirun instance in parallel on the remainder of the nodes, the slurm envars for those two mpirun instances -need- to be quite different. Having mpirun forward the values it sees would cause the system to become very confused.
>>>> 
>>>> We learned the hard way never to cross that line :-(
>>>> 
>>>> You have two options:
>>>> 
>>>> (a) you could get your sys admin to configure slurm correctly to provide your desired envars on the remote nodes. This is the recommended (by slurm and other RMs) way of getting what you requested. It is a simple configuration option - if he needs help, he should contact the slurm mailing list
>>>> 
>>>> (b) you can ask mpirun to do so, at your own risk. Specify each parameter with a "-x FOO" argument. See "man mpirun" for details. Keep an eye out for aberrant behavior.
>>>> 
>>>> Ralph
>>>> 
>>>> 
>>>> On Wed, Feb 23, 2011 at 8:38 AM, Henderson, Brent <brent.henderson_at_[hidden]> wrote:
>>>> Hi Everyone, I have an OpenMPI/SLURM specific question,
>>>> 
>>>> I'm using MPI as a launcher for another application I'm working on and it is dependent on the SLURM environment variables making their way into the a.out's environment.  This works as I need if I use HP-MPI/PMPI, but when I use OpenMPI, it appears that not all are set as I would like across all of the ranks.
>>>> 
>>>> I have example output below from a simple a.out that just writes out the environment that it sees to a file whose name is based on the node name and rank number.  Note that with OpenMPI, things like SLURM_NNODES and SLURM_TASKS_PER_NODE are not set the same for ranks on the different nodes and things like SLURM_LOCALID are just missing entirely.
>>>> 
>>>> So the question is, should the environment variables on the remote nodes (from the perspective of where the job is launched) have the full set of SLURM environment variables as seen on the launching node?
>>>> 
>>>> Thanks,
>>>> 
>>>> Brent Henderson
>>>> 
>>>> [brent_at_node2 mpi]$ rm node*
>>>> [brent_at_node2 mpi]$ mkdir openmpi hpmpi
>>>> [brent_at_node2 mpi]$ salloc -N 2 -n 4 mpirun ./printenv.openmpi
>>>> salloc: Granted job allocation 23
>>>> Hello world! I'm 3 of 4 on node1
>>>> Hello world! I'm 2 of 4 on node1
>>>> Hello world! I'm 1 of 4 on node2
>>>> Hello world! I'm 0 of 4 on node2
>>>> salloc: Relinquishing job allocation 23
>>>> [brent_at_node2 mpi]$ mv node* openmpi/
>>>> [brent_at_node2 mpi]$ egrep 'NODEID|NNODES|LOCALID|NODELIST|NPROCS|PROCID|TASKS_PER' openmpi/node1.3.of.4
>>>> SLURM_JOB_NODELIST=node[1-2]
>>>> SLURM_NNODES=1
>>>> SLURM_NODELIST=node[1-2]
>>>> SLURM_TASKS_PER_NODE=1
>>>> SLURM_NPROCS=1
>>>> SLURM_STEP_NODELIST=node1
>>>> SLURM_STEP_TASKS_PER_NODE=1
>>>> SLURM_NODEID=0
>>>> SLURM_PROCID=0
>>>> SLURM_LOCALID=0
>>>> [brent_at_node2 mpi]$ egrep 'NODEID|NNODES|LOCALID|NODELIST|NPROCS|PROCID|TASKS_PER' openmpi/node2.1.of.4
>>>> SLURM_JOB_NODELIST=node[1-2]
>>>> SLURM_NNODES=2
>>>> SLURM_NODELIST=node[1-2]
>>>> SLURM_TASKS_PER_NODE=2(x2)
>>>> SLURM_NPROCS=4
>>>> [brent_at_node2 mpi]$
>>>> 
>>>> 
>>>> [brent_at_node2 mpi]$ /opt/hpmpi/bin/mpirun -srun -N 2 -n 4 ./printenv.hpmpi
>>>> Hello world! I'm 2 of 4 on node2
>>>> Hello world! I'm 3 of 4 on node2
>>>> Hello world! I'm 0 of 4 on node1
>>>> Hello world! I'm 1 of 4 on node1
>>>> [brent_at_node2 mpi]$ mv node* hpmpi/
>>>> [brent_at_node2 mpi]$ egrep 'NODEID|NNODES|LOCALID|NODELIST|NPROCS|PROCID|TASKS_PER' hpmpi/node1.1.of.4
>>>> SLURM_NODELIST=node[1-2]
>>>> SLURM_TASKS_PER_NODE=2(x2)
>>>> SLURM_STEP_NODELIST=node[1-2]
>>>> SLURM_STEP_TASKS_PER_NODE=2(x2)
>>>> SLURM_NNODES=2
>>>> SLURM_NPROCS=4
>>>> SLURM_NODEID=0
>>>> SLURM_PROCID=1
>>>> SLURM_LOCALID=1
>>>> [brent_at_node2 mpi]$ egrep 'NODEID|NNODES|LOCALID|NODELIST|NPROCS|PROCID|TASKS_PER' hpmpi/node2.3.of.4
>>>> SLURM_NODELIST=node[1-2]
>>>> SLURM_TASKS_PER_NODE=2(x2)
>>>> SLURM_STEP_NODELIST=node[1-2]
>>>> SLURM_STEP_TASKS_PER_NODE=2(x2)
>>>> SLURM_NNODES=2
>>>> SLURM_NPROCS=4
>>>> SLURM_NODEID=1
>>>> SLURM_PROCID=3
>>>> SLURM_LOCALID=1
>>>> [brent_at_node2 mpi]$
>>>> 
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> 
>>> 
>>> 
>>> --
>>> Jeff Squyres
>>> jsquyres_at_[hidden]
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>> 
>>> 
>> 
-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/