Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] SLURM and OpenMPI
From: Sacerdoti, Federico (Federico.Sacerdoti_at_[hidden])
Date: 2008-06-23 12:17:39


Ralph,

Thanks for your reply. Let me know if I can help in any way.

fds

-----Original Message-----
From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
Behalf Of Ralph H Castain
Sent: Thursday, June 19, 2008 10:24 AM
To: Sacerdoti, Federico; Open MPI Users <users_at_[hidden]>
Subject: Re: [OMPI users] SLURM and OpenMPI

Well, if the only system I cared about was slurm, there are some things
I could possibly do to make things better, but at the expense of our
support for other environments - which is unacceptable.

There are a few technical barriers to doing this without the orteds on
slurm, and a major licensing issue that prohibits us from calling any
slurm APIs. How all that gets resolved is unclear.

Frankly, one reason we don't put more emphasis on it is that we don't
see a significant launch time difference between the two modes, and we
truly do want to retain the ability to utilize different error response
strategies (which slurm will not allow - you can only follow theirs).

So I would say we simply have different objectives than what you
stated, and different concerns that make a deeper slurm integration
less favorable. It may still happen, but not anytime soon.

Ralph

On 6/19/08 8:08 AM, "Sacerdoti, Federico"
<Federico.Sacerdoti_at_[hidden]> wrote:

> Ralph thanks for your quick response.
>
> Regarding your fourth paragraph, slurm will not let you run on a
> no-longer-valid allocation, and srun will correctly exit non-zero with
> a useful failure reason. So perhaps openmpi 1.3 with your changes will
> just work; I look forward to testing it.
>
> E.g.
> $ srun hostname
> srun: error: Unable to confirm allocation for job 745346: Invalid job
> id specified
> srun: Check SLURM_JOBID environment variable for expired or invalid
> job.
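The exit-status behavior described above lends itself to a fail-fast wrapper. A minimal sketch, assuming POSIX sh and that SLURM_JOBID is set by salloc/sbatch (the function name is illustrative, not any OpenMPI or SLURM API):

```shell
# Hypothetical guard: run a trivial command under the current allocation
# first; srun exits non-zero if the job id is expired or invalid, so we
# can fail fast instead of hanging. "$@" is the real launch command.
run_if_allocation_valid() {
    if ! srun --jobid "$SLURM_JOBID" -N 1 -n 1 true; then
        echo "allocation $SLURM_JOBID is no longer valid" >&2
        return 1
    fi
    "$@"
}
# usage: run_if_allocation_valid mpirun -np 2 helloworld
```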
>
>
> Regarding srun to launch the jobs directly (no orteds), I am sad to
> hear the idea is not in favor. We have found srun to be extremely
> scalable (tested up to 4096 MPI processes) and very good at cleaning
> up after an error or node failure. It seems you could simplify orterun
> quite a bit by relying on slurm (or whatever resource manager) to
> handle job cleanup after failures; it is their responsibility after
> all, and they have better knowledge about the health and availability
> of nodes than any launcher can hope for.
>
> I helped write an internal mvapich launcher called mvrun, which was
> used for several years. I wrote a lot of logic to run down and stop
> all processes when one had failed, which I understand you have as
> well. We came to the conclusion that slurm was in a better position to
> handle such failures, and in fact did it more effectively. For
> example, if slurm detects a node has failed, it will stop the job,
> allocate an additional free node to make up the deficit, then
> relaunch. It is more difficult (to put it mildly) for a job launcher
> to do that.
>
> Thanks again,
> Federico
>
> -----Original Message-----
> From: Ralph H Castain [mailto:rhc_at_[hidden]]
> Sent: Tuesday, June 17, 2008 1:09 PM
> To: Sacerdoti, Federico; Open MPI Users <users_at_[hidden]>
> Subject: Re: [OMPI users] SLURM and OpenMPI
>
> I can believe 1.2.x has problems in that regard. Some of that has
> nothing to do with slurm and reflects internal issues with 1.2.
>
> We have made it much more resistant to those problems in the upcoming
> 1.3 release, but there is no plan to retrofit those changes to 1.2.
> Part of the problem was that we weren't using the --kill-on-bad-exit
> flag when we called srun internally, which has been fixed for 1.3.
>
> BTW: we actually do use srun to launch the daemons - we just call it
> internally from inside orterun. The only real difference is that we
> use orterun to set up the cmd line and then tell the daemons what they
> need to do. The issues you are seeing relate to our ability to detect
> that srun has failed, and/or that one or more daemons have failed to
> launch or do something they were supposed to do. The 1.2 system has
> problems in that regard, which was one motivation for the 1.3
> overhaul.
>
> I would argue that slurm allowing us to attempt to launch on a
> no-longer-valid allocation is a slurm issue, not OMPI's. As I said, we
> use srun to launch the daemons - the only reason we hang is that srun
> is not returning with an error. I've seen this on other systems as
> well, but have no real answer - if slurm doesn't indicate an error has
> occurred, I'm not sure what I can do about it.
>
> We are unlikely to use srun to directly launch jobs (i.e., to have
> slurm directly launch the job from an srun cmd line without mpirun)
> anytime soon. It isn't clear there is enough benefit to justify the
> rather large effort, especially considering what would be required to
> maintain scalability. Decisions on all that are still pending, though,
> which means any significant change in that regard wouldn't be released
> until sometime next year.
>
> Ralph
>
> On 6/17/08 10:39 AM, "Sacerdoti, Federico"
> <Federico.Sacerdoti_at_[hidden]> wrote:
>
>> Ralph,
>>
>> I was wondering what the status of this feature was (using srun to
>> launch orted daemons)? I have two new bug reports to add from our
>> experience using orterun from 1.2.6 on our 4000 CPU infiniband
>> cluster.
>>
>> 1. Orterun will happily hang if it is asked to run on an invalid
>> slurm job, e.g. if the job has exceeded its time limit. This would be
>> trivially fixed if you used srun to launch, as srun would fail with a
>> non-zero exit code.
>>
>> 2. A very simple orterun invocation hangs instead of exiting with an
>> error. In this case the executable does not exist, and we would
>> expect orterun to exit non-zero. This has caused headaches with some
>> workflow management scripts that automatically start jobs.
>>
>> salloc -N2 -p swdev orterun dummy-binary-I-dont-exist
>> [hang]
>>
>> orterun dummy-binary-I-dont-exist
>> [hang]
>>
>> Thanks,
>> Federico
>>
>> -----Original Message-----
>> From: Sacerdoti, Federico
>> Sent: Friday, March 21, 2008 5:41 PM
>> To: 'Open MPI Users'
>> Subject: RE: [OMPI users] SLURM and OpenMPI
>>
>>
>> Ralph wrote:
>> "I don't know if I would say we "interfere" with SLURM - I would say
>> that we are only lightly integrated with SLURM at this time. We use
>> SLURM as a resource manager to assign nodes, and then map processes
>> onto those nodes according to the user's wishes. We chose to do this
>> because srun applies its own load balancing algorithms if you launch
>> processes directly with it, which leaves the user with little
>> flexibility to specify their desired rank/slot mapping. We chose to
>> support the greater flexibility."
>>
>> Ralph, we wrote a launcher for mvapich that uses srun to launch but
>> keeps tight control of where processes are started. The way we did it
>> was to force srun to launch a single process on a particular node.
>>
>> The launcher calls many of these:
>> srun --jobid $JOBID -N 1 -n 1 -w host005 CMD ARGS
>>
>> Hope this helps (and we are looking forward to a tighter
>> orterun/slurm integration, as you know).
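The "many single-process sruns" pattern described above can be wrapped in a small loop. A sketch under the same assumptions (the function name and host arguments are illustrative); each rank is pinned to one node with -N 1 -n 1 -w, and the calls run in parallel:

```shell
# Launch one copy of a command on each named host, one srun per rank,
# mirroring the per-host srun approach described in the mail above.
launch_one_per_host() {
    cmd=$1; shift
    for host in "$@"; do
        srun --jobid "$SLURM_JOBID" -N 1 -n 1 -w "$host" "$cmd" &
    done
    wait    # collect all background sruns
}
# usage: launch_one_per_host ./a.out host005 host006
```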
>>
>> Regards,
>> Federico
>>
>> -----Original Message-----
>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
>> Behalf Of Ralph Castain
>> Sent: Thursday, March 20, 2008 6:41 PM
>> To: Open MPI Users <users_at_[hidden]>
>> Cc: Ralph Castain
>> Subject: Re: [OMPI users] SLURM and OpenMPI
>>
>> Hi there
>>
>> I am no slurm expert. However, it is our understanding that
>> SLURM_TASKS_PER_NODE means the number of slots allocated to the job,
>> not the number of tasks to be executed on each node. So the 4(x2)
>> tells us that we have 4 slots on each of two nodes to work with. You
>> got 4 slots on each node because you used the -N option, which told
>> slurm to assign all slots on that node to this job - I assume you
>> have 4 processors on your nodes. OpenMPI parses that string to get
>> the allocation, then maps the number of specified processes against
>> it.
>>
>> It is possible that the interpretation of SLURM_TASKS_PER_NODE is
>> different when used to allocate as opposed to directly launch
>> processes. Our typical usage is for someone to do:
>>
>> srun -N 2 -A
>> mpirun -np 2 helloworld
>>
>> In other words, we use srun to create an allocation, and then run
>> mpirun separately within it.
>>
>>
>> I am therefore unsure what the "-n 2" will do here. If I believe the
>> documentation, it would seem to imply that srun will attempt to
>> launch two copies of "mpirun -np 2 helloworld", yet your output
>> doesn't seem to support that interpretation. It would appear that the
>> "-n 2" is being ignored and only one copy of mpirun is being
>> launched. I'm no slurm expert, so perhaps that interpretation is
>> incorrect.
>>
>> Assuming that the -n 2 is ignored in this situation, your command
>> line:
>>
>>> srun -N 2 -n 2 -b mpirun -np 2 helloworld
>>
>> will cause mpirun to launch two processes, mapped byslot against the
>> slurm allocation of two nodes, each having 4 slots. Thus, both
>> processes will be launched on the first node, which is what you
>> observed.
>> Similarly, the command line
>>
>>> srun -N 2 -n 2 -b mpirun helloworld
>>
>> doesn't specify the #procs to mpirun. In that case, mpirun will
>> launch a process on every available slot in the allocation. Given
>> this command, that means 4 procs will be launched on each of the 2
>> nodes, for a total of 8 procs. Ranks 0-3 will be placed on the first
>> node, ranks 4-7 on the second. Again, this is what you observed.
>>
>> I don't know if I would say we "interfere" with SLURM - I would say
>> that we are only lightly integrated with SLURM at this time. We use
>> SLURM as a resource manager to assign nodes, and then map processes
>> onto those nodes according to the user's wishes. We chose to do this
>> because srun applies its own load balancing algorithms if you launch
>> processes directly with it, which leaves the user with little
>> flexibility to specify their desired rank/slot mapping. We chose to
>> support the greater flexibility.
>>
>> Using the SLURM-defined mapping will require launching without our
>> mpirun. This capability is still under development, and there are
>> issues with doing that in slurm environments which need to be
>> addressed. It is at a lower priority than providing such support for
>> TM right now, so I wouldn't expect it to become available for several
>> months at least.
>>
>> Alternatively, it may be possible for mpirun to get the SLURM-defined
>> mapping and use it to launch the processes. If we can get it somehow,
>> there is no problem launching it as specified - the problem is how to
>> get the map! Unfortunately, slurm's licensing prevents us from using
>> its internal APIs, so obtaining the map is not an easy thing to do.
>>
>> Anyone who wants to help accelerate that timetable is welcome to
>> contact me. We know the technical issues - this is mostly a problem
>> of (a) priorities versus my available time, and (b) similar
>> considerations on the part of the slurm folks to do the work
>> themselves.
>>
>> Ralph
>>
>>
>> On 3/20/08 3:48 PM, "Tim Prins" <tprins_at_[hidden]> wrote:
>>
>>> Hi Werner,
>>>
>>> Open MPI does things a little bit differently than other MPIs when
>>> it comes to supporting SLURM. See
>>> http://www.open-mpi.org/faq/?category=slurm
>>> for general information about running with Open MPI on SLURM.
>>>
>>> After trying the commands you sent, I am actually a bit surprised by
>>> the results. I would have expected this mode of operation to work.
>>> But looking at the environment variables that SLURM is setting for
>>> us, I can see why it doesn't.
>>>
>>> On a cluster with 4 cores/node, I ran:
>>> [tprins_at_odin ~]$ cat mprun.sh
>>> #!/bin/sh
>>> printenv
>>> [tprins_at_odin ~]$ srun -N 2 -n 2 -b mprun.sh
>>> srun: jobid 55641 submitted
>>> [tprins_at_odin ~]$ cat slurm-55641.out |grep SLURM_TASKS_PER_NODE
>>> SLURM_TASKS_PER_NODE=4(x2)
>>> [tprins_at_odin ~]$
>>>
>>> Which seems to be wrong, since the srun man page says that
>>> SLURM_TASKS_PER_NODE is the "Number of tasks to be initiated on
>>> each node". This seems to imply that the value should be "1(x2)".
>>> So maybe this is a SLURM problem? If this value were correctly
>>> reported, Open MPI should work fine for what you wanted to do.
>>>
>>> Two other things:
>>> 1. You should probably use the command line option '--npernode' for
>>> mpirun instead of setting rmaps_base_n_pernode directly.
>>> 2. Regarding your second example below, Open MPI by default maps
>>> 'by slot'. That is, it will fill all available slots on the first
>>> node before moving to the second. You can change this; see:
>>> http://www.open-mpi.org/faq/?category=running#mpirun-scheduling
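The two suggestions above would look roughly like this on the command line (flag spellings as understood for the 1.2-era mpirun; helloworld is a placeholder binary):

```shell
# limit mpirun to one process per node
mpirun --npernode 1 helloworld

# or map ranks round-robin across nodes instead of filling slots first
mpirun -np 2 --bynode helloworld
```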
>>>
>>> I have copied Ralph on this mail to see if he has a better response.
>>>
>>> Tim
>>>
>>> Werner Augustin wrote:
>>>> Hi,
>>>>
>>>> At our site here at the University of Karlsruhe we are running two
>>>> large clusters with SLURM and HP-MPI. For our new cluster we want
>>>> to keep SLURM and switch to OpenMPI. While testing I got the
>>>> following problem:
>>>>
>>>> with HP-MPI I do something like
>>>>
>>>> srun -N 2 -n 2 -b mpirun -srun helloworld
>>>>
>>>> and get
>>>>
>>>> Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13.
>>>> Hi, here is process 1 of 2, running MPI version 2.0 on xc3n14.
>>>>
>>>> when I try the same with OpenMPI (version 1.2.4)
>>>>
>>>> srun -N 2 -n 2 -b mpirun helloworld
>>>>
>>>> I get
>>>>
>>>> Hi, here is process 1 of 8, running MPI version 2.0 on xc3n13.
>>>> Hi, here is process 0 of 8, running MPI version 2.0 on xc3n13.
>>>> Hi, here is process 5 of 8, running MPI version 2.0 on xc3n14.
>>>> Hi, here is process 2 of 8, running MPI version 2.0 on xc3n13.
>>>> Hi, here is process 4 of 8, running MPI version 2.0 on xc3n14.
>>>> Hi, here is process 3 of 8, running MPI version 2.0 on xc3n13.
>>>> Hi, here is process 6 of 8, running MPI version 2.0 on xc3n14.
>>>> Hi, here is process 7 of 8, running MPI version 2.0 on xc3n14.
>>>>
>>>> and with
>>>>
>>>> srun -N 2 -n 2 -b mpirun -np 2 helloworld
>>>>
>>>> Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13.
>>>> Hi, here is process 1 of 2, running MPI version 2.0 on xc3n13.
>>>>
>>>> which is still wrong, because it uses only one of the two allocated
>>>> nodes.
>>>>
>>>> OpenMPI uses the SLURM_NODELIST and SLURM_TASKS_PER_NODE
>>>> environment variables, starts one orted per node via slurm, and
>>>> launches tasks up to the maximum number of slots on every node. So
>>>> basically it also does some 'resource management' and interferes
>>>> with slurm. OK, I can fix that with a mpirun wrapper script which
>>>> calls mpirun with the right -np and the right rmaps_base_n_pernode
>>>> setting, but it gets worse. We want to allocate computing power on
>>>> a per-cpu basis instead of per node, i.e. different users might
>>>> share a node. In addition, slurm allows scheduling according to
>>>> memory usage. Therefore it is important that on every node there is
>>>> exactly the number of tasks running that slurm wants. The only
>>>> solution I came up with is to generate a detailed hostfile for
>>>> every job and call mpirun --hostfile. Any suggestions for
>>>> improvement?
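The hostfile approach in the last paragraph needs the per-node task counts, which SLURM encodes compactly (e.g. "4(x2),2"). A sketch of the expansion step, assuming only POSIX sh (the function name is made up); pairing the expanded counts with the node list would then yield the hostfile lines:

```shell
# Expand SLURM_TASKS_PER_NODE notation such as "4(x2),2" into one
# count per node ("4 4 2"). Each comma-separated item is either a
# plain count or count(xrepeats).
expand_tasks_per_node() {
    echo "$1" | tr ',' '\n' | while read -r spec; do
        case "$spec" in
            *"(x"*)
                count=${spec%%"("*}                  # "4(x2)" -> "4"
                reps=${spec##*"(x"}; reps=${reps%)}  # "4(x2)" -> "2"
                i=0
                while [ "$i" -lt "$reps" ]; do
                    printf '%s ' "$count"
                    i=$((i + 1))
                done ;;
            *)  printf '%s ' "$spec" ;;
        esac
    done
}

expand_tasks_per_node "4(x2),2"   # prints: 4 4 2
```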
>>>>
>>>> I've found a discussion thread "slurm and all-srun orterun" in the
>>>> mailing list archive concerning the same problem, where Ralph
>>>> Castain announced that he is working on two new launch methods
>>>> which would fix my problems. Unfortunately his email address is
>>>> deleted from the archive, so it would be really nice if the
>>>> friendly elf mentioned there is still around and could forward my
>>>> mail to him.
>>>>
>>>> Thanks in advance,
>>>> Werner Augustin
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>
>
