Open MPI Development Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-07-16 08:24:26

On Jul 15, 2007, at 11:18 PM, Matthew Moskewicz wrote:

>>> i'll probably just continue experimenting on my own for the
>>> moment (tracking
>>> any updates to the main trunk LSF support) to see if i can figure
>>> it out. any
>>> advice on the best way to get such support back into trunk, if and
>>> when it exists
>>> / is working?
>> The *best* way would be for you to sign a third-party agreement -
>> see the
>> web site for details and a copy. Barring that, the only option
>> would be to
>> submit the code through either Jeff or me. We greatly prefer the
>> agreement
>> method as it is (a) less burdensome on us and (b) gives you greater
>> flexibility.
> i'll talk to 'the man' -- it should be okay ... eventually, at
> least ...

See for details. As an
open project, we always welcome new developers, but we do need to
keep the IP tidy.

>> I can't speak to the motivation behind MPI-2 - the others in the
>> group can
>> do a much better job of that. What I can say is that we started
>> out with a
>> design to support such modes of operation as dynamic farms, but
>> the group
>> has been moving away from it due to a combination of performance
>> impacts,
>> reliability, and (frankly) lack of interest from our user
>> community. Our
>> intent now is to cut the RTE back to the basics required to
>> support the MPI
>> standard, including MPI-2 - which arguably says nothing about
>> dynamic
>> resource allocation.
> that's true -- dynamic processes can be useful even under a static
> allocation. in fact, in the short term for my particular application,
> i'll probably do just that -- the user picks an initial allocation,
> and then i just do the best i can. hopefully the allocations will be
> 'small enough' to get away without dynamic acquisition for a while (a
> year?).

FWIW, our experience with the MPI layer has shown that the vast
majority of applications only need a specific set of initial
resources (hosts/cpus) and then just use those. We have seen only a
small class of applications that truly benefit from dynamically
adding / removing resources in the middle of the run. The canonical
manager/worker model fits this criterion (i.e., benefits from
dynamically adding/removing resources), but as you noted, it also
works just fine with a static set of resources. FWIW, I've seen many
MPI applications written with the manager/worker model to ease their
startup with a variable number of nodes (e.g., under a resource
manager) -- they'll just launch as many processes as they get in
their job and then use manager/worker from there to "discover" how many
processes they got, use them all, etc.
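
A minimal sketch of the static manager/worker pattern described above, for
illustration only. It uses Python threads as a stand-in for MPI ranks, so it
is not MPI code; run_manager_worker, num_workers and work_fn are hypothetical
names. The point is that the pool size is only known at run time, and the
manager simply hands the next task to whichever worker is idle.

```python
import queue
import threading


def run_manager_worker(tasks, num_workers, work_fn):
    """Hand out tasks to a fixed pool of workers and collect the results.

    num_workers stands in for "however many processes the job was given";
    the caller discovers it at startup and never changes it afterwards.
    """
    if num_workers < 1:
        raise ValueError("need at least one worker")

    task_queue = queue.Queue()
    results = [None] * len(tasks)

    def worker():
        while True:
            item = task_queue.get()
            if item is None:  # sentinel from the manager: no more work
                return
            index, payload = item
            results[index] = work_fn(payload)

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in workers:
        thread.start()

    # Manager side: queue every task, then one sentinel per worker.
    for item in enumerate(tasks):
        task_queue.put(item)
    for _ in workers:
        task_queue.put(None)

    for thread in workers:
        thread.join()
    return results
```

The same task list gives the same results whether the pool has one worker or
eight, which is why this model tolerates a variable number of nodes.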

> beyond that, i guess i'm just one of those guys that thinks
> it's a shame that MPI supplanted pvm so long ago in the first place.
> and yes, i already looked into modifying pvm instead, no thank you ...
> ;)

A religious argument. ;-) There were certainly good things about
PVM, and MPI managed to take at least some of them.

>> Not to say we won't support it - just indicating that such support
>> will have
>> lower priority and that the system will be designed primarily for
>> other
>> priorities. So dynamic resource allocation will have to be
>> considered as an
>> "exception case", with all the attendant implications.
> fair enough. i'm still hoping it won't be too exceptional, really. on
> a related note, perhaps is it possible to 'join' running openMPI jobs
> (using nameservers or whatnot)? if so, then application level
> workarounds are also possible -- and can even be automated if the
> application just launches a whole new copy of itself via whatever
> top-level means was used to launch itself in the first place.

MPI-2 does support the MPI_COMM_JOIN and MPI_COMM_ACCEPT/
MPI_COMM_CONNECT models. We do support this in Open MPI, but the
restrictions (in terms of ORTE) may not be sufficient for you.

Some other random notes in no particular order:

- As you noted, the LSF support is *very* new; it was just added last

- It also likely doesn't work yet; we started the integration work
and ran into a technical issue that required further discussion with
Platform. They're currently looking into it; we stopped the LSF work
in ORTE until they get back to us.

- FWIW, one of the main reasons OMPI/ORTE didn't add extensive/
flexible support for dynamic addition of resources was the potential
for queue time. Many systems run "full" all the time, so if you try
to acquire more resources, you could just sit in a queue for minutes/
hours/days/weeks before getting nodes. While it is certainly
possible to program with this model, we didn't really want to get
into the rat's nest of corner cases that this would entail, especially
since very few users are asking for it.

- That being said, MPI_THREAD_MULTIPLE and MPI_COMM_SPAWN *might*
offer a way out here. But I think a) THREAD_MULTIPLE isn't working
yet (other OMPI members are working on this), and b) even when
THREAD_MULTIPLE works, there will be ORTE issues to deal with
(canceling pending resource allocations, etc.). Ralph mentioned that
someone else is working on such things on the TM/PBS/Torque side; I
haven't followed that effort closely.
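
One reading of the THREAD_MULTIPLE suggestion above is that a single thread
blocks while the extra resources are being acquired and the rest of the
application keeps running. The sketch below shows only that shape, using plain
Python threads. PendingAllocation, request_fn and on_ready are hypothetical
names; nothing here calls MPI, ORTE or LSF, and cancel() only suppresses the
callback. It does not withdraw a real queued request, which is exactly the
kind of corner case mentioned above.

```python
import threading


class PendingAllocation:
    """Run a blocking resource request in a background thread.

    request_fn: hypothetical callable that blocks until hosts are granted.
    on_ready:   hypothetical callback invoked with the granted hosts.
    """

    def __init__(self, request_fn, on_ready):
        self._cancelled = threading.Event()
        self._thread = threading.Thread(
            target=self._run, args=(request_fn, on_ready), daemon=True
        )
        self._thread.start()

    def _run(self, request_fn, on_ready):
        hosts = request_fn()  # may sit in a queue for a long time
        # Skip the callback if the application lost interest meanwhile.
        if not self._cancelled.is_set():
            on_ready(hosts)

    def cancel(self):
        """Mark the request as unwanted; a late grant is then ignored."""
        self._cancelled.set()

    def wait(self, timeout=None):
        """Return True once the background request has finished."""
        self._thread.join(timeout)
        return not self._thread.is_alive()
```

The caller would create a PendingAllocation, keep computing on its current
hosts, and either receive new hosts through on_ready or call cancel() when the
work runs out first.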

> well, certainly part of the issue is the need (or at least strong
> preference) to support 6.2 -- but read on.
> hmm, i'll need to review the APIs in more detail, but here is my
> current understanding:
> there appear to be some overlaps between the ls_* and lsb_* functions,
> but they seem basically compatible as far as i can tell. almost all
> the functions have a command line version as well, for example:
> lsb_submit()/bsub
> lsb_getalloc()/none and lsb_launch()/blaunch are new with LSF 7.0, but
> appear to just be a different (simpler) interface to existing
> functionality in the LSB_* env vars and the ls_rexec()/lsgrun commands
> -- although, as you say, perhaps platform will hook or enhance them
> later. but, the key issue is that lsb_launch() just starts tasks -- it
> does not perform or interact with the queue or job control (much?).
> so, you can't use these functions to get an allocation in the first
> place, and you have to be careful not to use them as a way around the
> queuing system.
> [ as a side note, the function ls_rexecv()/lsgrun is the one i have
> heard admins do not like because it can break queuing/accounting, and
> might try to disable somehow. i don't really buy that, because it's
> not like you can disable it and have the system still work, since (as
> above) || job launching depends on it. i guess if you really don't
> care about || launching maybe you could. but, if used properly after a
> proper allocation i don't think there should (or even can) be a
> problem. ]
> so, lsb_submit()/bsub is a combination allocate/launch -- you specify
> the allocation size you want, and when it's all ready, it runs the
> 'job' (really the job launcher) only on one (randomly chosen) 'head'
> node from the allocation, with the env vars set so the launcher can
> use ls_rexec/lsgrun functions to start the rest of the job. there are
> of course various script wrappers you can use (mpijob, pvmjob, etc)
> instead of your 'real job'. then, i think lsf *should* try to track
> what processes get started via the wrapper / head process so it knows
> they are part of the same job. i dunno if it really does that -- but,
> my guess is that at the least it assumes the allocation is in use
> until the original process ends. in any case, the wrapper / head
> process examines the environment vars and uses ls_rexec()/lsgrun or
> the like to actually run N copies of the 'real job' executable. in
> 7.0, it can conveniently use lsb_getalloc() and lsb_launch(), but that
> doesn't really change any semantics as far as i know. one could
> imagine that calling lsb_launch() instead of ls_rexec() might be
> preferable from a process tracking point of view, but i don't see why
> Platform couldn't hook ls_rexec() just as well as lsb_launch().
> i really need to get a little more confidence on that issue, since
> it's what determines what actions will (or perhaps already do in
> practice) 'break' the queuing/reporting system.
> there are some 'allocate only' functions as well, such as
> ls_placereq()/lsplace -- these can just return a host list / set the
> env vars without running anything at first. apparently, you need to
> run something 'soon' on the resultant hosts or the load balancer might
> get confused and reuse them. also, since this doesn't seem to go
> through the queues, it's probably not a viable set of functions to
> really use. a red herring, as far as i'm concerned.
> there is also an lsb_runjob() that is similar to lsb_launch(), but for
> an already submitted job. so, if one were to lsb_submit() with an
> option set to never launch it automatically, and then one were to run
> lsb_runjob(), you can avoid the queue and/or force the use of certain
> hosts? i guess this is also not a good function to use, but at least
> the queuing system would be aware of any bad behavior (queue skipping
> via ls_placereq() to get extra hosts, for instance) in this case ...
> there does *not* appear to be an option to lsb_submit() that allows a
> non-blocking programmatic callback when allocation is complete. if
> there was, it would need to deal with process tracking issues, or
> maybe just merge the old and new jobs somehow in that case.
> so to speak to the original point, it would indeed be nice to be able
> to do additional allocations (and then an lsb_launch) with a simple
> programmatic interface for completeness, but i don't see one. however,
> lsb_submit() is pretty close -- it makes a 'new' job, but i think
> that's okay. the initial daemon that gets run on the 'head' (i.e.
> randomly chosen) node of the new job will run an lsb_launch() or
> similar to start up the remaining N-1 daemons as children -- thus
> hopefully keeping the queuing system and process tracking happy. or
> you could use some LSF option / wrapper script to tell it to run the
> same daemon on all N hosts for you, if some suitable option/wrapper
> exists anyway. so, in summary lsb_submit() does allocation + one
> (non-optional) launch on allocation completion. lsb_launch() (or
> similar) does only launching, should probably only be run from the
> single process started from an lsb_submit(), and should only launch
> things on the allocation given by lsb_getalloc() (or env vars).
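
As a rough illustration of the "head process examines the env vars" step in
the quoted summary, the sketch below parses an allocation and works out which
hosts still need a daemon. It assumes the allocation is exposed in an
LSB_MCPU_HOSTS variable as alternating host name / slot count pairs (for
example "hostA 2 hostB 1"); the thread only says "LSB_* env vars", so check
the exact variable and format against your LSF version. The function names
are hypothetical, and nothing is actually launched.

```python
import os


def parse_mcpu_hosts(value):
    """Parse 'hostA 2 hostB 1' into [('hostA', 2), ('hostB', 1)].

    The alternating host/count format is an assumption about LSF,
    not something stated in this thread.
    """
    fields = value.split()
    if len(fields) % 2 != 0:
        raise ValueError("expected alternating host / slot-count pairs")
    allocation = []
    for host, count in zip(fields[0::2], fields[1::2]):
        slots = int(count)
        if slots < 1:
            raise ValueError(f"invalid slot count for {host}: {count}")
        allocation.append((host, slots))
    return allocation


def allocation_from_env(environ=os.environ):
    """Read the allocation from the (assumed) LSB_MCPU_HOSTS variable."""
    return parse_mcpu_hosts(environ.get("LSB_MCPU_HOSTS", ""))


def remaining_daemon_hosts(allocation, head_host):
    """Hosts that still need a daemon: every allocated host except the
    'head' node, which is already running the first one."""
    return [host for host, _slots in allocation if host != head_host]
```

The list returned by remaining_daemon_hosts is what the head daemon would pass
to lsb_launch() or a similar call, so that it only ever starts tasks inside
the allocation it was given.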

I am certainly not an expert on LSF (nor its API) -- I only started
using it last week! Do you have any contacts to ask at Platform?
They would likely be the best ones to discuss this with.

Jeff Squyres
Cisco Systems