Open MPI Development Mailing List Archives

From: Matthew Moskewicz (moskewcz_at_[hidden])
Date: 2007-07-15 23:18:25


hi again,

[i'm going to snip out the sections that seem resolved]
[also, sorry about mutating the subject last time -- oops.]

>
> This sounds fine - you'll find that the bproc pls does the exact same thing.
> In that case, we use #ifdefs since the APIs are actually different between
> the versions - we just create a wrapper inside the bproc pls code for the
> older version so that we can always call the same API. I'm not sure what the
> case will be in LSF - I believe the function calls are indeed different, so
> you might be able to use the same approach.

okay

>
> > i'll probably just continue experimenting on my own for the moment (tracking
> > any updates to the main trunk LSF support) to see if i can figure it out. any
> > advice on the best way to get such back support into trunk, if and when it exists
> > / is working?
>
>
> The *best* way would be for you to sign a third-party agreement - see the
> web site for details and a copy. Barring that, the only option would be to
> submit the code through either Jeff or I. We greatly prefer the agreement
> method as it is (a) less burdensome on us and (b) gives you greater
> flexibility.
>

i'll talk to 'the man' -- it should be okay ... eventually, at least ...

>
> I can't speak to the motivation behind MPI-2 - the others in the group can
> do a much better job of that. What I can say is that we started out with a
> design to support such modes of operation as dynamic farms, but the group
> has been moving away from it due to a combination of performance impacts,
> reliability, and (frankly) lack of interest from our user community. Our
> intent now is to cut the RTE back to the basics required to support the MPI
> standard, including MPI-2 - which arguably says nothing about dynamic
> resource allocation.

that's true -- dynamic processes can be useful even under a static
allocation. in fact, in the short term for my particular application,
i'll probably do just that -- the user picks an initial allocation,
and then i just do the best i can. hopefully the allocations will be
'small enough' to get away without dynamic acquisition for a while (a
year?). beyond that, i guess i'm just one of those guys that thinks
it's a shame that MPI supplanted pvm so long ago in the first place.
and yes, i already looked into modifying pvm instead, no thank you ...
;)

>
> Not to say we won't support it - just indicating that such support will have
> lower priority and that the system will be designed primarily for other
> priorities. So dynamic resource allocation will have to be considered as an
> "exception case", with all the attendant implications.
>

fair enough. i'm still hoping it won't be too exceptional, really. on
a related note, is it perhaps possible to 'join' running openMPI jobs
(using nameservers or whatnot)? if so, then application level
workarounds are also possible -- and can even be automated if the
application just launches a whole new copy of itself via whatever
top-level means was used to launch itself in the first place.

> I think someone is feeding you a very extreme view of LSF. I have interacted
> for years with people working with LSF-based systems, and can count on the
> fingers of one hand the people who are operating the way you describe.

perhaps -- i'm trying to convince the guy it's worth taking a look at
enhancing open-mpi/open-rte as opposed to continuing with his internal
effort. maybe i'll get him to chime in directly on this issue --
however ...

>
> *Can* you use LSF that way? Sure. Is that how most people use it? Not from
> what I have seen. Still, if that's a mode you want to support...have at it!
> ;-)
>

... that said, his library already has the needed workarounds for this
usage model. still, the communication is much simpler -- TCP point to
point only (which is 'enough' for me now, but i'm not sure about the
future), and i'm a little worried about the maturity and (software
engineering and performance wise) scalability of his effort.

>
> Keep in mind, though, that Open MPI is driven by performance for large-scale
> multiprocessor computations. As I indicated earlier, the type of operation
> you are describing will have to be treated as an "exception case".
> Literally, this means you are welcome to try and make it work, but the
> fundamental operations of the system won't be designed to optimize that mode
> at the sacrifice of the primary objective.
>

again, fair enough. ;)

> > duly noted. i don't pretend to be able to follow the current control flow at
> > the moment. i think just running the debug version with all the printouts
> > should help me a lot there. also, perhaps if i just make a rmgr_dyn_lsf, and
> > don't use sds, then there might not be as many subsystems involved to
> > complain. actually, i suspect the LSF specific part would be (very) small, so
> > perhaps it could be rmgr_dynurm + a new component type like dynraspls to
> > encapsulate the DRM specific part.
>
>
> You have to use sds as this is the framework where the application process
> learns its name. That framework will be receiving more responsibilities in
> the revised implementation, so you'll unfortunately have to use it. Your
> best bet (IMHO) would be to create an lsf_farm component in the new PLM when
> we get the system revised.
>

sounds right -- some things will of course depend on when i need what
working where. but if possible i'll try not to get in too deep before some
of these design changes are in. any hints on a timeline?

> >
> > hmm, i'm thinking that if there was a way to directly tell open-rte to acquire
> > more daemons non-blockingly, that would be enough.
> > in the LSF case, i think one would bsub the daemons themselves (with arguments
> > sufficient to phone-home, so no sds needed?), so (node acquisition == daemon
> > startup).
>
>
> You could - though this sounds pretty non-scalable to me.
>

hmm, in what way? my impression is that you need to go back to an LSF
queue on every request for new resources from LSF. there might be some
way to give a (variably) higher priority to running jobs, but it's
still going to require a bsub()/lsb_submit() (or similar API) to get
new resources. otherwise, it defeats the queuing / job control system.
i think.
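
[ a rough sketch of what 'bsub the daemons' could look like from the
command-line side. only bsub and its -n (slot count) option are real
LSF; the daemon name my_rte_daemon and its --phone-home argument are
made up for illustration and are not part of open-rte. ]

```python
def daemon_request(num_slots, home_host, home_port, daemon="my_rte_daemon"):
    """Build a bsub command line that asks LSF for num_slots more slots.

    The job LSF eventually starts is the daemon itself, so node
    acquisition and daemon startup are one and the same step.
    `daemon` and `--phone-home` are hypothetical placeholders.
    """
    if num_slots < 1:
        raise ValueError("need at least one slot")
    # Where the new daemon should call back once LSF starts it.
    contact = f"{home_host}:{home_port}"
    return ["bsub", "-n", str(num_slots), daemon, "--phone-home", contact]
```

handing that list to subprocess.Popen() would be the non-blocking part:
by default bsub returns once the job is queued, and the new daemons
show up whenever the queue lets them through.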

> >
> > these functions could be called heuristically by MPI-2 spawn type functions, or
> > even manually by the application (in the short term). it should not affect the
> > semantics of the MPI-2 calls themselves.
>
>
> Your best bet would be to have your own component so that you could do
> whatever you wanted with the spawn API. You could play with an RMGR
> component for now, but your best bet is clearly going to be the new PLM.
>

sounds right. same potential timeline issues as above of course.

> >
> > the goal is that one could determine (at least with some confidence) if there
> > were any free (and ready to spawn quickly without blocking) resources before
> > issuing a spawn call. this might just mean examining the value of the MPI
> > universe size (and that this value could change), or it might need some new
> > interface, i dunno.
>
>
> You know, the real issue here (I think) is being driven by your use of bsub
> - which I believe is a batch launch request. Why would you want to do that
> instead of just directly calling lsb_launch()? I suspect we can get the
> Platform folks to give us an API to request additional node allocations from
> inside our program, so why not just use the API to launch? Or are you going
> the batch route because we don't currently have an API and you want to
> support older LSF versions?
> Might be more pain than it's worth...
>

well, certainly part of the issue is the need (or at least strong
preference) to support 6.2 -- but read on.

hmm, i'll need to review the APIs in more detail, but here is my
current understanding:
there appear to be some overlaps between the ls_* and lsb_* functions,
but they seem basically compatible as far as i can tell. almost all
the functions have a command line version as well, for example:
lsb_submit()/bsub

lsb_getalloc()/none and lsb_launch()/blaunch are new with LSF 7.0, but
appear to just be a different (simpler) interface to existing
functionality in the LSB_* env vars and the ls_rexec()/lsgrun commands
-- although, as you say, perhaps platform will hook or enhance them
later. but, the key issue is that lsb_launch() just starts tasks -- it
does not go through the queue or interact with job control (much?).
so, you can't use these functions to get an allocation in the first
place, and you have to be careful not to use them as a way around the
queuing system.
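
[ since blaunch only exists from 7.0 on, supporting 6.2 means falling
back to lsgrun. a minimal sketch of that choice, made at run time on
the command-line tools rather than with #ifdefs on the C API. the
function name and the injectable `which` argument are mine, purely for
illustration. ]

```python
import shutil


def pick_launcher(which=shutil.which):
    """Return the name of the LSF task-launch command to use.

    Prefers blaunch (new with LSF 7.0) and falls back to lsgrun on
    older installs. `which` is injectable so the choice can be
    exercised on a machine without LSF.
    """
    for tool in ("blaunch", "lsgrun"):
        if which(tool) is not None:
            return tool
    raise RuntimeError("no LSF launch command found on PATH")
```

either way the caller sees one interface, and neither tool gets you an
allocation: both only start tasks on hosts you already hold.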

[ as a side note, the function ls_rexecv()/lsgrun is the one i have
heard admins do not like because it can break queuing/accounting, and
might try to disable somehow. i don't really buy that, because it's
not like you can disable it and have the system still work, since (as
above) || job launching depends on it. i guess if you really don't
care about || launching maybe you could. but, if used properly after a
proper allocation i don't think there should (or even can) be a
problem. ]

so, lsb_submit()/bsub is a combination allocate/launch -- you specify
the allocation size you want, and when it's all ready, it runs the
'job' (really the job launcher) only on one (randomly chosen) 'head'
node from the allocation, with the env vars set so the launcher can
use ls_rexec/lsgrun functions to start the rest of the job. there are
of course various script wrappers you can use (mpijob, pvmjob, etc)
instead of your 'real job'. then, i think lsf *should* try to track
what processes get started via the wrapper / head process so it knows
they are part of the same job. i dunno if it really does that -- but,
my guess is that at the least it assumes the allocation is in use
until the original process ends. in any case, the wrapper / head
process examines the environment vars and uses ls_rexec()/lsgrun or
the like to actually run N copies of the 'real job' executable. in
7.0, it can conveniently use lsb_getalloc() and lsb_launch(), but that
doesn't really change any semantics as far as i know. one could
imagine that calling lsb_launch() instead of ls_rexec() might be
preferable from a process tracking point of view, but i don't see why
Platform couldn't hook ls_rexec() just as well as lsb_launch().
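
[ for the 'examines the environment vars' step, a small sketch of how
a head process might read its allocation. LSB_MCPU_HOSTS ("hostA 2
hostB 1") and LSB_HOSTS (one host name per slot) are the variables i
have in mind; the function itself is just an illustration, not
anything from LSF or open-rte. ]

```python
import os


def allocation_from_env(env=os.environ):
    """Return the LSF allocation as a list of (host, slots) pairs.

    Reads LSB_MCPU_HOSTS if set ("hostA 2 hostB 1"), otherwise counts
    repeated names in LSB_HOSTS ("hostA hostA hostB").
    """
    mcpu = env.get("LSB_MCPU_HOSTS")
    if mcpu:
        fields = mcpu.split()
        if len(fields) % 2:
            raise ValueError("LSB_MCPU_HOSTS should be host/count pairs")
        return [(fields[i], int(fields[i + 1]))
                for i in range(0, len(fields), 2)]
    counts = {}
    for host in env.get("LSB_HOSTS", "").split():
        counts[host] = counts.get(host, 0) + 1
    # dicts keep insertion order, so hosts come back in allocation order
    return list(counts.items())
```

on 7.0 lsb_getalloc() hands back the same information, so this is only
needed for the older releases.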

i really need to get a little more confidence on that issue, since
it's what determines what actions will (or perhaps already do in
practice) 'break' the queuing/reporting system.

there are some 'allocate only' functions as well, such as
ls_placereq()/lsplace -- these can just return a host list / set the
env vars without running anything at first. apparently, you need to
run something 'soon' on the resultant hosts or the load balancer might
get confused and reuse them. also, since this doesn't seem to go
through the queues, it's probably not a viable set of functions to
really use. a red herring, as far as i'm concerned.

there is also an lsb_runjob() that is similar to lsb_launch(), but for
an already submitted job. so, if one were to lsb_submit() with an
option set to never launch it automatically, and then one were to run
lsb_runjob(), you can avoid the queue and/or force the use of certain
hosts? i guess this is also not a good function to use, but at least
the queuing system would be aware of any bad behavior (queue skipping
via ls_placereq() to get extra hosts, for instance) in this case ...

there does *not* appear to be an option to lsb_submit() that allows a
non-blocking programmatic callback when allocation is complete. if
there was, it would need to deal with process tracking issues, or
maybe just merge the old and new jobs somehow in that case.

so to speak to the original point, it would indeed be nice to be able
to do additional allocations (and then an lsb_launch) with a simple
programmatic interface for completeness, but i don't see one. however,
lsb_submit() is pretty close -- it makes a 'new' job, but i think
that's okay. the initial daemon that gets run on the 'head' (i.e.
randomly chosen) node of the new job will run an lsb_launch() or
similar to start up the remaining N-1 daemons as children -- thus
hopefully keeping the queuing system and process tracking happy. or
you could use some LSF option / wrapper script to tell it to run the
same daemon on all N hosts for you, if a suitable option/wrapper
exists anyway. so, in summary lsb_submit() does allocation + one
(non-optional) launch on allocation completion. lsb_launch() (or
similar) does only launching, should probably only be run from the
single process started from an lsb_submit(), and should only launch
things on the allocation given by lsb_getalloc() (or env vars).
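
[ the head-daemon side of that, as a sketch: given the host list of
the new job and the host the first daemon landed on, work out where
the remaining N-1 daemons go, and refuse to run if the head is not in
the allocation at all. the function name is hypothetical, and the
actual launch (lsb_launch() or similar) is left out. ]

```python
def remaining_daemon_hosts(allocation, head_host):
    """Hosts in the allocation that still need a daemon started.

    `allocation` is the host list for this job (it may repeat a host
    once per slot); one daemon per distinct host is assumed, and the
    head host already has its daemon.
    """
    if head_host not in allocation:
        raise ValueError("head host is not part of this allocation")
    seen = {head_host}
    rest = []
    for host in allocation:
        if host not in seen:
            seen.add(host)
            rest.append(host)
    return rest
```

launching only on the hosts this returns keeps everything inside the
allocation LSF granted, which is the point of the last rule above.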

Matt.