
Open MPI Development Mailing List Archives

From: Ralph Castain (rhc_at_[hidden])
Date: 2007-07-14 17:15:54

Welcome! Yes, Jeff and I have been working on the LSF support based on 7.0
features in collab with the folks at Platform.

Some further comments below...


On 7/14/07 2:02 PM, "Matthew Moskewicz" <moskewcz_at_[hidden]> wrote:

> hi everyone,
> firstly, i'm new around here, and somewhat clueless when it comes to the
> details of working with a big autoconfiscated project like open-rte/open-mpi at
> the svn checkout level ...
> i've read some of the archives that turned up in searches for terms like
> 'LSF', and it would seem there was some discussion about adding some form of
> LSF support to open-rte, but that the discussion ended a while back. so, after
> playing around with the 1.2.3 release tarball for a while, and reading
> various pieces of the code until i had a (vague) idea of the top-level
> control flow and such, i decided i was ready to try to add ras and pls
> components to support LSF. once i had the build system up, i tried to create an
> ras/lsf directory, and slightly to my surprise, it already existed. i was
> kinda hoping for that, but it appears to be *very* fresh code at the moment.
> nonetheless, i played around a bit more, and ran into two issues:
> 1) it appears that you (jeff, i guess ;) are using new LSF 7.0 API features.
> i'm working to support customers in the EDA space, and it's not clear if/when
> they will migrate to 7.0 -- not to mention that our company (cadence) doesn't
> appear to have LSF 7.0 yet. i'm still looking in to the details, but it
> appears that (from the Platform docs) lsb_getalloc is probably just a thin
> wrapper around the LSB_MCPU_HOSTS (spelling?) environment variable. so that
> could be worked around fairly easily. i dunno about lsb_launch -- it seems
> equivalent to a set of ls_rtask() calls (one per process). however, i have
> heard that there can be significant subtleties with the semantics of these
> functions, in terms of compatibility across differently configured
> LSF-controlled farms, specifically with regards to administrators' ability to
> track and control job execution. personally, i don't see how it's really
> possible for LSF to prevent 'bad' users from spamming out jobs or
> short-cutting queues, but perhaps some of the methods they attempt to use can
> complicate things for a library like open-rte.

After lengthy discussions with Platform, it was deemed the best path forward
is to use the lsb_getalloc interface. While it currently reads the environment
variable, they indicated a potential change to read a file instead for
scalability. Rather than chasing any changes, we all agreed that using
lsb_getalloc would remain the "stable" interface - so that is what we used.
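
For anyone who needs the pre-7.0 fallback discussed above, reading the
environment variable directly might look roughly like the sketch below. It
assumes LSB_MCPU_HOSTS holds whitespace-separated host/slot pairs (for
example "hostA 2 hostB 4"). The names parse_mcpu_hosts and alloc_entry_t are
made up for illustration and are not part of LSF or OpenRTE.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One allocation entry: a host name and the number of slots on it.
 * (Illustrative type, not an LSF or OpenRTE structure.) */
typedef struct {
    char host[256];
    int  slots;
} alloc_entry_t;

/*
 * Parse a string assumed to be in the LSB_MCPU_HOSTS layout
 * ("hostA 2 hostB 4") into entries[].
 * Returns the number of entries parsed, or -1 if the input is malformed
 * or holds more than max_entries hosts.
 */
static int parse_mcpu_hosts(const char *value, alloc_entry_t *entries,
                            int max_entries)
{
    assert(value != NULL && entries != NULL);

    int count = 0;
    const char *p = value;

    for (;;) {
        char host[256];
        int slots = 0;
        int used = 0;

        /* Read one "host slots" pair; %n records how far we got. */
        int n = sscanf(p, " %255s %d%n", host, &slots, &used);
        if (n == EOF) {
            break;              /* nothing but whitespace left */
        }
        if (n != 2 || slots <= 0 || count >= max_entries) {
            return -1;          /* dangling host, bad count, or overflow */
        }
        strcpy(entries[count].host, host);
        entries[count].slots = slots;
        count++;
        p += used;
    }
    return count;
}
```

A caller would pass the result of getenv("LSB_MCPU_HOSTS") after checking it
for NULL. If Platform does move the allocation into a file, this sketch would
silently stop working, which is the argument for preferring lsb_getalloc
where it is available.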

Similar reasons for using lsb_launch. I would really advise against making
any changes away from that support. Instead, we could take a lesson from our
bproc support and simply (a) detect if we are on a pre-7.0 release, and then
(b) build our own internal wrapper that provides back-support. See the bproc
pls component for examples.
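
As a rough illustration of the detect-then-wrap idea above, the sketch below
picks a launch backend from a version number. Everything in it is
hypothetical: launch_backend_t, select_backend, and the two stub functions
are invented names, the stubs only count calls instead of talking to LSF, and
how the major version is actually detected (configure test, runtime query) is
left open. It is not how the bproc component is written.

```c
#include <assert.h>
#include <string.h>

/* Operations the rest of the component would call, independent of which
 * LSF release is underneath. (Hypothetical interface.) */
typedef struct {
    const char *name;                                  /* for diagnostics */
    int (*launch)(const char *host, const char *cmd);  /* start one process */
} launch_backend_t;

static int native_calls = 0;
static int compat_calls = 0;

/* Stub: a real build would hand the request to the LSF 7.0 launch API. */
static int launch_native(const char *host, const char *cmd)
{
    (void)host;
    (void)cmd;
    native_calls++;
    return 0;
}

/* Stub: a real build would emulate the launch with pre-7.0 calls. */
static int launch_compat(const char *host, const char *cmd)
{
    (void)host;
    (void)cmd;
    compat_calls++;
    return 0;
}

static const launch_backend_t backend_native = { "native", launch_native };
static const launch_backend_t backend_compat = { "compat", launch_compat };

/* Choose the backend once, based on the detected LSF major version. */
static const launch_backend_t *select_backend(int lsf_major)
{
    assert(lsf_major > 0);
    return (lsf_major >= 7) ? &backend_native : &backend_compat;
}
```

The point of the table is that callers only ever see launch_backend_t, so the
back-support code stays in one place and can be dropped once pre-7.0
installations no longer matter.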

> 2) this brings us to point 2 -- upon talking to the author(s) of cadence's
> internal open-rte-like library, several key issues were raised. mainly,
> customers want their applications to be 'farm-friendly' in several key ways.
> firstly, they do not want any persistent daemons running outside of a given
> job -- this requirement seems met by the current open-mpi default behavior, at
> least as far as i can tell. secondly, they prefer (strongly) that applications
> acquire resources incrementally, and perform work with whatever nodes are
> currently available, rather than forcing a large up-front node allocation.
> fault tolerance is nice too, although it's unclear to me if it's really
> practically needed. in any case, many of our applications can structure their
> computation to use resources in just such a way, generally by dividing the
> work into independent, restartable pieces (i.e. they are embarrassingly parallel).
> also, MPI communication + MPI-2 process creation seems to be a reasonable
> interface for handling communication and dynamic process creation on the
> application side. however, it's not clear that open-rte supports the needed
> dynamic resource acquisition model in any of the ras/pls components i looked
> at. in fact, other than just folding everything into the pls component, it's not
> clear that the entire flow via the rmgr really supports it very well.
> specifically for LSF, the use model is that the initial job either is created
> with bsub/lsb_submit(), (or automatically submits itself as step zero
> perhaps) to run initially on N machines. N should be 'small' (1-16) -- perhaps
> only 1 for simplicity. then, as the application runs, it will continue to
> consume more resources as limited by the farm status, the user selection, and
> the max # of processes that the job can usefully support (generally 'large' --
> 100-1000 cpus).

OpenRTE will be undergoing some changes shortly, so I would strongly
recommend you avoid making major changes without first discussing how they
fit into the new design with us. While cadence is a nice system, there are
tradeoffs in every design approach - and it isn't clear that theirs is
necessarily any better than another.

We could argue for quite some time about their beliefs regarding customers'
desires - I have heard these statements in multiple directions, with people
citing claims of customer "demands" pointing every which way. Bottom line,
from what I can tell, is that customers want something that works and is
transparent to them - how that is done is largely irrelevant.

We have other people working on dynamic resource allocation for other
systems (e.g., TM), and are making some modifications to better support that
kind of requirement. We can discuss those with you if you like to see how
they meet your needs. Not much was done in the past in that regard because
people weren't interested in it. Frankly, we are somewhat moving in the
other direction now, so supporting it in the manner you describe may
possibly become harder rather than easier. You may have to accept some
less-than-ideal result, I fear.

> so, i figure it's up to me to implement this stuff ;) ... clearly, i want to
> keep the 'normal' style ras/pls for LSF working, but somehow add the dynamic
> behavior as an option. my initial thought was to (in the dynamic case)
> basically ignore/fudge the ras/rmaps(/pls?) stages and simply use
> bsub/lsb_submit() in pls to launch new daemons as needed/requested.

Just an FYI: this could cause unexpected behavior in the current
implementation as a number of subsystems depend upon the info coming from
those stages. May not be as big a problem in the revised implementation
currently underway.

> again,
> though it's not clear that the current control flow supports this well. given
> that there may be a large (10sec - 15min) delay between lsb_submit() and job
> launch, it may be necessary to both acquire minimum size blocks of new daemons
> at a time, and to have some non-blocking way to perform spawning. for example,
> in the current code, the MPI-2 spawn is blocking because it needs to return a
> communicator to the spawned process.

Actually, that is not the real reason. It is blocking because the parent
wants to send a message to the new children telling them where/how to
rendezvous with it. The problem is that the parent doesn't know the name of
the child until after the spawn is completed - because we need the child's
OOB contact info so we can send the message. The easiest way to ensure that
all the handshakes occurred correctly was to simply make comm_spawn blocking.

Given that comm_spawn in our current environments is relatively fast, that
was deemed to be an acceptable solution. Obviously, your stated time frames
are much, much longer, so that might not work in those cases.

It would be easier to change it under the revised implementation, which will
better support that kind of difference between environments. In the current
one, it could be quite problematic.

> however, this is not really necessary for
> the application to continue -- it can continue with other work until the new
> worker is up and running. perhaps some form of multi-threading could help with
> this, but it's not totally clear. i think i would prefer some lower-level
> open-rte calls that perform daemon pre-allocation (i.e. dynamic ras/daemon
> startup), such that i know that if there are idle daemons, it is safe to spawn
> without risk of blocking.

I'll have to leave that up to the MPI folks on the team - we have
historically resisted the idea of having one environment behave differently
from another so as to limit "user astonishment". However, if they can live
with that change, I personally have no problem with it.

We just made a significant change to daemon launch procedures, and the flow
between the stages is going to be completely revamped over the next few
months. How that affects your thinking is unclear to me at the moment, but
might be worth further discussion.

Just as an FYI: we already check to see if there are available daemons, and
we do spawn upon them if so. The issue here sounds like it is more in
obtaining a larger-than-immediately-needed allocation, and spawning daemons
on all of it just-in-case they are needed. There is nothing in the system
that precludes doing so - we made a design decision early on not to do it,
but that's not a requirement. Again, the revised implementation would let
you do that much easier than the current one.
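
To make the pre-allocation idea concrete, here is a minimal sketch of the
bookkeeping involved: daemons are requested ahead of need, become idle when
they report in, and a spawn only proceeds if an idle one exists, so the
caller never waits on the batch system. All names (daemon_pool_t,
pool_request, pool_mark_ready, pool_try_spawn) are hypothetical and do not
correspond to any existing OpenRTE call. Submitting the request and detecting
that a daemon has started are assumed to happen elsewhere.

```c
#include <assert.h>

#define POOL_MAX 64

typedef enum {
    DAEMON_PENDING,   /* requested from the batch system, not yet running */
    DAEMON_IDLE,      /* running and reported in, no work assigned */
    DAEMON_BUSY       /* running with a process spawned on it */
} daemon_state_t;

typedef struct {
    daemon_state_t state[POOL_MAX];
    int count;
} daemon_pool_t;

static void pool_init(daemon_pool_t *p)
{
    p->count = 0;
}

/* Record a request for one more daemon. Returns its id, or -1 if full. */
static int pool_request(daemon_pool_t *p)
{
    if (p->count >= POOL_MAX) {
        return -1;
    }
    p->state[p->count] = DAEMON_PENDING;
    return p->count++;
}

/* Called when a previously requested daemon reports that it is up. */
static void pool_mark_ready(daemon_pool_t *p, int id)
{
    assert(id >= 0 && id < p->count);
    if (p->state[id] == DAEMON_PENDING) {
        p->state[id] = DAEMON_IDLE;
    }
}

/* Claim an idle daemon without waiting. Returns its id, or -1 if none
 * is idle right now, in which case the caller carries on with other work. */
static int pool_try_spawn(daemon_pool_t *p)
{
    for (int i = 0; i < p->count; i++) {
        if (p->state[i] == DAEMON_IDLE) {
            p->state[i] = DAEMON_BUSY;
            return i;
        }
    }
    return -1;
}
```

With this shape the application can call pool_try_spawn in its main loop and
simply keep working when it returns -1, which is the "safe to spawn without
risk of blocking" property asked for in the quoted text.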

> oh, and at first glance there appears to be a bunch of duplicated code across
> the various flavors of ras (and similarly for pls, sds). is it reasonable to
> attempt to factor things out? i seem to recall reading that some major rework
> was in progress, so perhaps this would not be a good time?

Definitely not a good time - I would wait awhile and let's see how much of
it remains. Some of it is there because of historical uncertainty over what
would be common and what wouldn't be - some might be there for a reason
known to the original author. I would advise asking before assuming...

> uhm ... well, any advice on anything here?
> thanks,
> Matt.