Open MPI Development Mailing List Archives

From: Matthew Moskewicz (moskewcz_at_[hidden])
Date: 2007-07-14 16:02:47

hi everyone,

firstly, i'm new around here, and somewhat clueless when it comes to the
details of working with a big autoconfiscated project like
open-rte/open-mpi at the svn checkout level ...

i've read some of the archives that turned up in searches for terms like
'LSF', and it would seem there was some discussion about adding some form of
LSF support to open-rte, but that the discussion ended a while back. so,
after playing around with the 1.2.3 release tarball for a while, and
reading various pieces of the code until i had a (vague) idea of the
top-level control flow and such, i decided i was ready to try to add ras and
pls components to support LSF. once i had the build system up, i tried to
create an ras/lsf directory, and slightly to my surprise, it already
existed. i was kinda hoping for that, but it appears to be *very* fresh code
at the moment. nonetheless, i played around a bit more, and ran into two issues:

1) it appears that you (jeff, i guess ;) are using new LSF 7.0 API features.
i'm working to support customers in the EDA space, and it's not clear
if/when they will migrate to 7.0 -- not to mention that our company
(cadence) doesn't appear to have LSF 7.0 yet. i'm still looking into the
details, but it appears (from the Platform docs) that lsb_getalloc is
probably just a thin wrapper around the LSB_MCPU_HOSTS (spelling?)
environment variable. so that could be worked around fairly easily. i dunno
about lsb_launch -- it seems equivalent to a set of ls_rtask() calls (one
per process). however, i have heard that there can be significant subtleties
with the semantics of these functions, in terms of compatibility across
differently configured LSF-controlled farms, specifically with regard to
administrators' ability to track and control job execution. personally, i
don't see how it's really possible for LSF to prevent 'bad' users from
spamming out jobs or short-cutting queues, but perhaps some of the methods
they attempt to use can complicate things for a library like open-rte.
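
for illustration, a minimal python sketch of the workaround mentioned in point 1: building the allocation from the LSB_MCPU_HOSTS environment variable instead of calling lsb_getalloc. it assumes the variable holds alternating hostname / slot-count tokens (e.g. "hostA 2 hostB 4"); the function names are hypothetical and not part of open-rte or the LSF API.

```python
import os


def parse_mcpu_hosts(value):
    """Parse an LSB_MCPU_HOSTS-style string into (host, slots) pairs.

    Assumed format: alternating hostname and slot-count tokens,
    separated by whitespace, e.g. "hostA 2 hostB 4".
    """
    tokens = value.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating host / slot-count tokens")
    nodes = []
    for host, count in zip(tokens[0::2], tokens[1::2]):
        slots = int(count)  # raises ValueError on a non-numeric count
        if slots <= 0:
            raise ValueError("slot count must be positive: %r" % count)
        nodes.append((host, slots))
    return nodes


def allocation_from_env(environ=os.environ):
    """Return the node list for this job, or None if the variable is unset."""
    value = environ.get("LSB_MCPU_HOSTS")
    if value is None:
        return None  # probably not running under LSF
    return parse_mcpu_hosts(value)
```

a ras-style component would then turn each (host, slots) pair into one node entry. this covers only the allocation side; it says nothing about how lsb_launch or ls_rtask() should be used for the launch itself.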

2) this brings us to point 2 -- upon talking to the author(s) of cadence's
internal open-rte-like library, several key issues were raised. mainly,
customers want their applications to be 'farm-friendly' in several key ways.
firstly, they do not want any persistent daemons running outside of a given
job -- this requirement seems met by the current open-mpi default behavior,
at least as far as i can tell. secondly, they prefer (strongly) that
applications acquire resources incrementally, and perform work with whatever
nodes are currently available, rather than forcing a large up-front node
allocation. fault tolerance is nice too, although it's unclear to me if it's
really practically needed. in any case, many of our applications can
structure their computation to use resources in just such a way, generally
by dividing the work into independent, restartable pieces (i.e. they are
embarrassingly ||). also, MPI communication + MPI-2 process creation seems
to be a reasonable interface for handling communication and dynamic process
creation on the application side. however, it's not clear that open-rte
supports the needed dynamic resource acquisition model in any of the ras/pls
components i looked at. in fact, other than just folding everything into the
pls component, it's not clear that the entire flow via the rmgr really
supports it very well. specifically for LSF, the use model is that the
initial job is either created with bsub/lsb_submit() or (perhaps) submits
itself automatically as step zero, to run initially on N machines. N
should be 'small' (1-16) -- perhaps only 1 for simplicity. then, as the
application runs, it will continue to consume more resources as limited by
the farm status, the user selection, and the max # of processes that the job
can usefully support (generally 'large' -- 100-1000 cpus).
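
the incremental use model in point 2 can be sketched as a small simulation: start with a few workers, keep asking for more up to a cap, and hand out independent pieces to whatever workers exist at each step. this is only a sketch under stated assumptions. `grant` is a hypothetical stand-in for the farm (it says how many of the requested workers became available), and nothing here calls LSF or open-rte.

```python
from collections import deque


def run_incrementally(pieces, initial_workers, max_workers, grant):
    """Process independent work pieces while the worker pool grows.

    grant(step, wanted) is a hypothetical callback standing in for the
    farm: it returns how many extra workers became available this step.
    Returns a list of (worker_count, batch) tuples, one per step.
    """
    pending = deque(pieces)
    workers = initial_workers
    history = []
    step = 0
    while pending:
        # never ask for more workers than the cap or the remaining work allows
        wanted = min(max_workers - workers, max(len(pending) - workers, 0))
        if wanted > 0:
            workers += min(wanted, grant(step, wanted))
        # each available worker takes one piece in this step
        batch = [pending.popleft() for _ in range(min(workers, len(pending)))]
        history.append((workers, batch))
        step += 1
    return history
```

for example, with ten pieces, one initial worker, a cap of four, and a farm that grants one extra worker per step, the pool grows 2, 3, 4 and then stays at 4 while the last piece is processed. because the pieces are independent, a piece whose worker is lost could simply be put back on the queue.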

so, i figure it's up to me to implement this stuff ;) ... clearly, i want to
keep the 'normal' style ras/pls for LSF working, but somehow add the dynamic
behavior as an option. my initial thought was to (in the dynamic case)
basically ignore/fudge the ras/rmaps(/pls?) stages and simply use
bsub/lsb_submit() in pls to launch new daemons as needed/requested. again,
though, it's not clear that the current control flow supports this well.
given that there may be a large (10sec - 15min) delay between lsb_submit()
and job launch, it may be necessary to both acquire minimum size blocks of
new daemons at a time, and to have some non-blocking way to perform
spawning. for example, in the current code, the MPI-2 spawn is blocking
because it needs to return a communicator to the spawned process. however,
this is not really necessary for the application to continue -- it can
continue with other work until the new worker is up and running. perhaps
some form of multi-threading could help with this, but it's not totally
clear. i think i would prefer some lower-level open-rte calls that perform
daemon pre-allocation (i.e. dynamic ras/daemon startup), such that i know
that if there are idle daemons, it is safe to spawn without risk of

oh, and at first glance there appears to be a bunch of duplicated code
across the various flavors of ras (and similarly for pls, sds). is it
reasonable to attempt to factor things out? i seem to recall reading that
some major rework was in progress, so perhaps this would not be a good time?

uhm ... well, any advice on anything here?