
Open MPI Development Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-06-12 16:49:12


Sounds good, Ralph; thanks!

On Jun 12, 2007, at 9:54 AM, Ralph H Castain wrote:

> Yo all
>
> I made a major commit to the trunk this morning (r15007) that merits
> general notification and some explanation.
>
> *** IMPORTANT NOTE ***
> One major impact of the commit you *may* notice is that support for
> several environments will be broken. This commit is known to break
> support for the following environments: POE, Xgrid, Xcpu, Windows -
> these environments will not compile at this time. It has been tested
> on rsh, SLURM, and Bproc. Modifications for TM support have been made
> but could not be verified due to machine problems at LANL.
> Modifications for SGE have been made but could not be verified. I will
> send out a separate note to developers of the borked environments with
> suggestions on how to fix the problems. These should be relatively
> minor, mostly involving a small change to a couple of function calls
> and the addition of one function call in their respective launch
> functions.
>
>
> As many of you have noted, the ORTE launch procedure relies heavily on
> the orte_rml.xcast function to communicate occasionally large messages
> to every process in a job. This procedure has - until now - been a
> linear communication that sent the messages directly to every process.
> Obviously, as many of you have pointed out, this was a very
> inefficient methodology.
>
> This commit repairs that problem, but it comes with a few side
> effects. You shouldn't notice anything different (except hopefully for
> faster starts), but I will note the differences here.
>
> First, orte_rml.xcast has become a general broadcast-like messaging
> system. Messages can now be sent to any tag on the daemons or
> processes. Note that any message sent via xcast will be delivered to
> ALL processes in the specified job - you don't get to pick and choose.
> At a later date, we will introduce an augmented capability that will
> use the daemons as relays, but will allow you to send to a specified
> array of process names.
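
(A quick aside for anyone who hasn't poked at this code: conceptually, a
sender now just packs a buffer and hands it to xcast together with the
target jobid and the RML tag it should be delivered on. The fragment below
is only my approximation of that pattern - the buffer handling is the usual
ORTE boilerplate, but the exact xcast argument list and the tag name are
placeholders, not copied from r15007.)

    /* Approximate sketch only - the xcast prototype, "jobid", and
     * ORTE_RML_TAG_MY_TAG are placeholders, not the literal r15007 code. */
    orte_buffer_t buf;
    int32_t payload = 42;                              /* whatever the message is */

    OBJ_CONSTRUCT(&buf, orte_buffer_t);
    orte_dss.pack(&buf, &payload, 1, ORTE_INT32);
    orte_rml.xcast(jobid, &buf, ORTE_RML_TAG_MY_TAG);  /* delivered to ALL procs in jobid */
    OBJ_DESTRUCT(&buf);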
>
> We also modified orte_rml.xcast so it supports more scalable message
> routing methodologies. At the moment, we support three:
>
> (a) direct, which sends the message directly to all recipients. By
> default, this mode is used whenever we have fewer than 10 daemons. You
> can adjust that crossover point via the oob_xcast_linear_xover param -
> set this param to the number of daemons where you want direct to give
> way to linear. Obviously, if you set this to something very large,
> then we will only use direct xcast mode - set it to zero, and we won't
> use direct at all. Alternatively, you can force the use of direct at
> all scales by setting oob_xcast_mode to "direct".
>
> (b) linear, which sends the message to the local daemon on each node.
> The daemon then relays it to its own local procs. Note that the
> daemons in this mode do not relay the message between themselves, but
> only to their local procs. As per a prior message, I have set linear
> to be the default mode on all jobs involving more than 10 daemons.
> Again, you can adjust this by setting a lower bound on where linear
> kicks in (as described above). You can also set an upper bound where
> linear gives way to binomial by setting the oob_xcast_binomial_xover
> param. Alternatively, you can force the use of linear at all scales by
> setting oob_xcast_mode to "linear".
>
> (c) binomial, which sends the message via a binomial algo across all
> the daemons, each of which then relays to its own local procs. This is
> just a typical binomial algorithm across the daemons. At this time, I
> have set the default on this mode to be "off" so it will never kick
> on. If you want to try it out, you will need to either adjust the
> xover param (as described above), or set oob_xcast_mode to "binomial".
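
To make the crossover behavior concrete: with the logic described above, the
mode choice boils down to something like the sketch below. This is my own
illustrative paraphrase, not the actual routine in the OOB component - the
function and variable names are invented; only the param names
(oob_xcast_linear_xover, oob_xcast_binomial_xover, oob_xcast_mode) come from
the commit.

    #include <string.h>

    /* Illustrative paraphrase of the mode selection described above - not
     * the actual OOB code.  linear_xover / binomial_xover stand in for the
     * oob_xcast_linear_xover / oob_xcast_binomial_xover params, and
     * "forced" for an explicit oob_xcast_mode setting (or NULL if unset). */
    typedef enum { XCAST_DIRECT, XCAST_LINEAR, XCAST_BINOMIAL } xcast_mode_t;

    static xcast_mode_t select_xcast_mode(int num_daemons, int linear_xover,
                                          int binomial_xover, const char *forced)
    {
        if (NULL != forced) {                            /* user forced a mode */
            if (0 == strcmp(forced, "direct"))   return XCAST_DIRECT;
            if (0 == strcmp(forced, "linear"))   return XCAST_LINEAR;
            if (0 == strcmp(forced, "binomial")) return XCAST_BINOMIAL;
        }
        if (1 == num_daemons)             return XCAST_DIRECT;  /* always direct */
        if (num_daemons < linear_xover)   return XCAST_DIRECT;  /* default xover: 10 */
        if (num_daemons < binomial_xover) return XCAST_LINEAR;  /* binomial off by default */
        return XCAST_BINOMIAL;
    }

So with stock settings you get direct below 10 daemons and linear everywhere
else; lowering oob_xcast_binomial_xover (or forcing the mode, e.g.
"mpirun -mca oob_xcast_mode binomial ...") is what actually turns the
binomial tree on.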
>
> Please note that we *do* use the direct messaging mode whenever there
> is only one daemon in the system. This is non-negotiable - it is
> mandated for singletons and for getting mpirun up and running.
> Besides, if there is only one daemon in the system, every message goes
> "direct" no matter which mode you pick, so you shouldn't care. ;-)
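
And just to spell out mode (c): it is the textbook binomial tree across the
daemons, each daemon then handing the message to its own local procs. Here
is a toy illustration of who relays to whom, with the root arbitrarily at
daemon 0 - my own sketch, not code from the commit:

    #include <stdio.h>

    /* Print which daemons daemon "me" would relay to in a binomial tree of
     * "num_daemons" daemons rooted at 0 - textbook algorithm, for
     * illustration only. */
    static void binomial_relays(int me, int num_daemons)
    {
        int mask;
        for (mask = 1; mask < num_daemons; mask <<= 1) {
            if (mask > me && me + mask < num_daemons) {
                printf("daemon %d relays to daemon %d\n", me, me + mask);
            }
        }
    }

    /* With 8 daemons: 0 -> {1,2,4}, 1 -> {3,5}, 2 -> {6}, 3 -> {7}, so
     * every daemon is reached after 3 rounds of relays. */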
>
> Also note that the current crossover points were totally arbitrary. I
> have no data to base those crossovers on, so I simply picked something
> for now. If those of you with access to larger systems and with some
> free time could try various values, then we could come up with
> something more intelligent. Any data would be most appreciated!
>
> This commit also involved a significant change to the orteds
> themselves. The requirement that orteds *always* be available to relay
> messages mandated a change in the way they come alive. In the past,
> orteds bootstrapped themselves in two totally different code paths:
> bootproxy or VM. This is no longer supported. Orteds now always behave
> like they are part of a virtual machine - they simply launch a job if
> mpirun tells them to do so. This is another step towards creating an
> "orteboot" functionality, but it also provides a clean system for
> supporting message relaying.
>
> Note one major impact of this commit: multiple daemons on a node
> cannot be supported any longer! Only a single daemon/node is now
> allowed. You shouldn't notice any difference as this was always
> transparent. However, if you are working in an environment where
> daemons occupied job slots, you should see some benefit.
>
> Please let me know of any problems. I did my best to test this, but
> there will undoubtedly be some problems that crop up, and some code
> paths that are borked that I didn't see on any of my available
> machines or in my configurations.
>
> Ralph
>
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Jeff Squyres
Cisco Systems