I made a major commit to the trunk this morning (r15007) that merits general
notification and some explanation.

*** IMPORTANT NOTE ***

One major impact of the commit you *may* notice is that support for several
environments is broken. The commit is known to break the following
environments, which will not compile at this time: POE, Xgrid, Xcpu, and
Windows. It has been tested on rsh, SLURM, and Bproc.
Modifications for TM support have been made but could not be verified due to
machine problems at LANL. Modifications for SGE have been made but could not
be verified. I will send out a separate note to developers of the borked
environments with suggestions on how to fix the problems. These should be
relatively minor, mostly involving a small change to a couple of function
calls and the addition of one function call in their respective launch

As many of you have noted, the ORTE launch procedure relies heavily on the
orte_rml.xcast function to communicate occasionally large messages to every
process in a job. Until now, this has been a linear communication that sent
the messages directly to every process, which was very inefficient.
This commit repairs that problem, but it comes with a few side effects. You
shouldn't notice anything different (except hopefully for faster starts),
but I will note the differences here.

First, orte_rml.xcast has become a general broadcast-like messaging system.
Messages can now be sent to any tag on the daemons or processes. Note that
any message sent via xcast will be delivered to ALL processes in the
specified job - you don't get to pick and choose. At a later date, we will
introduce an augmented capability that will use the daemons as relays, but
will allow you to send to a specified array of process names.

We also modified orte_rml.xcast so it supports more scalable message routing
methodologies. At the moment, we support three:

(a) direct, which sends the message directly to all recipients. By default,
this mode is used whenever we have fewer than 10 daemons. You can adjust that
crossover point via the oob_xcast_linear_xover param - set this param to the
number of daemons where you want direct to give way to linear. Obviously, if
you set this to something very large, then we will only use direct xcast
mode - set it to zero, and we won't use direct at all. Alternatively, you
can force the use of direct at all scales by setting oob_xcast_mode to

(b) linear, which sends the message to the local daemon on each node. The
daemon then relays it to its own local procs. Note that the daemons in this
mode do not relay the message between themselves, but only to their local
procs. As per a prior message, I have set linear to be the default mode on
all jobs involving more than 10 daemons. Again, you can adjust this by
setting a lower bound on where linear kicks in (as described above). You can
also set an upper bound where linear gives way to binomial by setting the
oob_xcast_binomial_xover param. Alternatively, you can force the use of
linear at all scales by setting oob_xcast_mode to "linear".

(c) binomial, which sends the message via a typical binomial algorithm across
all the daemons, each of which then relays to its own local procs. At this
time, I have set the default for this mode to "off", so it will never kick in
on its own. If you want to
try it out, you will need to either adjust the xover param (as described
above), or set oob_xcast_mode to "binomial".
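
For anyone unfamiliar with the binomial pattern in (c), here is a minimal
Python sketch of a textbook binomial-tree broadcast. It is not the ORTE
implementation: it assumes the daemons are numbered 0..size-1 with daemon 0
originating the message, and the function name binomial_children is made up
for illustration.

```python
from typing import List


def binomial_children(rank: int, size: int) -> List[int]:
    """Daemons that `rank` forwards to in a binomial tree rooted at 0.

    Hypothetical helper: a generic textbook scheme, not ORTE code.
    """
    # Find the lowest set bit of `rank`. Daemon 0 has no set bit, so for
    # it the mask simply grows until it reaches or passes `size`.
    mask = 1
    while mask < size and not (rank & mask):
        mask <<= 1

    # Forward to rank + (each smaller power of two), largest first,
    # skipping targets that fall outside the daemon count.
    children = []
    mask >>= 1
    while mask > 0:
        child = rank + mask
        if child < size:
            children.append(child)
        mask >>= 1
    return children


if __name__ == "__main__":
    for daemon in range(8):
        print(daemon, "->", binomial_children(daemon, 8))
```

With 8 daemons, daemon 0 forwards to 4, 2 and 1, daemon 4 forwards to 6 and
5, and so on, so every daemon has the message after 3 relay steps instead of
7 sequential sends from a single sender.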
Please note that we *do* use the direct messaging mode whenever there is
only one daemon in the system. This is non-negotiable - it is mandated for
singletons and for getting mpirun up and running. Besides, if there is only
one daemon in the system, every message goes "direct" no matter which mode
you pick, so you shouldn't care. ;-)
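
To summarize the selection rules above in one place, here is a rough Python
sketch. It is only my reading of the description, not the actual code: the
function name choose_xcast_mode is hypothetical, the "off" default for
binomial is modeled as None, and treating a daemon count exactly equal to a
crossover value as belonging to the larger-scale mode is an assumption.

```python
from typing import Optional

VALID_MODES = ("direct", "linear", "binomial")


def choose_xcast_mode(num_daemons: int,
                      linear_xover: int = 10,
                      binomial_xover: Optional[int] = None,
                      forced_mode: Optional[str] = None) -> str:
    """Pick an xcast mode from the daemon count (illustrative only)."""
    # A single daemon always goes direct, whatever else is set.
    if num_daemons <= 1:
        return "direct"

    # Stand-in for setting oob_xcast_mode: overrides the crossovers.
    if forced_mode is not None:
        if forced_mode not in VALID_MODES:
            raise ValueError("unknown xcast mode: %r" % forced_mode)
        return forced_mode

    # Stand-in for oob_xcast_binomial_xover; None models the "off" default.
    if binomial_xover is not None and num_daemons >= binomial_xover:
        return "binomial"

    # Stand-in for oob_xcast_linear_xover (default of 10 per the text).
    if num_daemons >= linear_xover:
        return "linear"
    return "direct"


if __name__ == "__main__":
    for n in (1, 5, 64):
        print(n, choose_xcast_mode(n))
```

Setting linear_xover to zero in this sketch removes direct mode for every
job with more than one daemon, and a very large value keeps direct mode at
all scales, matching the behavior described under (a).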
Also note that the current crossover points are totally arbitrary. I have
no data to base those crossovers on, so I simply picked something for now.
If those of you with access to larger systems and with some free time could
try various values, then we could come up with something more intelligent.
Any data would be most appreciated!
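
If it helps with collecting data, a sweep over crossover values could be set
up along these lines. This Python sketch only builds the mpirun argument
lists and does not launch or time anything. The application name, process
count, and the crossover values are placeholders, and the helper name
sweep_commands is made up. The param is passed with mpirun's usual -mca
option.

```python
from typing import List, Sequence


def sweep_commands(xovers: Sequence[int],
                   num_procs: int,
                   app: str = "./a.out") -> List[List[str]]:
    """Build one mpirun command line per candidate crossover value."""
    commands = []
    for xover in xovers:
        commands.append([
            "mpirun",
            "-mca", "oob_xcast_linear_xover", str(xover),
            "-np", str(num_procs),
            app,  # placeholder application
        ])
    return commands


if __name__ == "__main__":
    for cmd in sweep_commands([4, 16, 64], num_procs=256):
        print(" ".join(cmd))
```

Timing each launch (for example with a trivial application that starts up
and exits) at several job sizes would give a first idea of where the
crossovers should sit.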
This commit also involved a significant change to the orteds themselves. The
requirement that orteds *always* be available to relay messages mandated a
change in the way they come alive. In the past, orteds bootstrapped
themselves in two totally different code paths: bootproxy or VM. This is no
longer supported. Orteds now always behave like they are part of a virtual
machine - they simply launch a job if mpirun tells them to do so. This is
another step towards creating an "orteboot" functionality, and it also
provides a clean system for supporting message relaying.

Note one major impact of this commit: multiple daemons on a node cannot be
supported any longer! Only a single daemon/node is now allowed. You
shouldn't notice any difference as this was always transparent. However, if
you are working in an environment where daemons occupied job slots, you
should see some benefit.

Please let me know of any problems. I did my best to test this, but there
will undoubtedly be some problems that crop up, and some code paths that are
borked that I didn't see on any of my available machines or in my