I understand that several people are interested in the OpenRTE scalability
issues - this is great! However, it appears we haven't done a very good job
of circulating information about the identified causes of the current
issues. In the hope of helping people to be productive in their
contributions, I thought it might be useful if we circulated both the info
and diagnoses to date, as well as the remediation plans that have been
developed by those of us working on the issues so far, along with their
current status.

First, a quick recap so everyone starts from a common knowledge base. We
have performed roughly 4 scalability tests on OpenRTE/Open MPI over the last
two years. In each case, we had exclusive use of a large cluster so that we
could run large scale jobs - typically consisting of 500+ nodes and up to 8K
processes, operating under either SLURM or TM environments. We have also
received some scaling data from efforts at Sun involving Solaris-based
systems running under N1GE. The tests showed we could reliably launch to
about the 1K process level, but we encountered difficulties when extending
significantly beyond that point.

The scalability issues generally break down into two categories:

1. Memory consumption. We see a "spike" in memory usage by the HNP early in
the launch process that can be quite large. In the earliest tests, we saw
GBytes consumed during the launch of ~2K processes, with a steady-state
usage of ~20MBytes. The spike was caused by the copying of buffers during
transmission of OOB messages, combined with the large size of the STG1 xcast
message. This was corrected at the time (courtesy of Tim W) by having the
OOB *not* copy message buffers. Follow-on tests showed that the memory
"spike" had essentially disappeared.

However, recent tests indicate that this "fix" may have been lost, or we may
now be using a code path that bypasses it (we used to send the xcast
messages via blocking sends, but now use non-blocking sends, which do follow
a slightly different code path). Regardless of the reason, it appears that
the copying of buffers has returned, and OpenRTE once again exhibits the
GByte memory "spike" on large jobs.

Steady-state memory usage is driven by two things. First, we made a design
decision at the beginning of the project to provide maximum system
flexibility. Hence, there is no overarching control over the data being
stored within the system (specifically, within the GPR framework), and each
component/framework is free to store whatever its author wants. Given the
free-lance nature of the development of these components, there is some
non-trivial duplication of information on the GPR. However, if you add all
that up, it doesn't amount to a very large number (on the order of a few
megabytes for large scales). On machines of interest to those of us working
on the code at the time, the steady-state memory footprint was not a
priority issue - hence, little has been done to reduce it.

2. OOB communications. There are two primary issues in this category, both
of which lead back to the same core problem. First, the number of TCP socket
connections back to the HNP grows linearly with the number of processes. In
the most recent tests, Galen reported ~20K sockets being opened on the HNP
for an 8K process job running on 4K nodes. If you look at the code, you will
find that (a) 8K of those sockets are due to each MPI process connecting
directly back to the HNP, and (b) 4K of those sockets are due to the orteds
on each node connecting back to the HNP. The other 8K sockets appear to be
due to a "bug" in the code: from what I can tell so far, it appears that
either the MPI layer's BTL/TCP component is opening a socket to the HNP,
even though the HNP isn't actually part of the MPI job itself, or the
processes are opening duplicate OOB sockets back to the HNP. I am not
certain which (or either) of these is the root cause, however - it needs
further investigation to identify the source of the extra sockets.
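
As a sanity check on those numbers, here is a minimal Python sketch of the socket accounting. The function name and dictionary keys are mine (nothing in the code base is called this), and I am treating "K" as 1024 and the ~20K figure as exact, both of which are assumptions.

```python
def hnp_socket_breakdown(num_procs, num_nodes, observed_sockets):
    """Split the socket count observed on the HNP into the part we can
    explain from the code and the part we cannot (hypothetical helper)."""
    # Expected: one socket per MPI process plus one per orted (one orted/node).
    expected = num_procs + num_nodes
    return {
        "proc_to_hnp": num_procs,
        "orted_to_hnp": num_nodes,
        "expected": expected,
        "unexplained": observed_sockets - expected,
    }


# Figures from the most recent test: 8K processes, 4K nodes, ~20K sockets.
breakdown = hnp_socket_breakdown(
    num_procs=8 * 1024, num_nodes=4 * 1024, observed_sockets=20 * 1024
)
print(breakdown)
```

The unexplained remainder comes out equal to the process count, which is consistent with (but does not prove) one extra socket per MPI process.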

The large number of sockets on the HNP causes timeout problems during the
initial "connection storm" as processes start up, which subsequently causes
the MPI job to go into "retry hell". To help relieve that problem, a
"listener thread" was introduced on the HNP (courtesy of Brian) that could
absorb the connections at a much higher rate. This has now been debugged in
the current OMPI trunk (and I believe was just moved to 1.2.1 for release).

The second issue is the time it takes to transmit the various stage gate
messages from the HNP to each MPI process. The only stage gate of concern
here is STG1 since that is where substantial data is involved (we send info
required to inform each process of its peers for interconnect purposes). The
current OMPI trunk uses a "direct" method - i.e., the HNP sends the stage
gate messages directly to each MPI process.

We have already implemented two measures to help reduce this part of the
problem. First, late last year we revised the GPR notification message
system to allow subscribers to eliminate (or at least drastically reduce)
descriptive information sent with the message. We also changed the buffer
packing system to likewise eliminate data type descriptions. This succeeded
in significantly reducing the *size* of the message itself.

Second, we implemented a routed xcast messaging system that sends the stage
gate messages through the local orted. Thus, the *number* of messages being
sent dropped by a factor equal to the number of processes/node.
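
To make the message-count reduction concrete, here is a minimal Python sketch. The function is purely illustrative (not an ORTE call) and assumes one orted per node, with the HNP sending exactly one message per orted in the routed case.

```python
def xcast_messages_from_hnp(num_procs, procs_per_node, routed):
    """Number of stage gate messages the HNP itself must send.

    direct: one message per MPI process.
    routed: one message per node, relayed locally by that node's orted.
    """
    if not routed:
        return num_procs
    # Ceiling division, so a partially filled last node still counts.
    return -(-num_procs // procs_per_node)


direct = xcast_messages_from_hnp(8192, procs_per_node=2, routed=False)
routed = xcast_messages_from_hnp(8192, procs_per_node=2, routed=True)
print(direct, routed, direct // routed)
```

On a 2ppn machine the HNP sends half as many messages; the reduction factor grows with the number of processes per node.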

These two measures (reducing the message size and routing it through the
orted) cut the stage gate time by more than half (for
our 2ppn test machines). The message size changes are in the OMPI trunk and
in 1.2 - however, the xcast routing code remains solely in the ORTE trunk.
In addition, the ORTE trunk contains code for each MPI process to create a
connection to its local orted - this does not currently exist within the
OMPI trunk or 1.2 release.

The remediation plans currently underway primarily focus on the OOB, as this
appears to be the central factor in both observed issues. The primary effort
is aimed at creating a general message routing capability for the OOB.
Several of us have discussed various design options - as things stand, I owe
that group of people a draft design document (which I'm late in delivering).

A routable OOB would yield several immediate benefits. First, it would
significantly reduce the connection storm problem and provide a more
scalable connection plan for the HNP. Second, it would reduce the memory
"spike" since the HNP would be generating far fewer messages. And finally,
it would reduce the xcast transmission time (assuming the routing algorithm
includes some type of tree-like structure).
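
Since no routing design has been settled, the following is only one possible illustration of a "tree-like structure": a k-ary tree laid out heap-style over the HNP (rank 0) and the orteds (ranks 1..N). The function names, the rank numbering, and the choice of a k-ary tree are all my assumptions for the sake of the sketch, not the planned design.

```python
def tree_children(rank, size, fanout):
    """Ranks that `rank` forwards to in a k-ary tree over ranks 0..size-1."""
    first = rank * fanout + 1
    return list(range(first, min(first + fanout, size)))


def tree_depth(size, fanout):
    """Number of hops from rank 0 (the HNP) to the deepest rank."""
    depth, rank = 0, size - 1  # the last rank is always on the deepest level
    while rank > 0:
        rank = (rank - 1) // fanout  # step up to the parent
        depth += 1
    return depth


size = 1 + 4096  # HNP plus one orted on each of 4K nodes
hnp_sends = len(tree_children(0, size, fanout=2))
hops = tree_depth(size, fanout=2)
print(hnp_sends, hops)
```

In this sketch the HNP sends (and holds connections for) only `fanout` messages instead of one per orted, at the cost of a number of relay hops that grows logarithmically with the node count.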

The secondary effort is aimed at removing the copying of message buffers
within the OOB. We have two issues inside this area. First, Tim W originally
copied the buffers for protective purposes - e.g., since the OOB queues
messages, a non-blocking caller could release the buffer prior to the OOB
actually sending it. The simplest method of protection was to have the OOB
retain its own copy, thus ensuring it was always there when transmission
actually occurred. Of course, at the time we were only worrying about
launching jobs of a few hundred processes - not thousands. ;-) This practice
probably needs to be reviewed in terms of future requirements, although we
have to be careful since portions of the code may now *rely* on it.

One possible alternative that has been discussed is to create a new
"send_multiple" API that allows us to pass a single buffer along with an
array of recipients to the OOB. In this case, even though the buffer is
being copied, it would only be copied one time since the OOB would control
that memory copy while cycling through all the recipients. I'm not sure if
this is the correct approach, but it may merit some thought.
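
To show the difference in copying, here is a toy Python model of a queuing message layer. It is not the OOB code: the class, the `send_nb` method, and the `send_multiple` signature are hypothetical stand-ins, and "transmission" is just a queue. It only counts how many bytes get copied at enqueue time under each approach.

```python
class ToyOOB:
    """Toy queuing layer that tracks bytes copied when messages are queued."""

    def __init__(self):
        self.queue = []        # (recipient, payload) pairs awaiting transmission
        self.bytes_copied = 0

    def send_nb(self, recipient, buffer):
        # Protective copy per message, so the caller may release its buffer.
        payload = bytes(buffer)
        self.bytes_copied += len(payload)
        self.queue.append((recipient, payload))

    def send_multiple(self, recipients, buffer):
        # One protective copy, shared by every queued recipient.
        payload = bytes(buffer)
        self.bytes_copied += len(payload)
        for recipient in recipients:
            self.queue.append((recipient, payload))


message = bytearray(b"x" * 1024)   # stand-in for the STG1 xcast message
recipients = list(range(2048))

per_recipient = ToyOOB()
for r in recipients:
    per_recipient.send_nb(r, message)

shared = ToyOOB()
shared.send_multiple(recipients, message)

message.clear()  # caller releases its buffer before anything is transmitted
print(per_recipient.bytes_copied, shared.bytes_copied)
```

Both approaches survive the caller releasing its buffer early, but the per-recipient version copies the message once per process while the shared version copies it once in total.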

My expectation is that the current ORTE trunk's xcast routing and
orted-to-local-proc connection codes will move over to the OMPI trunk at
some time in the near future (once the current OMPI trunk "freeze", which
is to last until 1.2.1 gets out, is lifted). The OOB revisions have no
real schedule at this time - however, both code changes were targeted for
the 1.3 release (and specifically were *not* to be released in the 1.2
series, per the Dec decision).

I hope that helps provide some food for thought. Feel free to ask questions
- so far, the discussions have involved several people, so you are welcome
to just hit the mailing lists.