
Open MPI Development Mailing List Archives


From: Ralph Castain (rhc_at_[hidden])
Date: 2007-06-12 20:58:20

Yo all

Over the last 12-18 months, several of us (both inside and outside the Open
MPI community) have discussed a variety of methods for making OpenRTE
considerably faster - i.e., changes that would decrease launch times by at
least one order of magnitude. While we documented the results, our general
feeling has been to hold off from any implementation as the required changes
would have compromised some features that users outside the Open MPI
community might have wished to exploit.

In recent months, however, the non-Open MPI users have largely decided to
pursue other options. There are a couple of reasons for this, but they are
irrelevant to this discussion. What is relevant is that with the departure
of those interests, there no longer is a valid reason for not streamlining
the system. I have discussed this situation with several members of the Open
MPI community, and the strong consensus was to go ahead with the necessary changes.

The changes will cost us a slight decrease in flexibility and programmer
friendliness, but preliminary estimates show a potential decrease in launch
time of roughly 20x at scale. The cost, therefore, seems worth the gain.

The changes primarily revolve around the use of the GPR. Let me make
something clear right away - it is *not* the GPR itself that is the cause of
the slowdown, but rather the way we utilize it and the secondary impacts
that result from those choices. Yes, the GPR *will* also see a major
increase in the speed with which it processes requests, but the primary
benefits will come from other areas in the code.

The primary change involves replacement of the character string keys used to
label data with uint8_t's. The immediate impact of this change is to reduce
the size of the STG1 stage gate message - the primary rate limiting factor
in today's launch procedure - by a factor of approximately 15-20. It means,
however, that keys will now have to be defined in a central location (you
won't just be able to declare a new string in your component and use it). We
will retain some flexibility by extending the name service to support
dynamic key definition a la the current RML tag service. We expect, though,
that all ORTE standard keys will be defined in a new orte_schema.h file to
avoid the speed impact of registering dynamic keys (especially on remote
nodes).

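To make the idea concrete, here is a minimal sketch of what the centralized definitions in orte_schema.h might look like. Every name below is illustrative only, not the actual ORTE schema: the point is that each key becomes a fixed one-byte value instead of a character string, with a range reserved for dynamic registration.

```c
/* Hypothetical sketch of centralized key definitions for orte_schema.h.
 * All names are illustrative, not the real ORTE keys. A key costs one
 * byte on the wire instead of a full string. */
#include <stdint.h>

typedef uint8_t orte_gpr_key_t;

enum {
    ORTE_KEY_PROC_NAME    = 0,   /* was a string key such as "orte-proc-name" */
    ORTE_KEY_NODE_ID      = 1,
    ORTE_KEY_PROC_STATE   = 2,
    /* ... remaining ORTE standard keys, all defined centrally ... */
    ORTE_KEY_DYNAMIC_BASE = 128  /* range reserved for dynamically
                                  * registered keys, a la RML tags */
};
```

Components would then reference these shared constants rather than declaring their own strings.
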
This change also allows us to eliminate all dictionary functions from the
GPR, replacing them by simply using the key as a direct index into the GPR
storage arrays. This has the immediate benefit of greatly simplifying the
GPR internal code (e.g., the search code becomes a simple array index) and
provides a corresponding increase in speed. Similarly, GPR segments will
also become simple numeric indices. Tokens that were used to identify
containers on a given segment will be replaced by numeric indices as well -
for job segments, the index will simply be the vpid of each process. On the
job-master segment, the container index will be the jobid.

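The payoff of numeric keys for the GPR internals can be sketched in a few lines (the structure and function names here are invented for illustration): with a uint8_t key, the "search" for a data element is literally an array subscript.

```c
/* Illustrative sketch (hypothetical names): once keys are uint8_t
 * values, a GPR lookup is a direct array index - no dictionary,
 * no string comparison. */
#include <stdint.h>
#include <stddef.h>

#define ORTE_GPR_MAX_KEYS 256          /* every possible uint8_t key */

typedef struct {
    void *values[ORTE_GPR_MAX_KEYS];   /* one slot per key */
} orte_gpr_container_t;

/* Store a value: the key itself is the storage index. */
static void gpr_put(orte_gpr_container_t *c, uint8_t key, void *val)
{
    c->values[key] = val;
}

/* O(1) retrieval - this replaces the dictionary search code. */
static void *gpr_get(orte_gpr_container_t *c, uint8_t key)
{
    return c->values[key];
}
```

Segments and containers become the same kind of index one level up, which is what lets the dictionary functions disappear entirely.
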
On the node segment, the container index will be the nodeid - a numeric
equivalent to each node's character string name. We will assign a numeric id
to every node as we allocate it, and use that id in place of the current
string nodenames. For those of you who want the string nodename in the proc
structures for debugging and user-friendly error messages, we will provide
that info based on an MCA param (either the current one or a slight variant
- remains TBD). This will allow you to assess the performance impact of
retaining those nodenames. Meantime, ORTE itself will be converted to use
the node id for reduced communication and a more efficient interface.

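A rough sketch of the nodeid scheme, with all names hypothetical: each node gets the next numeric id as it is allocated, and the string name is retained only for the reverse lookup used in debugging and error messages.

```c
/* Hypothetical sketch: numeric node ids assigned at allocation time.
 * Messages carry the small integer id; the string name survives only
 * for friendly output (per the MCA param discussed above). */
#include <stdint.h>
#include <string.h>

#define MAX_NODES 1024

static char     node_names[MAX_NODES][64];   /* kept for error messages */
static uint32_t num_nodes = 0;

/* Called once per node during allocation; returns its nodeid. */
static uint32_t assign_nodeid(const char *name)
{
    uint32_t id = num_nodes++;
    strncpy(node_names[id], name, sizeof(node_names[id]) - 1);
    return id;
}

/* Reverse lookup - only needed when string nodenames are enabled. */
static const char *nodeid_to_name(uint32_t id)
{
    return node_names[id];
}
```
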
Finally, we will further reduce the size of the STG1 message (and any other
stage gate messages) by compressing the data stream. First, we will remove
the current system of indexing data using process names - the GPR will
ensure that data is returned in a container-ordered array. Thus, we can know
for certain that the data from each container is being provided to us in
sequential order without having to include some index such as a process or
node name.

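The positional encoding described above can be sketched as follows (the message layout is invented for illustration): if entries arrive in container order, entry i belongs to vpid i, so no process name ever crosses the wire.

```c
/* Hypothetical layout: because the GPR returns container data in
 * order, the stage gate payload is a bare array - the receiver
 * recovers the vpid from the position, not from the message. */
#include <stdint.h>

#define NPROCS 4   /* fixed size for the sketch only */

typedef struct {
    uint32_t contact[NPROCS];   /* example payload, one word per proc */
} stg1_payload_t;

/* Receiver side: index by vpid directly. */
static uint32_t contact_for_vpid(const stg1_payload_t *p, uint32_t vpid)
{
    return p->contact[vpid];
}
```
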
Second, we will remove duplication of data across subscriptions going to a
specific process. The current system simply sends the data requested by each
subscription without worrying about any duplications between subscriptions.
Hence, we send multiple copies of node names, process names, and other
information across the wire as part of the STG1 message. As part of a later
stage to this planned change, we will compress that information by dealing
with duplication at the local level - i.e., the GPR proxy will maintain a
record of duplicate data requests, a single copy of each data element will
be sent, and the GPR proxy will deal with the duplication at its end.

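The bookkeeping the proxy would need for this deduplication is simple, given the one-byte keys. As a sketch (hypothetical, not the planned implementation): counting the unique keys across all subscriptions gives the number of data elements actually transmitted, however many subscriptions requested each one.

```c
/* Sketch of the dedup accounting: with uint8_t keys, the set of
 * elements already seen fits in a 256-entry table. Only the unique
 * elements go on the wire; the proxy fans them out locally. */
#include <stdint.h>

static uint32_t wire_elements(const uint8_t *keys, uint32_t n)
{
    uint8_t  seen[256] = {0};
    uint32_t unique = 0;

    for (uint32_t i = 0; i < n; i++) {
        if (!seen[keys[i]]) {    /* first subscription asking for it */
            seen[keys[i]] = 1;
            unique++;
        }
    }
    return unique;               /* copies actually sent on the wire */
}
```
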
These changes will be implemented in several phases on a tmp branch. Each
phase will be tested across several environments and then brought over to
the trunk. The first phase will be the most intrusive as it will involve the
conversion from string to numeric keys, along with the corresponding changes
to the GPR. I hope to complete this phase sometime in early July.

Please feel free to offer comments, suggestions, or - if so inclined -
assistance. I'll keep the community updated on progress as we go.