Open MPI Development Mailing List Archives

From: Ralph Castain (rhc_at_[hidden])
Date: 2006-12-04 08:26:26

Hello all

If you are interested in the ongoing scalability work, or in the RML/OOB in
ORTE, please read on - otherwise, feel free to hit "delete".

As many of you know, we have been working towards solving several problems
that affect our ability to operate at large scale. Some of the required
modifications to the code base have recently been applied to the trunk.

We have known since the OOB was originally written over two years ago that
it contains some inherent scalability limits. For example, immediately upon
opening, the system obtains contact info for all daemons in the universe,
opens sockets to them, and sends an initial message to each. It then does
the same with all the application processes in its job.

As a result, for a 2000 process job running on 500 nodes, each application
process will immediately open and communicate across 2501 sockets (2000
procs + 500 daemons [one per node] + the HNP) during the startup phase.
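
As a rough check of that arithmetic, here is a small Python sketch. The
function and parameter names are purely illustrative and are not part of
the ORTE code base; it assumes one daemon per node and a single HNP, as in
the example above.

```python
def startup_sockets(num_procs, num_nodes, daemons_per_node=1):
    """Sockets one application process opens at startup under the current OOB.

    Hypothetical helper: counts every proc in the job, every daemon in the
    universe, and the single HNP, matching the tally in the text.
    """
    daemons = num_nodes * daemons_per_node
    hnp = 1
    return num_procs + daemons + hnp


print(startup_sockets(2000, 500))  # 2501
```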

If you really want to imagine some fun, now have that job comm_spawn 500
processes across the 500 nodes, and *don't* reuse daemons. As each new
daemon is spawned, every process in the original job (including the original
daemons) is notified, loads the new contact info for that daemon, opens a
socket to it, and does an "ack" comm. After all 500 new daemons are running,
they now launch the 500 new procs, each of which gets the info on 1000
daemons plus the info for 2000 parents and 500 peers, and immediately opens
1000 daemons + 2000 parents + 500 peers + 1 HNP = 3501 sockets!
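
The same tally for the comm_spawn case can be sketched as below. Again,
the names are made up for illustration, and the sketch assumes no daemon
reuse (one new daemon per node) and a single HNP.

```python
def spawned_proc_sockets(parent_procs, old_daemons, new_daemons, new_procs):
    """Sockets each newly spawned process opens, per the scenario in the text.

    Hypothetical helper: all daemons (old and new), all parent procs, all
    peers in the new job, plus the HNP.
    """
    all_daemons = old_daemons + new_daemons
    hnp = 1
    return all_daemons + parent_procs + new_procs + hnp


print(spawned_proc_sockets(parent_procs=2000, old_daemons=500,
                           new_daemons=500, new_procs=500))  # 3501
```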

This is acceptable for small jobs, but causes considerable delay during
startup for large jobs. A few other OOB operational characteristics further
exacerbate the problem - I will detail those in a document on the wiki to
help foster greater understanding.

Jeff Squyres and I are about to begin a major revision of the RML/OOB code
to resolve these problems. We will be using a staged approach to the effort:

1. separate the OOB's actions for loading contact info from actually opening
a socket to a process. Currently, the OOB immediately opens a socket and
performs an "ack" communication whenever contact info for another process is
loaded into it. In addition, the OOB immediately subscribes to the job
segment of the provided process, requesting that this process be alerted to
*any* change in OOB contact info for any process in that job. These actions
need to be separated out.

2. revise the RML/OOB init/open procedure. These are currently interwoven in
a manner that causes the OOB to execute registry operations that are not
needed (and actually cause headaches) during orte_init. The procedure will
be revised so that connections to the HNP and to the process' local orted
are opened immediately, while all other contact info (e.g., for the other
procs in the job) is simply loaded into the OOB's contact tables, with no
sockets opened until first communication.

3. revise the xcast procedure so that it relays via the daemons and not the
application processes. For systems that do not use our daemons, alternative
mechanisms will be developed.
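
To make the idea behind steps 1 and 2 concrete, here is a minimal Python
sketch of a contact table that connects lazily. It is only an illustration
of the general technique and not the planned OOB design: the class, its
methods, the process names, and the placeholder URIs are all hypothetical,
and a plain list stands in for a socket.

```python
class LazyContactTable:
    """Hypothetical sketch: store contact info now, connect on first use."""

    def __init__(self, connect):
        self._connect = connect    # callable(uri) -> connection object
        self._contacts = {}        # process name -> contact URI
        self._connections = {}     # process name -> open connection

    def load_contact(self, name, uri):
        # Loading contact info only records it; no socket is opened.
        self._contacts[name] = uri

    def open_now(self, name):
        # Eager connect, intended only for the HNP and the local orted.
        return self._get_connection(name)

    def send(self, name, payload):
        # The first send to a process opens its connection; later sends reuse it.
        self._get_connection(name).append(payload)

    def _get_connection(self, name):
        if name not in self._connections:
            self._connections[name] = self._connect(self._contacts[name])
        return self._connections[name]

    @property
    def open_count(self):
        return len(self._connections)


def demo(num_peers=2000):
    # A list stands in for a socket; the URIs are placeholders.
    table = LazyContactTable(connect=lambda uri: [])
    table.load_contact("hnp", "tcp://hnp-host:5000")
    table.load_contact("orted", "tcp://localhost:5001")
    for i in range(num_peers):
        table.load_contact(f"peer-{i}", f"tcp://node-{i % 500}:6000")

    table.open_now("hnp")
    table.open_now("orted")
    eager = table.open_count           # only the two eager connections

    table.send("peer-7", b"hello")
    after_first = table.open_count     # one more, opened on first send

    table.send("peer-7", b"again")
    after_second = table.open_count    # unchanged, connection is reused
    return eager, after_first, after_second


print(demo())  # (2, 3, 3)
```

With 2000 peers loaded, the process in this sketch holds 2 open connections
after init instead of 2501, and adds one per peer it actually talks to.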

At some point in the future, a fully routable OOB will be developed to
remove the need for so many sockets on each application process. For now,
these steps should improve our startup time considerably.

With some luck and (hopefully) not too many conflicting priorities, Jeff and
I may complete this work by Christmas, though sometime in early January is
more likely. We will be working on a tmp branch, but you may see some
transfer of code to the trunk as we progress.

As always, feel free to comment and/or make suggestions!