Open MPI Development Mailing List Archives

Subject: [OMPI devel] ORTE
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-06-16 06:08:27

Over the next month, there will be significant changes to ORTE both in terms of framework APIs and internal behavior. This work will focus on a few areas:

1. launch scalability and timing. I try to review our status on this whenever we prepare for the start of a new release series, and as usual this prompted some work in this area. Most of the effort will focus on development of the async modex functionality described in a separate email thread.

2. access to the BTLs, which we recently agreed to move to the OPAL layer.

3. memory footprint reduction, particularly the removal or minimization of per-proc data stored in every process (e.g., retaining a complete copy of all modex info for all ranks in each process, regardless of communication needs).

It was my understanding that others interested in ORTE had forked their code bases and were not tracking the main developers' trunk. At the recent developers meeting, however, it became clear that other groups are in fact trying to track the trunk and are resolving conflicts behind the scenes. To help these groups, I thought it might be useful to outline what will be happening in the near future.

The biggest anticipated changes lie in the modex and RML/OOB areas. I've outlined the async modex changes in a separate email thread. One additional element of that work will be the porting of the "db" (database) framework back from the ORCM project to ORTE. This framework provides a "hook" for researchers working on distributed, high-performance databases to investigate alternative ways of scalably supporting our modex information in a fault-tolerant manner. Eventually, the work in areas such as distributed hash-tables (DHTs) used in ORCM may make its way back to OMPI.

In addition to scalability, the modex work is intended to contribute to the memory reduction goal. The primary emphasis here will be on changing from having each process retain complete knowledge of the contact info, locality, etc. for every process in the job, to a strategy of only caching info for processes with which the proc actually is communicating. We may look at removing all per-proc caching of info (perhaps using a shared memory model), but that has performance implications and needs further investigation.
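
To make the caching strategy concrete, here is a minimal sketch in Python (not ORTE code, which is C). The `PeerInfoCache` class, the `fake_lookup` function, and the URI format are all hypothetical names invented for illustration; the only idea taken from the text above is that a process fetches and retains info solely for the peers it actually communicates with.

```python
from typing import Callable, Dict


class PeerInfoCache:
    """Holds contact/locality info only for peers this process has talked to."""

    def __init__(self, fetch: Callable[[int], dict]) -> None:
        self._fetch = fetch              # hypothetical lookup, e.g. a query to a shared store
        self._cache: Dict[int, dict] = {}
        self.fetch_count = 0             # how many lookups were actually issued

    def get(self, rank: int) -> dict:
        # Fetch on first contact only; later calls are served from the local cache.
        if rank not in self._cache:
            self.fetch_count += 1
            self._cache[rank] = self._fetch(rank)
        return self._cache[rank]

    def __len__(self) -> int:
        return len(self._cache)


def fake_lookup(rank: int) -> dict:
    # Stand-in for the real source of modex data (assumption: 8 procs per node).
    return {"uri": f"tcp://node{rank // 8}:{5000 + rank}", "node": rank // 8}


cache = PeerInfoCache(fake_lookup)
cache.get(3)
cache.get(3)    # second call hits the cache, no new lookup
cache.get(17)
print(len(cache), cache.fetch_count)
```

In this sketch the memory held per process grows with the number of distinct peers contacted rather than with the size of the job, at the cost of a lookup on first contact.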

As part of that effort, we will be removing the nidmap/pidmap constructs and storing that info in the same database being used by the modex. This collapses the grpcomm and ess APIs by consolidating access to proc-related data in the "db" API. The grpcomm framework will retain responsibility for executing RTE collective operations such as modex, but the data will be stored in the db. Likewise, the ess will no longer be used to access data such as a proc's locality - instead, that data will be obtained from the db, regardless of how it is stored or where it is located.
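
As a rough illustration of what a single access point for proc-related data could look like, consider the Python sketch below. It is not the actual "db" framework API: `ProcDB`, `store`, `fetch`, and the key names are hypothetical, chosen only to show node mapping, locality, and modex endpoint data living behind one interface.

```python
from typing import Any, Dict, Tuple


class ProcDB:
    """Toy key-value store indexed by (proc rank, key); names are hypothetical."""

    def __init__(self) -> None:
        self._data: Dict[Tuple[int, str], Any] = {}

    def store(self, rank: int, key: str, value: Any) -> None:
        self._data[(rank, key)] = value

    def fetch(self, rank: int, key: str) -> Any:
        # Callers do not know or care how or where the value is held.
        try:
            return self._data[(rank, key)]
        except KeyError:
            raise KeyError(f"no '{key}' recorded for proc {rank}") from None


db = ProcDB()
db.store(0, "node_id", 4)                    # the kind of data a nidmap would hold
db.store(0, "locality", "same-socket")       # the kind of data previously read via the ess
db.store(0, "endpoint", "tcp://10.0.0.4:5000")  # modex-style endpoint info

print(db.fetch(0, "locality"))
```

The design point is that the backing store could later be swapped (local table, shared memory, distributed hash table) without changing the callers.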

The modex work is tentatively slated for the 1.7 series, though how much of it gets there remains to be seen. The work is being done in a bitbucket repo:

Changes to the RML/OOB are largely driven by the long-standing need to clean up and refactor that code, the need to support async progress on messaging, and the upcoming availability of the BTLs. This code has served us well for quite some time, but the to-do list has grown over the years, including the desire for better support of multi-NIC environments. The work will require significant changes to the RML and OOB framework APIs, including:

* the removal of blocking sends (a persistent source of trouble over the years)

* moving receive matching logic to the RML layer, thus simplifying the OOB components and making them look more like the BTLs.

* adding a UDP component (ported back from ORCM) to the OOB, along with creating retransmit and flow control support frameworks in OPAL (modeled after the ORCM version) to handle unreliable transports in both BTL (which will also receive a UDP component) and OOB

* converting the OOB to a standalone (i.e., no longer opened and inited from inside the RML), multi-select framework that supports multiple transports

* allowing each OOB component to return an array of modules, one for each interface (à la the BTL) - this obviously has implications for the "comm failed" error response, as a failed connection to one OOB module may not mean complete loss of connectivity or process death

* changing the URI construct/parsing methods for the initial contact info that goes on the orted cmd line to reflect the above changes, allowing multiple OOB modules to contribute to it while retaining the ability to limit overall string size

* altering the OOBs to use the modex construct for exchange of endpoint info

* shifting the routing responsibilities from the RML to the OOB level to accommodate connectionless transports. The OOB module will determine if routing is required and send the message accordingly. When received, the message will be "promoted" to the RML layer, thus allowing the routing process to decide the best transport to use from that point forward (e.g., continuing to route the message, or shifting to a connectionless transport to send the message directly to its destination).

* adding support for OOB failover, with each module in an OOB component attempting to send a message via alternative modules if a module is unable to complete transmission, and then returning the message to the RML for rescheduling on another transport if no module can successfully complete the operation.

* adding heartbeat support for situations where a connectionless transport is the sole contact point between daemons - we already have heartbeat capability in the code base, but need the proper hooks to ensure it is active when needed.
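
The failover item in the list above can be sketched as follows. This is an illustrative Python model, not the planned C implementation: `Module`, `component_send`, and `rml_send` are hypothetical names, and a send "succeeds" here simply when the module's link is marked up.

```python
from typing import List, Optional, Tuple


class Module:
    """One OOB module, i.e. one transport bound to one interface (hypothetical)."""

    def __init__(self, name: str, link_up: bool) -> None:
        self.name = name
        self.link_up = link_up

    def send(self, dest: int, msg: bytes) -> bool:
        # A real module would attempt transmission; here success mirrors link state.
        return self.link_up


def component_send(modules: List[Module], dest: int, msg: bytes) -> Optional[str]:
    """Try each module of one component in turn; return the name of the one that worked."""
    for module in modules:
        if module.send(dest, msg):
            return module.name
    return None  # no module in this component could complete the send


def rml_send(components: List[Tuple[str, List[Module]]], dest: int, msg: bytes) -> str:
    """Reschedule the message on the next transport when a whole component fails."""
    for comp_name, modules in components:
        used = component_send(modules, dest, msg)
        if used is not None:
            return f"{comp_name}:{used}"
    raise ConnectionError(f"no transport could reach proc {dest}")


components = [
    ("tcp", [Module("eth0", link_up=False), Module("eth1", link_up=True)]),
    ("udp", [Module("eth0", link_up=True)]),
]
print(rml_send(components, dest=5, msg=b"hello"))  # tcp:eth1
```

Note that in this model a single failed module never surfaces as a communication failure; an error is raised only once every module of every transport has been tried.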

This work is definitely pointed at the 2.0 series (not 1.7), and will begin entering the trunk after the branch. The work is being done in another bitbucket repo: