Open MPI Development Mailing List Archives

Subject: [OMPI devel] Commit r19868
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-10-31 19:50:59

Hi all

I made a commit a little earlier containing modifications that reduce
duplicate data storage and represent a first step towards
supporting fully routed RML communications, along with a new "radix
tree" routed component requested by ORNL. There will undoubtedly be
improvements to these changes over the next few months, but they
provide an initial platform for us to more thoroughly investigate the
issues involved in fully routing all out-of-band communications.
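For readers unfamiliar with the term, here is a rough sketch of how a "radix tree" of daemons can be laid out. It is not taken from the commit: it assumes daemons are numbered 0..N-1 with the HNP at rank 0 and that each node has at most `radix` children, and the function names are made up for illustration.

```python
def radix_parent(rank, radix):
    """Parent of `rank` in a complete radix-ary tree rooted at rank 0.

    Returns None for the root. The numbering scheme is an assumption
    made for this sketch, not the layout used by the routed component.
    """
    if rank == 0:
        return None
    return (rank - 1) // radix


def radix_children(rank, radix, num_daemons):
    """Children of `rank`, clipped to the daemons that actually exist."""
    first = rank * radix + 1
    return [c for c in range(first, first + radix) if c < num_daemons]


def hops_to_root(rank, radix):
    """Number of links a message crosses going from `rank` up to the root."""
    hops = 0
    while rank != 0:
        rank = radix_parent(rank, radix)
        hops += 1
    return hops
```

With 10 daemons and a radix of 4, rank 0 relays to ranks 1 through 4, rank 2 relays to rank 9, and a message from rank 9 needs two hops to reach the root. A larger radix gives a shallower tree at the cost of more connections per daemon.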

In brief, the commit:

1. removes the direct routed component and adds a new "radix" component

2. shifts storage of nidmap and pidmap info from the odls to the ess
on daemons - this is where the data is stored for everyone else, so it
makes no sense to store it someplace different on the daemon. This
required adding an API to the ess framework so that a pidmap can be
added to the data in the ess when daemons get a comm_spawn request (the
ess data store was already set up for this - it just didn't have the
API yet).

3. adds an API to the ess framework to obtain the daemon that hosts a
specified proc from the ess pidmap. Because this data is now obtained
here, we don't need to keep calling orte_routed.update_route for every
proc in our own job - so those calls have been removed from the
startup procedure. This eliminates the hash tables in every routed
module that essentially duplicated the pidmap already present in the
ess - not because anyone was stupid, but rather because the first
routed modules were originally written prior to the ess pidmap being
created, and everyone copy/pasted from there.
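To make items 2 and 3 concrete, the idea can be sketched roughly as below. This is a toy model, not the ess code: the class, its method names, and the data layout (a nidmap from node index to daemon vpid, and one pidmap per job listing the node index of each proc) are assumptions chosen for illustration.

```python
class EssStore:
    """Hypothetical stand-in for the data held in the ess."""

    def __init__(self):
        self.nidmap = {}   # node index -> vpid of the daemon on that node
        self.pidmaps = {}  # job id -> list; entry [vpid] is that proc's node index

    def add_nidmap(self, node_to_daemon):
        """Record which daemon runs on which node."""
        self.nidmap.update(node_to_daemon)

    def add_pidmap(self, job, proc_nodes):
        """Register the proc-to-node layout of a job, e.g. after a comm_spawn."""
        self.pidmaps[job] = list(proc_nodes)

    def proc_get_daemon(self, job, vpid):
        """Return the vpid of the daemon hosting proc (job, vpid), or None if unknown."""
        pidmap = self.pidmaps.get(job)
        if pidmap is None or not 0 <= vpid < len(pidmap):
            return None
        return self.nidmap.get(pidmap[vpid])


# Usage: two nodes, an initial job with four procs, then a spawned job.
ess = EssStore()
ess.add_nidmap({0: 1, 1: 2})        # node 0 -> daemon 1, node 1 -> daemon 2
ess.add_pidmap(1, [0, 0, 1, 1])     # job 1: procs 0,1 on node 0; procs 2,3 on node 1
ess.add_pidmap(2, [1, 0])           # job 2 arrives via comm_spawn
```

Because a routed module can ask this one store which daemon hosts a proc, it has no need to keep its own per-proc hash table populated through repeated update_route calls.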

At the moment, the revised trunk fully routes all communications with
two exceptions:

1. the binomial module still directly routes between all daemons -
i.e., communications don't flow along the tree, but instead short-
circuit the tree to go directly to the daemon that hosts the target
proc. I propose to change this in a later revision, but want to leave
something constant for the moment.

2. all routed modules have daemons sending direct to the HNP itself.
This was required for two reasons:

(a) during startup, the daemons need to "phone home", but have no
knowledge at that moment of the contact info for the other daemons in
the routing tree. Thus, they have no choice but to send direct to the
HNP. We hope to change this in a later revision by switching to well-
known static ports - but for now, we have to go direct.

(b) in our current shutdown procedure, the outbound message telling
the orteds to terminate goes out across the module's routing tree.
This xcast procedure causes the daemon to relay the cmd to the next
daemons in the tree, and then to execute it. Thus, after relaying the
cmd, the daemon dutifully terminates. However, we require each daemon
to send a confirming message back to the HNP so that the HNP knows it
can exit. That returning message cannot get through because the
intermediate daemons have already terminated. I am working on
alternative methods for detecting daemon termination so we can
eliminate the return "ack" - but for now, we have to send the "ack"
direct to the HNP to ensure it gets through.
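The shutdown problem in (b) can be illustrated with a small simulation. This is a deliberately simplified model, not ORTE code: it assumes a binary routing tree rooted at the HNP (rank 0), processes daemons in the order the terminate cmd reaches them, and treats every daemon as already gone by the time its descendants try to send their "ack". In reality this is a race, so take it as the worst-case ordering.

```python
from collections import deque


def parent(rank):
    """Parent in an assumed binary tree rooted at the HNP (rank 0)."""
    return None if rank == 0 else (rank - 1) // 2


def children(rank, n):
    """Children of `rank` among daemons 0..n-1 in the same assumed tree."""
    return [c for c in (2 * rank + 1, 2 * rank + 2) if c < n]


def routed_ack_arrives(sender, alive):
    """Walk the ack up the tree; it is lost at the first terminated relay."""
    hop = parent(sender)
    while hop != 0:
        if not alive[hop]:
            return False
        hop = parent(hop)
    return True


def simulate_shutdown(n, direct_ack):
    """Return the set of daemons whose ack reaches the HNP."""
    alive = [True] * n
    acked = set()
    pending = deque(children(0, n))      # the HNP starts the xcast
    while pending:
        d = pending.popleft()
        pending.extend(children(d, n))   # relay the terminate cmd first ...
        if direct_ack or routed_ack_arrives(d, alive):
            acked.add(d)
        alive[d] = False                 # ... then execute it and exit
    return acked
```

With seven processes (the HNP plus six daemons), routing the ack along the tree lets only daemons 1 and 2 report back, since they talk to the HNP without a relay. Sending the ack direct lets all six through.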

Some preliminary tests I've conducted indicate that fully routing
communications has no detrimental impact on either launch speed or IB
wireup time. I plan to test this further at larger scales, and to
continue developing the new capabilities.

Please let me know if you encounter any problems, or have any comments/