
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] IOF and scalability
From: Tim Mattox (timattox_at_[hidden])
Date: 2008-08-28 07:29:58


Great find Ralph!

On Wed, Aug 27, 2008 at 7:39 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> Hello all
>
> As some of you may remember, I am in the process of rewriting the IOF
> subsystem. While working my way through the revisions, I discovered
> something about the current IOF that significantly impacts scalability.
> Since I know some people retain interest in that area, I thought I would
> pass the observations along.
>
> When an orted fork/execs an application process, it automatically wires up
> the IOF for that process. In the current system, that entails sending a
> minimum of three messages to mpirun for each process, each message in turn
> generating an "ack" message back to the orted. Thus, during launch, the IOF
> is sending at least 6*nprocs messages across the OOB.
>
> Unfortunately, this is all done outside of our daemon collective system, so
> every message is handled independently on both ends. As you can imagine,
> mpirun gets somewhat deluged for large jobs. With the advent of the
> orte_routed framework, at least these messages don't create new TCP
> connections - but they do force mpirun to deal with a large number of
> inbound messages.
>
> Lest someone think the original authors were "stupid", let me hasten to
> point out that they wrote this system to a clear set of requirements focused
> on creating a generic RTE - i.e., one not tailored to OMPI's specific needs.
> Thus, the system was designed to support capabilities we don't need, and
> couldn't take advantage of any knowledge of the end-state OMPI was trying to
> achieve.
>
> As an example of the impact, on RoadRunner, the current IOF results in the
> transmission of over 72,000 messages between procs and mpirun during startup
> of a petaflop application - just to wire up the IOF.
>
> In the rewrite, I am taking advantage of knowing OMPI's desired final
> configuration to eliminate -all- of these communications. This should improve
> things considerably. I hope to have it completed in a week or two, though it
> won't come into the trunk until we release 1.3.
>
> Ralph
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
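
For anyone who wants to play with the numbers quoted above, here is a rough
back-of-the-envelope sketch of the lower bound Ralph describes. It is only
illustrative arithmetic, not code from the ORTE tree: the function name is
made up, and the default of three messages per process (each acked once) is
taken from his description as a minimum, so real counts may be higher.

```python
def min_iof_wireup_messages(nprocs, msgs_per_proc=3):
    """Lower bound on OOB messages needed to wire up the IOF at launch.

    Hypothetical helper: assumes each process causes `msgs_per_proc`
    messages to mpirun and that every one of them is answered by exactly
    one ack, as described in the quoted message.
    """
    if nprocs < 0 or msgs_per_proc < 0:
        raise ValueError("counts must be non-negative")
    # One outbound message plus one ack for each of them.
    return 2 * msgs_per_proc * nprocs


if __name__ == "__main__":
    # Job sizes below are arbitrary examples, not measured configurations.
    for nprocs in (16, 1024, 12000):
        print(nprocs, "procs ->", min_iof_wireup_messages(nprocs), "messages")
```

At the minimum rate of 6 messages per process, 12,000 processes already give
72,000 messages, which is the order of magnitude quoted for RoadRunner. All of
that traffic lands on mpirun, so the count grows linearly with job size.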

-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmattox_at_[hidden] || timattox_at_[hidden]
 I'm a bright... http://www.the-brights.net/