
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] RFC: changes to modex
From: Ralph H Castain (rhc_at_[hidden])
Date: 2008-04-02 11:19:38

On 4/2/08 8:52 AM, "Terry Dontje" <Terry.Dontje_at_[hidden]> wrote:

> Jeff Squyres wrote:
>> WHAT: Changes to MPI layer modex API
>> WHY: To be mo' betta scalable
>> WHERE: ompi/mpi/runtime/ompi_module_exchange.* and everywhere that
>> calls ompi_modex_send() and/or ompi_modex_recv()
>> TIMEOUT: COB Fri 4 Apr 2008
> [...snip...]
>> * int ompi_modex_node_send(...): send modex data that is relevant
>> for all processes in this job on this node. It is intended that only
>> one process in a job on a node will call this function. If more than
>> one process in a job on a node calls _node_send(), then only one will
>> "win" (meaning that the data sent by the others will be overwritten).
>> * int ompi_modex_node_recv(...): receive modex data that is relevant
>> for a whole peer node; receive the ["winning"] blob sent by
>> _node_send() from the source node. We haven't yet decided what the
>> node index will be; it may be (ompi_proc_t*) (i.e., _node_recv() would
>> figure out what node the (ompi_proc_t*) resides on and then give you
>> the data).
> The above sounds like there could be race conditions if more than one
> process on a node calls ompi_modex_node_send. That is, when
> ompi_modex_node_recv completes, are you really assured that none of the
> processes is still in the middle of an ompi_modex_node_send? I assume
> there must be some sort of gate that lets you make sure no one is in
> the middle of overwriting your data.

The nature of the modex actually precludes this. The modex is implemented as
a barrier, so the timing looks like this:

1. each proc registers its modex_node[proc]_send calls early in MPI_Init.
All this does is collect the data locally in a buffer

2. each proc hits the orte_grpcomm.modex call in MPI_Init. At this point,
the collected data is sent to the local daemon. The proc "barriers" at this
point and can go no further until the modex is completed.

3. when the daemon detects that all local procs have sent it a modex buffer,
it enters an "allgather" operation across all daemons. When that operation
completes, each daemon has a complete modex buffer spanning the job.

4. each daemon "drops" the collected buffer into each local proc

5. each proc, upon receiving the modex buffer, decodes it and sets up its
data structs to respond to future modex_recv calls. Once that is completed,
the proc returns from the orte_grpcomm.modex call and is released from the
barrier.

So we resolve the race condition by including a "barrier" inside the modex.
This is the current behavior as well - so this represents no change, just a
different organization of the modex'd data.

> --td