What I want to do is make the current "modex" become a no-op, which means
we have a lazy modex. As I noted in my commit message, this scales horribly
if we don't, hence the MCA param requirement so people don't do this unless
their BTL/MTLs will support it.
What I found when playing with that arrangement is that a BTL/MTL is going
to need or want data at first message, but that data may not be available
yet because that particular remote proc hasn't registered all of its modex
data yet. A beautiful race condition. So I was forced to block everyone at
"modex" just to ensure the remote data was available at time of request.
If I remove the global "barrier" requirement, then I don't want to "block"
on modex_recv, as this is done on a per-proc basis. Even though one proc
isn't ready to return the data, another might be - and so I'd let you queue
up as many modex_recv calls as you like, resolving each of them as data
becomes available. This leaves the MPI layer free to send a message as soon
as the target remote proc is ready, without waiting for some other proc to
register its modex info.
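To make the queueing idea concrete, here is a minimal single-threaded sketch in C. It is only an illustration under stated assumptions, not the ORTE/OMPI code: every sketch_* name, the fixed-size tables, and the string payload are hypothetical, and a real implementation would need thread safety and an actual request to the remote process.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PROCS   8    /* hypothetical job size for the sketch */
#define MAX_PENDING 16   /* hypothetical cap on queued requests */

/* Hypothetical callback type: invoked once the peer's data is available. */
typedef void (*sketch_modex_cbfunc_t)(int peer, const char *data, void *cbdata);

/* Loosely modeled on the proposed ompi_modex_recv_t: a pointer to the
 * returned data plus an "active" flag that is cleared on completion. */
typedef struct {
    const char *data;
    volatile bool active;
} sketch_modex_recv_t;

typedef struct {
    int peer;
    sketch_modex_cbfunc_t cbfunc;
    void *cbdata;
} sketch_pending_t;

static const char *modex_db[MAX_PROCS];       /* NULL until the peer registers */
static sketch_pending_t pending[MAX_PENDING]; /* requests waiting for data */
static size_t n_pending = 0;

/* Non-blocking receive: call back immediately if the data is already
 * present, otherwise queue the request and return to the caller. */
void sketch_modex_recv(int peer, sketch_modex_cbfunc_t cbfunc, void *cbdata)
{
    assert(peer >= 0 && peer < MAX_PROCS);
    if (modex_db[peer] != NULL) {
        cbfunc(peer, modex_db[peer], cbdata);
        return;
    }
    assert(n_pending < MAX_PENDING);
    pending[n_pending++] = (sketch_pending_t){ peer, cbfunc, cbdata };
}

/* Called when a peer's modex data arrives: store it and resolve only the
 * requests queued for that peer; requests for other peers stay queued.
 * Callbacks must not issue new requests from inside this loop. */
void sketch_modex_register(int peer, const char *data)
{
    assert(peer >= 0 && peer < MAX_PROCS);
    modex_db[peer] = data;   /* caller keeps the buffer alive */
    size_t kept = 0;
    for (size_t i = 0; i < n_pending; i++) {
        if (pending[i].peer == peer) {
            pending[i].cbfunc(peer, data, pending[i].cbdata);
        } else {
            pending[kept++] = pending[i];
        }
    }
    n_pending = kept;
}

/* Example callback: record the data and clear "active" so that a caller
 * who does need to block can spin on the flag. */
void sketch_recv_done(int peer, const char *data, void *cbdata)
{
    (void)peer;
    sketch_modex_recv_t *req = (sketch_modex_recv_t *)cbdata;
    req->data = data;
    req->active = false;
}
```

A caller that really must block would set active to true, issue sketch_modex_recv with sketch_recv_done as the callback, and then drive progress until active goes false. A caller that can proceed would just issue the request and carry on with other peers.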
On Mon, Jan 13, 2014 at 12:05 PM, Barrett, Brian W <bwbarre_at_[hidden]> wrote:
> Is there any place that this can actually be used? It's a fairly large
> change to the RTE interface (which we should try to keep stable), and I
> can't convince myself that it's useful; in general, if a BTL or MTL is
> asking for a piece of data, the MPI library is stuck until that data's
> available. I can see doing some lazy approach, but I can't see making the
> modex_recv call non-blocking.
> On 1/11/14 9:28 PM, "Ralph Castain" <rhc_at_[hidden]> wrote:
> >NOTE: This will involve a change to the MPI-RTE interface
> >WHAT: Modify modex_recv to add a callback function that will return the
> >requested data when it is available
> >WHY: Enable faster startup on large scale systems by eliminating the
> >current mandatory modex barrier during MPI_Init
> >HOW: The ompi_modex_recv functions will have callback function and
> >     (void*)cbdata arguments added to them.
> >     An ompi_modex_recv_t struct will be defined that includes a pointer
> >     to the returned data plus a "bool active" that can be used to detect
> >     when the data has been returned if blocking is required.
> >     When a modex_recv is issued, ORTE will check for the presence of the
> >     requested data and immediately issue a callback if the data is
> >     available. If the data is not available, then ORTE will request the
> >     data from the remote process, and execute the callback when the
> >     remote process returns it.
> >     The current behavior of a blocking modex barrier will remain the
> >     default - the new behavior will only take effect if specifically
> >     requested by the user via MCA param. With this new behavior, the
> >     current call to "modex" in MPI_Init will become a "no-op" when the
> >     processes are launched via mpirun - this will be executed in ORTE so
> >     that other RTEs that do not wish to support async modex behavior are
> >     not impacted.
> >WHEN: No hurry on this as it is intended for 1.9, so let's say mid Feb.
> >      Info on a branch will be made available in the near future.
> >devel mailing list
> Brian W. Barrett
> Scalable System Software Group
> Sandia National Laboratories