Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [EXTERNAL] RFC: async modex
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-01-19 16:31:31


I have the branch complete for executing this - please see

https://bitbucket.org/rhc/ompi-scale

RFC timeout set to Feb 4th, after that week's telecon.
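
A minimal sketch of the blocking-with-timeout behavior described in the quoted message below. It is not the code on the branch. Every name here (the sk_ prefix, the return codes, the progress hook) is hypothetical, and the timeout is passed as an argument rather than read from an MCA param. The call spins a caller-supplied progress function until the requested peer's data is present, and gives up once the timeout has elapsed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define SK_MAX_PEERS 8

/* Hypothetical return codes, not Open MPI's own constants. */
enum { SK_SUCCESS = 0, SK_ERR_TIMEOUT = -1 };

/* One slot per peer: the modex data that peer has published, if any. */
typedef struct {
    bool        present;
    const void *data;
    size_t      size;
} sk_entry_t;

static sk_entry_t sk_store[SK_MAX_PEERS];

/* Stand-in for the RTE recording data that arrived from a peer. */
static void sk_deliver(int peer, const void *data, size_t size)
{
    assert(peer >= 0 && peer < SK_MAX_PEERS);
    sk_store[peer].data = data;
    sk_store[peer].size = size;
    sk_store[peer].present = true;
}

/* Drives communication forward once; supplied by the caller. */
typedef void (*sk_progress_fn_t)(void *ctx);

/* Block until the peer's data is present or timeout_sec has elapsed.
 * time() has one-second resolution, which is enough for a sketch. */
static int sk_modex_recv(int peer, double timeout_sec,
                         sk_progress_fn_t progress, void *ctx,
                         const void **data, size_t *size)
{
    assert(peer >= 0 && peer < SK_MAX_PEERS);
    time_t start = time(NULL);

    while (!sk_store[peer].present) {
        if (difftime(time(NULL), start) >= timeout_sec) {
            return SK_ERR_TIMEOUT;   /* remote proc never answered */
        }
        progress(ctx);
    }
    *data = sk_store[peer].data;
    *size = sk_store[peer].size;
    return SK_SUCCESS;
}

/* Demo progress state: the peer's data shows up after a few ticks. */
typedef struct {
    int         peer;
    int         ticks_left;
    const void *data;
    size_t      size;
} sk_late_peer_t;

static void sk_late_peer_progress(void *ctx)
{
    sk_late_peer_t *lp = ctx;
    if (lp->ticks_left > 0 && --lp->ticks_left == 0) {
        sk_deliver(lp->peer, lp->data, lp->size);
    }
}

/* Demo progress function for a peer that never answers. */
static void sk_silent_progress(void *ctx)
{
    (void)ctx;
}
```

Only the caller asking about a given peer waits on that peer. A call for a peer whose data is already present returns without invoking the progress function at all.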

On Jan 17, 2014, at 9:57 AM, Ralph Castain <rhc_at_[hidden]> wrote:

> After discussion on the telecon, we decided to:
>
> 1. let the modex be non-blocking so we can fall through - only when the corresponding MCA param is set!
>
> 2. do not modify modex_recv to add the callback, as the MPI layer really doesn't know how to handle this in an async fashion. Modifying that behavior would be difficult and could wind up impacting the critical path - something we may decide to look into more at a later time.
>
> So we will block in a call to modex_recv until the info for that particular process can be obtained. I'll add a timeout feature (via yet another MCA param) so we can gracefully recover if the remote proc never answers for some reason.
>
> Will provide an update when this is ready
>
>
> On Jan 13, 2014, at 3:00 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>
>> What I want to do is make the current "modex" become a no-op, which means we have a lazy modex. As I noted in my commit message, this scales horribly if we don't, hence the MCA param requirement so people don't do this unless their BTL/MTLs will support it.
>>
>> What I found when playing with that arrangement is that a BTL/MTL is going to need or want data at first message, but that data may not be available yet because that particular remote proc hasn't registered all of its modex data yet. A beautiful race condition. So I was forced to block everyone at "modex" just to ensure the remote data was available at time of request.
>>
>> If I remove the global "barrier" requirement, then I didn't want to "block" on modex_recv as this is done on a per-proc basis. Even though one proc isn't ready to return the data, another might be - and so I'd let you queue up as many modex_recv calls as you like, resolving each of them as data becomes available. This leaves the MPI layer free to send a message as soon as the target remote proc is ready, without waiting for some other proc to register its modex info.
>>
>> Make sense?
>>
>>
>>
>> On Mon, Jan 13, 2014 at 12:05 PM, Barrett, Brian W <bwbarre_at_[hidden]> wrote:
>> Is there any place that this can actually be used? It's a fairly large
>> change to the RTE interface (which we should try to keep stable), and I
>> can't convince myself that it's useful; in general, if a BTL or MTL is
>> asking for a piece of data, the MPI library is stuck until that data's
>> available. I can see doing some lazy approach, but I can't see making the
>> modex_recv call non-blocking.
>>
>> Brian
>>
>> On 1/11/14 9:28 PM, "Ralph Castain" <rhc_at_[hidden]> wrote:
>>
>> >NOTE: This will involve a change to the MPI-RTE interface
>> >
>> >WHAT: Modify modex_recv to add a callback function that will return the
>> >requested data when it is available
>> >
>> >WHY: Enable faster startup on large scale systems by eliminating the
>> >current mandatory modex barrier during MPI_Init
>> >
>> >HOW: The ompi_modex_recv functions will have callback function and
>> >(void*)cbdata arguments added to them. An ompi_modex_recv_t struct will
>> >be defined that includes a pointer to the returned data plus a
>> >"bool active" that can be used to detect when the data has been
>> >returned if blocking is required.
>> >
>> >When a modex_recv is issued, ORTE will check for the presence of the
>> >requested data and immediately issue a callback if the data is
>> >available. If the data is not available, then ORTE will request the
>> >data from the remote process, and execute the callback when the remote
>> >process returns it.
>> >
>> >The current behavior of a blocking modex barrier will remain the
>> >default - the new behavior will only take effect if specifically
>> >requested by the user via MCA param. With this new behavior, the
>> >current call to "modex" in MPI_Init will become a "no-op" when the
>> >processes are launched via mpirun - this will be executed in ORTE so
>> >that other RTEs that do not wish to support async modex behavior are
>> >not impacted.
>> >
>> >WHEN: No hurry on this as it is intended for 1.9, so let's say mid Feb.
>> >Info on a branch will be made available in the near future.
>> >
>> >
>> >_______________________________________________
>> >devel mailing list
>> >devel_at_[hidden]
>> >http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> >
>>
>>
>> --
>> Brian W. Barrett
>> Scalable System Software Group
>> Sandia National Laboratories
>>
>>
>
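
For reference, the callback scheme from the original RFC quoted above (the one the telecon decided not to adopt) might look roughly like the sketch below. Only the idea of a struct carrying a pointer to the returned data plus a "bool active" comes from the RFC text. The nb_ names, the field layout, the callback signature, and the limit of one pending request per peer are all assumptions made to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NB_MAX_PEERS 8

/* Loosely modeled on the ompi_modex_recv_t described in the RFC;
 * the exact fields are guesses. */
typedef struct {
    const void   *data;
    size_t        size;
    volatile bool active;   /* true until the data has been returned */
} nb_modex_recv_t;

/* Hypothetical callback signature. */
typedef void (*nb_cbfunc_t)(int peer, const void *data, size_t size,
                            void *cbdata);

typedef struct {
    bool        have_data;
    const void *data;
    size_t      size;
    nb_cbfunc_t pending_cb;      /* one queued request per peer here */
    void       *pending_cbdata;
} nb_peer_t;

static nb_peer_t nb_peers[NB_MAX_PEERS];

/* Non-blocking recv: call back immediately if the data is already
 * present, otherwise queue the callback until the data arrives. */
static void nb_modex_recv(int peer, nb_cbfunc_t cbfunc, void *cbdata)
{
    assert(peer >= 0 && peer < NB_MAX_PEERS);
    nb_peer_t *p = &nb_peers[peer];

    if (p->have_data) {
        cbfunc(peer, p->data, p->size, cbdata);
        return;
    }
    /* A real RTE would also request the data from the remote proc. */
    p->pending_cb = cbfunc;
    p->pending_cbdata = cbdata;
}

/* Stand-in for the remote proc's data arriving at the RTE. */
static void nb_modex_arrived(int peer, const void *data, size_t size)
{
    assert(peer >= 0 && peer < NB_MAX_PEERS);
    nb_peer_t *p = &nb_peers[peer];

    p->data = data;
    p->size = size;
    p->have_data = true;

    if (p->pending_cb != NULL) {
        nb_cbfunc_t cb = p->pending_cb;
        p->pending_cb = NULL;
        cb(peer, data, size, p->pending_cbdata);
    }
}

/* Callback that fills in a request and clears its active flag, so a
 * caller that must block can spin on req->active. */
static void nb_store_result(int peer, const void *data, size_t size,
                            void *cbdata)
{
    nb_modex_recv_t *req = cbdata;
    (void)peer;
    req->data = data;
    req->size = size;
    req->active = false;
}
```

Requests for different peers resolve independently: a request for a proc that has already registered its data completes inside nb_modex_recv, while one for a slower proc stays active until nb_modex_arrived runs for that proc.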