
Open MPI Development Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-11-07 12:51:19


On Nov 7, 2007, at 12:29 PM, George Bosilca wrote:

>> I finally talked with Galen and Don about this issue in depth. Our
>> understanding is that the "request may get freed before recursion
>> unwinds" issue is *only* a problem within the context of a single MPI
>> call (e.g., MPI_SEND). Is that right?
>
> I wonder how this happens?
>
>> Specifically, if in an MPI_SEND, the BTL ends up buffering the
>> message
>> and setting early completion, but then recurses into opal_progress()
>> and ends up sending the message and freeing the request during the
>> recursion, then when the recursion unwinds, the original caller will
>> have a stale request.
>
> The same callback is called in both cases. In the case that you
> described, the callback is just called a little deeper in the
> recursion, whereas in the "normal case" it gets called from the
> first level of the recursion. Or maybe I am missing something here ...

Right -- the callback itself isn't the problem. The problem is that
when the recursion unwinds, further up the stack you are now holding a
stale request.
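
To make the hazard concrete, here is a minimal, self-contained sketch
(hypothetical names and types, not the real ob1/openib code): the send
path marks early completion, then recurses into progress(), which may
free the very request the caller is still holding on its stack.

    #include <stdlib.h>
    #include <stdbool.h>

    struct request {
        bool complete;
    };

    static struct request *outstanding = NULL;

    static void progress(void)
    {
        /* Completion fires during the recursion: the request is freed
         * here while mpi_send() below still holds a pointer to it. */
        if (outstanding != NULL && outstanding->complete) {
            free(outstanding);
            outstanding = NULL;
        }
    }

    static void btl_send(struct request *req)
    {
        (void)req;
        /* The BTL buffers the fragment and recurses into progress()
         * (e.g., while waiting for flow-control credits). */
        progress();
    }

    static void mpi_send(void)
    {
        struct request *req = malloc(sizeof(*req));
        req->complete = true;   /* early MPI completion, set before the send */
        outstanding = req;

        btl_send(req);          /* may free req via the recursion above */

        /* When the recursion unwinds, req may already be stale; any
         * further access here would be a use-after-free. */
    }

    int main(void)
    {
        mpi_send();
        return 0;
    }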

>
> george.
>
>> This is *only* a problem for requests that are involved in the
>> current top-level MPI call. Requests from prior calls to MPI
>> functions (e.g., a request from a prior call to MPI_ISEND) are ok
>> because a) we've already done the Right Things to ensure the safety
>> of that request, and b) that request is not on the recursive stack
>> anywhere to become stale as the recursion unwinds.
>>
>> Right?
>>
>> If so, Galen proposes the following:
>>
>> 1. in conjunction with the NOT_ON_WIRE proposal...
>>
>> 2. make a new PML request flag DONT_FREE_ME (or some better
>> name :-) ).
>>
>> 3. blocking MPI_SEND/MPI_RECV calls will set this flag (or, more
>> specifically, the top of the PML calls for blocking send/receive)
>> right when the request is allocated (i.e., before calling
>> btl_send()).
>>
>> 4. when the PML is called for completion on this request, it will do
>> all the stuff that it needs to effect completion -- but then it will
>> see the DONT_FREE_ME flag and not actually free the request.
>> Obviously, if DONT_FREE_ME is *not* set, then the PML does what it
>> does today: it frees the request.
>>
>> 5. the top-level PML call will eventually complete:
>> 5a. For blocking PML calls (e.g., corresponding to MPI_SEND and
>> MPI_RECV), the request can be unconditionally freed.
>> 5b. For non-blocking PML calls (e.g., corresponding to MPI_ISEND),
>> only free the request if it was completed.
>>
>> Note that with this scheme, it becomes irrelevant whether the
>> PML completion call is invoked on the first descent into the BTL or
>> recursively via opal_progress.
>>
>> How does that sound?
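
A hedged sketch of how steps 3-5a might look (flag and type names such
as REQ_FLAG_DONT_FREE_ME and pml_request_t are made up for
illustration, not the actual Open MPI symbols):

    #include <stdlib.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define REQ_FLAG_DONT_FREE_ME 0x1

    typedef struct {
        uint32_t flags;
        bool     complete;
    } pml_request_t;

    /* Step 4: the completion path honors the flag instead of freeing. */
    static void pml_complete(pml_request_t *req)
    {
        req->complete = true;
        if (!(req->flags & REQ_FLAG_DONT_FREE_ME)) {
            free(req);      /* today's behavior for non-blocking requests */
        }
        /* otherwise leave the request alone; the top-level call owns it */
    }

    /* Step 3: the blocking send path sets the flag before btl_send(). */
    static int pml_send_blocking(void)
    {
        pml_request_t *req = calloc(1, sizeof(*req));
        req->flags |= REQ_FLAG_DONT_FREE_ME;

        /* ... btl_send(req) would go here; completion may fire at any
         * recursion depth and calls pml_complete(req), which now never
         * frees req ... */
        pml_complete(req);  /* stand-in for the real completion */

        /* Step 5a: the blocking call frees the request unconditionally. */
        free(req);
        return 0;
    }

    int main(void)
    {
        return pml_send_blocking();
    }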
>>
>> If that all works, it might be beneficial to put this back to the 1.2
>> branch because there are definitely apps that would benefit from it.
>>
>>
>>
>> On Oct 23, 2007, at 10:19 AM, Gleb Natapov wrote:
>>
>>>> So this problem goes WAY back...
>>>>
>>>> The problem here is that the PML marks MPI completion just prior
>>>> to calling btl_send and then returns to the user. This wouldn't be
>>>> a problem if the BTL then did something, but in the case of OpenIB
>>>> this fragment may not actually be on the wire (the joys of
>>>> user-level flow control).
>>>>
>>>> One solution that we proposed was to allow btl_send to return
>>>> either OMPI_SUCCESS or OMPI_NOT_ON_WIRE. OMPI_NOT_ON_WIRE would
>>>> allow the PML to not mark MPI completion of the fragment, and then
>>>> MPI_WAITALL and the others will do their job properly.
>>> I even implemented this once, but there is a problem. Currently we
>>> mark the request as completed at the MPI level and then do
>>> btl_send(). When the IB completion happens, the request is marked
>>> as complete at the PML level and freed. The fix requires changing
>>> the order like this: call btl_send(), check the return value from
>>> the BTL, and mark the request complete as necessary. The problem is
>>> that, because we allow the BTL to call opal_progress() internally,
>>> the request may already be completed at the MPI and PML levels and
>>> freed before the call to btl_send() returns.
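
The reordering Gleb describes would look roughly like this
(OMPI_NOT_ON_WIRE is the proposed return code; the request type and
BTL stub are hypothetical):

    #include <stdbool.h>

    #define OMPI_SUCCESS     0
    #define OMPI_NOT_ON_WIRE 1

    typedef struct {
        bool mpi_complete;
    } send_request_t;

    /* Hypothetical BTL send: returns OMPI_NOT_ON_WIRE when the fragment
     * was only buffered (e.g., no flow-control credits) rather than
     * actually put on the wire. */
    static int btl_send(send_request_t *req)
    {
        (void)req;
        return OMPI_NOT_ON_WIRE;
    }

    static void pml_send_fragment(send_request_t *req)
    {
        /* Proposed order: send first, then decide about MPI completion. */
        int rc = btl_send(req);

        /* The catch: if the BTL recursed into opal_progress(), the
         * completion may already have fired and freed req, so touching
         * req here is only safe once that recursion is removed. */
        if (rc == OMPI_SUCCESS) {
            req->mpi_complete = true;   /* fragment really is on the wire */
        }
        /* OMPI_NOT_ON_WIRE: leave the request pending; MPI_WAITALL and
         * friends will keep progressing it until the BTL completes it. */
    }

    int main(void)
    {
        send_request_t req = { false };
        pml_send_fragment(&req);
        return 0;
    }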
>>>
>>> I did a code review to see how hard it would be to get rid of
>>> recursion in Open MPI and I think this is doable. We have to
>>> disallow calling progress() (or other functions that may call
>>> progress() internally) from the BTL and from ULP callbacks that are
>>> called by the BTL. There are not many places that break this rule.
>>> The main offenders are calls to FREE_LIST_WAIT(), but those never
>>> actually call progress if the list can grow without limit -- and
>>> this is the most common use of FREE_LIST_WAIT() -- so they may be
>>> safely changed to FREE_LIST_GET(). Once the recursion problem is
>>> solved, the fix to this problem will be a couple of lines of code.
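
Roughly, the distinction between the two free-list calls looks like
this (simplified, with hypothetical signatures -- not the real OMPI
free-list macros):

    #include <stdlib.h>
    #include <stddef.h>

    struct free_list { size_t num_free; int can_grow; };
    struct item      { int unused; };

    static void opal_progress_stub(void) { /* would drain completions */ }

    /* FREE_LIST_GET: take an item or fail immediately; never progresses. */
    static struct item *free_list_get(struct free_list *fl)
    {
        if (fl->num_free > 0) {
            fl->num_free--;
            return malloc(sizeof(struct item));  /* stand-in for taking an item */
        }
        if (fl->can_grow) {
            return malloc(sizeof(struct item));  /* stand-in for growing the list */
        }
        return NULL;                             /* list exhausted and capped */
    }

    /* FREE_LIST_WAIT: spin in progress until an item frees up -- exactly
     * the recursion the BTL must not trigger.  If the list can grow
     * without limit, the loop body never runs, which is why such calls
     * can be switched to FREE_LIST_GET safely. */
    static struct item *free_list_wait(struct free_list *fl)
    {
        struct item *it;
        while ((it = free_list_get(fl)) == NULL) {
            opal_progress_stub();
        }
        return it;
    }

    int main(void)
    {
        struct free_list fl = { 0, 1 };  /* empty but allowed to grow */
        free(free_list_wait(&fl));       /* never actually spins */
        return 0;
    }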
>>>
>>>>
>>>> - Galen
>>>>
>>>>
>>>>
>>>> On 10/11/07 11:26 AM, "Gleb Natapov" <glebn_at_[hidden]> wrote:
>>>>
>>>>> On Fri, Oct 05, 2007 at 09:43:44AM +0200, Jeff Squyres wrote:
>>>>>> David --
>>>>>>
>>>>>> Gleb and I just actively re-looked at this problem yesterday; we
>>>>>> think it's related to
>>>>>> https://svn.open-mpi.org/trac/ompi/ticket/1015. We previously
>>>>>> thought this ticket was a different problem, but our analysis
>>>>>> yesterday shows that it could be a real problem in the openib BTL
>>>>>> or ob1 PML (kinda think it's the openib btl because it doesn't
>>>>>> seem to happen on other networks, but who knows...).
>>>>>>
>>>>>> Gleb is investigating.
>>>>> Here is the result of the investigation. The problem is different
>>>>> from ticket #1015. What we have here is that one rank calls
>>>>> isend() of a small message and wait_all() in a loop, and another
>>>>> one calls irecv(). The problem is that isend() usually doesn't
>>>>> call opal_progress() anywhere, and wait_all() doesn't call
>>>>> progress if all requests are already completed, so messages are
>>>>> never progressed. We may force opal_progress() to be called by
>>>>> setting btl_openib_free_list_max to 1000. Then wait_all() will
>>>>> call progress because not every request will be immediately
>>>>> completed by OB1. Or we can limit the number of uncompleted
>>>>> requests that OB1 can allocate by setting pml_ob1_free_list_max
>>>>> to 1000. Then opal_progress() will be called from a
>>>>> free_list_wait() when the maximum is reached. The second option
>>>>> works much faster for me.
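
For anyone reproducing this with David's test, that workaround amounts
to adding the MCA parameter on the mpirun command line, e.g. (the value
1000 is just the one Gleb used above):

    mpirun --mca btl self,openib --mca pml_ob1_free_list_max 1000 \
        --npernode 1 --np 4 bcast-hang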
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Oct 5, 2007, at 12:59 AM, David Daniel wrote:
>>>>>>
>>>>>>> Hi Folks,
>>>>>>>
>>>>>>> I have been seeing some nasty behaviour in collectives,
>>>>>>> particularly bcast and reduce. Attached is a reproducer (for
>>>>>>> bcast).
>>>>>>>
>>>>>>> The code will rapidly slow to a crawl (usually interpreted as a
>>>>>>> hang in real applications) and sometimes gets killed with sigbus
>>>>>>> or
>>>>>>> sigterm.
>>>>>>>
>>>>>>> I see this with
>>>>>>>
>>>>>>> openmpi-1.2.3 or openmpi-1.2.4
>>>>>>> ofed 1.2
>>>>>>> linux 2.6.19 + patches
>>>>>>> gcc (GCC) 3.4.5 20051201 (Red Hat 3.4.5-2)
>>>>>>> 4 socket, dual core opterons
>>>>>>>
>>>>>>> run as
>>>>>>>
>>>>>>> mpirun --mca btl self,openib --npernode 1 --np 4 bcast-hang
>>>>>>>
>>>>>>> To my now uneducated eye it looks as if the root process is
>>>>>>> rushing
>>>>>>> ahead and not progressing earlier bcasts.
>>>>>>>
>>>>>>> Anyone else seeing similar? Any ideas for workarounds?
>>>>>>>
>>>>>>> As a point of reference, mvapich2 0.9.8 works fine.
>>>>>>>
>>>>>>> Thanks, David
>>>>>>>
>>>>>>>
>>>>>>> <bcast-hang.c>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jeff Squyres
>>>>>> Cisco Systems
>>>>>>
>>>>>
>>>>> --
>>>>> Gleb.
>>>>
>>>
>>> --
>>> Gleb.
>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>

-- 
Jeff Squyres
Cisco Systems