Open MPI Development Mailing List Archives

From: George Bosilca (bosilca_at_[hidden])
Date: 2007-11-07 13:16:04


On Nov 7, 2007, at 12:51 PM, Jeff Squyres wrote:

>> The same callback is called in both cases. In the case that you
>> described, the callback is called just a little bit deeper in the
>> recursion, whereas in the "normal" case it gets called from the
>> first level of the recursion. Or maybe I am missing something here ...
>
> Right -- it's not the callback that is the problem. It's when the
> recursion is unwound and further up the stack you now have a stale
> request.

That's exactly the point that I fail to see. If the request is freed
in the PML callback, then it should get released in both cases, and
therefore lead to problems all the time. That, obviously, is not what
happens when we do not have this deep recursion going on.

Moreover, the request management is based on reference counting. The
PML level holds one reference and the MPI level holds another one. In
fact, we cannot release a request until we explicitly call
ompi_request_free on it. The place where this call happens differs
between the blocking and non-blocking calls: in the non-blocking case,
ompi_request_free gets called from the *_test (*_wait) functions, while
in the blocking case it gets called directly from the MPI_Send function.
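
To make the life cycle concrete, here is a minimal sketch of the
reference-counting scheme described above; the names (example_request_t,
example_request_release, and so on) are illustrative only, not the actual
ob1/OMPI symbols.

#include <stdlib.h>

/* One reference is held by the PML, one by the MPI layer.  The request
 * is only truly freed when the last reference is dropped. */
typedef struct {
    int ref_count;
    int complete;
} example_request_t;

static void example_request_release(example_request_t *req)
{
    if (--req->ref_count == 0) {
        free(req);
    }
}

/* PML completion callback: mark completion, drop the PML reference. */
static void example_pml_complete(example_request_t *req)
{
    req->complete = 1;
    example_request_release(req);
}

/* The only place the MPI reference is dropped.  For a blocking send this
 * is called at the bottom of MPI_Send; for a non-blocking send it is
 * called later, from MPI_Test/MPI_Wait. */
static void example_request_free(example_request_t *req)
{
    example_request_release(req);
}

int main(void)
{
    example_request_t *req = malloc(sizeof(*req));
    req->ref_count = 2;             /* one PML reference + one MPI reference */
    req->complete  = 0;
    example_pml_complete(req);      /* PML side completes                    */
    example_request_free(req);      /* MPI side frees -> count reaches zero  */
    return 0;
}

The point of the sketch is that the PML completion callback alone can never
drop the last reference while the MPI layer still holds its own, regardless
of how deep the recursion goes.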

Let me summarize: a request cannot reach a stale state without a call
to ompi_request_free. This function is never called directly from the
PML level. Therefore, the recursion depth should not have any impact
on the state of the request!

Is there a simple test case I can run in order to trigger this strange
behavior?

   Thanks,
     george.

>
>
>>
>> george.
>>
>>> This is *only* a problem for requests that are involved in the
>>> current top-level MPI call. Requests from prior calls to MPI
>>> functions (e.g., a request from a prior call to MPI_ISEND) are ok
>>> because a) we've already done the Right Things to ensure the safety
>>> of that request, and b) that request is not on the recursive stack
>>> anywhere to become stale as the recursion unwinds.
>>>
>>> Right?
>>>
>>> If so, Galen proposes the following:
>>>
>>> 1. in conjunction with the NOT_ON_WIRE proposal...
>>>
>>> 2. make a new PML request flag DONT_FREE_ME (or some better
>>> name :-) ).
>>>
>>> 3. blocking MPI_SEND/MPI_RECV calls will set this flag (or, more
>>> specifically, the top of the PML calls for blocking send/receive)
>>> right when the request is allocated (i.e., before calling
>>> btl_send()).
>>>
>>> 4. when the PML is called for completion on this request, it will do
>>> all the stuff that it needs to effect completion -- but then it will
>>> see the DONT_FREE_ME flag and not actually free the request.
>>> Obviously, if DONT_FREE_ME is *not* set, then the PML does what it
>>> does today: it frees the request.
>>>
>>> 5. the top-level PML call will eventually complete:
>>> 5a. For blocking PML calls (e.g., corresponding to MPI_SEND and
>>> MPI_RECV), the request can be unconditionally freed.
>>> 5b. For non-blocking PML calls (e.g., corresponding to MPI_ISEND),
>>> only free the request if it was completed.
>>>
>>> Note that with this scheme, it becomes irrelevant as to whether the
>>> PML completion call is invoked on the first descent into the BTL or
>>> recursively via opal_progress.
>>>
>>> How does that sound?
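
As a minimal sketch of the DONT_FREE_ME idea in steps 2-5 above (all names
here -- REQ_FLAG_DONT_FREE_ME, sketch_pml_send_blocking, and friends -- are
made up for illustration and are not the real ob1 symbols):

#include <stdlib.h>

#define REQ_FLAG_DONT_FREE_ME 0x1

typedef struct {
    int flags;
    int complete;
} sketch_request_t;

static void sketch_request_free(sketch_request_t *req)
{
    free(req);
}

/* Step 4: do the completion bookkeeping, but only free the request when
 * the DONT_FREE_ME flag is absent (today's behavior). */
static void sketch_pml_completion(sketch_request_t *req)
{
    req->complete = 1;
    if (!(req->flags & REQ_FLAG_DONT_FREE_ME)) {
        sketch_request_free(req);
    }
}

/* Stand-in for btl_send(); in the real code this may re-enter
 * opal_progress() and drive the completion callback recursively. */
static void sketch_btl_send(sketch_request_t *req)
{
    sketch_pml_completion(req);     /* pretend completion happens in-line */
}

/* Steps 3 and 5a: the blocking path sets the flag before btl_send() and
 * frees the request itself once the top-level call is done, so a
 * recursive completion cannot leave a stale request behind. */
static void sketch_pml_send_blocking(void)
{
    sketch_request_t *req = calloc(1, sizeof(*req));
    req->flags |= REQ_FLAG_DONT_FREE_ME;
    sketch_btl_send(req);
    while (!req->complete) {
        /* a real implementation would call opal_progress() here */
    }
    sketch_request_free(req);       /* unconditional free for blocking calls */
}

int main(void)
{
    sketch_pml_send_blocking();
    return 0;
}

The flag simply moves responsibility for the free out of the (possibly
recursive) completion path and into the single top-level caller, so it no
longer matters at which recursion depth completion fires.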
>>>
>>> If that all works, it might be beneficial to put this back to the
>>> 1.2
>>> branch because there are definitely apps that would benefit from it.
>>>
>>>
>>>
>>> On Oct 23, 2007, at 10:19 AM, Gleb Natapov wrote:
>>>
>>>>> So this problem goes WAY back...
>>>>>
>>>>> The problem here is that the PML marks MPI completion just prior to
>>>>> calling btl_send and then returns to the user. This wouldn't be a
>>>>> problem if the BTL then did something, but in the case of OpenIB
>>>>> this fragment may not actually be on the wire (the joys of
>>>>> user-level flow control).
>>>>>
>>>>> One solution that we proposed was to allow btl_send to return either
>>>>> OMPI_SUCCESS or OMPI_NOT_ON_WIRE. OMPI_NOT_ON_WIRE would allow the
>>>>> PML to not mark MPI completion of the fragment, and then MPI_WAITALL
>>>>> and the others will do their jobs properly.
>>>> I even implemented this once, but there is a problem. Currently we
>>>> mark the request as completed at the MPI level and then do
>>>> btl_send(). Whenever the IB completion happens, the request will be
>>>> marked as complete at the PML level and freed. The fix requires
>>>> changing the order like this: call btl_send(), check the return
>>>> value from the BTL, and mark the request complete as necessary. The
>>>> problem is that, because we allow the BTL to call opal_progress()
>>>> internally, the request may already be completed at the MPI and PML
>>>> levels and freed before the call to btl_send() returns.
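
A self-contained, compile-only illustration of that hazard follows; all
names are made up for the example, and it deliberately shows the *broken*
ordering rather than a fix (calling demo_pml_send_reordered() would
exercise the use-after-free on purpose).

#include <stdlib.h>

typedef struct { int mpi_complete; } demo_request_t;

/* PML-level completion: in the problem case it also frees the request. */
static void demo_completion_and_free(demo_request_t *req)
{
    free(req);
}

/* Stand-in for btl_send() when the BTL calls opal_progress() internally
 * and the network completion fires before we return to the PML. */
static int demo_btl_send(demo_request_t *req)
{
    demo_completion_and_free(req);
    return 0;
}

/* The "reordered" fix: mark MPI completion only after btl_send().  Because
 * the request can already be freed inside demo_btl_send(), the line marked
 * (!) is a potential use-after-free -- which is why the reordering alone
 * is not enough while the BTL is allowed to progress recursively. */
static int demo_pml_send_reordered(demo_request_t *req)
{
    int rc = demo_btl_send(req);
    if (rc == 0) {
        req->mpi_complete = 1;      /* (!) request may already be freed */
    }
    return rc;
}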
>>>>
>>>> I did a code review to see how hard it would be to get rid of the
>>>> recursion in Open MPI, and I think this is doable. We have to
>>>> disallow calling progress() (or other functions that may call
>>>> progress() internally) from the BTL and from ULP callbacks that are
>>>> called by the BTL. There are not many places that break this rule.
>>>> The main offenders are calls to FREE_LIST_WAIT(), but those never
>>>> actually call progress if the free list can grow without limit, and
>>>> this is the most common use of FREE_LIST_WAIT(), so they may be
>>>> safely changed to FREE_LIST_GET(). Once we solve the recursion
>>>> problem, the fix for this issue will be a couple of lines of code.
>>>>
>>>>>
>>>>> - Galen
>>>>>
>>>>>
>>>>>
>>>>> On 10/11/07 11:26 AM, "Gleb Natapov" <glebn_at_[hidden]> wrote:
>>>>>
>>>>>> On Fri, Oct 05, 2007 at 09:43:44AM +0200, Jeff Squyres wrote:
>>>>>>> David --
>>>>>>>
>>>>>>> Gleb and I just actively re-looked at this problem yesterday; we
>>>>>>> think it's related to https://svn.open-mpi.org/trac/ompi/ticket/
>>>>>>> 1015. We previously thought this ticket was a different
>>>>>>> problem,
>>>>>>> but
>>>>>>> our analysis yesterday shows that it could be a real problem in
>>>>>>> the
>>>>>>> openib BTL or ob1 PML (kinda think it's the openib btl because
>>>>>>> it
>>>>>>> doesn't seem to happen on other networks, but who knows...).
>>>>>>>
>>>>>>> Gleb is investigating.
>>>>>> Here is the result of the investigation. The problem is different
>>>>>> from ticket #1015. What we have here is that one rank calls
>>>>>> isend() for a small message and wait_all() in a loop, while
>>>>>> another one calls irecv(). The problem is that isend() usually
>>>>>> doesn't call opal_progress() anywhere, and wait_all() doesn't call
>>>>>> progress if all requests are already completed, so messages are
>>>>>> never progressed. We may force opal_progress() to be called by
>>>>>> setting btl_openib_free_list_max to 1000; then wait_all() will
>>>>>> call progress because not every request will be immediately
>>>>>> completed by OB1. Or we can limit the number of uncompleted
>>>>>> requests that OB1 can allocate by setting pml_ob1_free_list_max to
>>>>>> 1000; then opal_progress() will be called from a free_list_wait()
>>>>>> when the max is reached. The second option works much faster for
>>>>>> me.
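
For reference, a minimal sketch of the traffic pattern described above
(this only illustrates the isend()/wait_all() loop, it is not David's
attached bcast-hang.c; the buffer size and loop count are arbitrary):

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, i;
    char buf[16] = {0};
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 100000; i++) {
        if (rank == 0) {
            /* Small (eager) sends: ob1 may mark these MPI-complete even
             * though the fragment is not actually on the wire yet. */
            MPI_Isend(buf, (int)sizeof(buf), MPI_CHAR, 1, 0,
                      MPI_COMM_WORLD, &req);
            MPI_Waitall(1, &req, MPI_STATUSES_IGNORE);
        } else if (rank == 1) {
            MPI_Irecv(buf, (int)sizeof(buf), MPI_CHAR, 0, 0,
                      MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
    }

    MPI_Finalize();
    return 0;
}

With the workaround described above, this would be run with something like
(the executable name is illustrative):

mpirun --mca btl self,openib --mca pml_ob1_free_list_max 1000 -np 2 ./isend_loop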
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Oct 5, 2007, at 12:59 AM, David Daniel wrote:
>>>>>>>
>>>>>>>> Hi Folks,
>>>>>>>>
>>>>>>>> I have been seeing some nasty behaviour in collectives,
>>>>>>>> particularly bcast and reduce. Attached is a reproducer (for
>>>>>>>> bcast).
>>>>>>>>
>>>>>>>> The code will rapidly slow to a crawl (usually interpreted as a
>>>>>>>> hang in real applications) and sometimes gets killed with
>>>>>>>> sigbus
>>>>>>>> or
>>>>>>>> sigterm.
>>>>>>>>
>>>>>>>> I see this with
>>>>>>>>
>>>>>>>> openmpi-1.2.3 or openmpi-1.2.4
>>>>>>>> ofed 1.2
>>>>>>>> linux 2.6.19 + patches
>>>>>>>> gcc (GCC) 3.4.5 20051201 (Red Hat 3.4.5-2)
>>>>>>>> 4 socket, dual core opterons
>>>>>>>>
>>>>>>>> run as
>>>>>>>>
>>>>>>>> mpirun --mca btl self,openib --npernode 1 --np 4 bcast-hang
>>>>>>>>
>>>>>>>> To my now uneducated eye it looks as if the root process is
>>>>>>>> rushing
>>>>>>>> ahead and not progressing earlier bcasts.
>>>>>>>>
>>>>>>>> Anyone else seeing similar? Any ideas for workarounds?
>>>>>>>>
>>>>>>>> As a point of reference, mvapich2 0.9.8 works fine.
>>>>>>>>
>>>>>>>> Thanks, David
>>>>>>>>
>>>>>>>>
>>>>>>>> <bcast-hang.c>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jeff Squyres
>>>>>>> Cisco Systems
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Gleb.
>>>>>
>>>>
>>>> --
>>>> Gleb.
>>>
>>>
>>> --
>>> Jeff Squyres
>>> Cisco Systems
>>>
>>
>
>
> --
> Jeff Squyres
> Cisco Systems
>


