
Open MPI Development Mailing List Archives


From: George Bosilca (bosilca_at_[hidden])
Date: 2007-11-07 12:29:41


On Nov 7, 2007, at 11:06 AM, Jeff Squyres wrote:

> Gleb --
>
> I finally talked with Galen and Don about this issue in depth. Our
> understanding is that the "request may get freed before recursion
> unwinds" issue is *only* a problem within the context of a single MPI
> call (e.g., MPI_SEND). Is that right?

I wonder how this happens?

> Specifically, if in an MPI_SEND, the BTL ends up buffering the message
> and setting early completion, but then recurses into opal_progress()
> and ends up sending the message and freeing the request during the
> recursion, then when the recursion unwinds, the original caller will
> have a stale request.
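>
> A self-contained toy model of the hazard (plain C; the names are
> stand-ins, not actual Open MPI code):
>
>   #include <stdio.h>
>   #include <stdlib.h>
>
>   struct request { int complete; };
>   static struct request *pending;   /* frag queued by the "BTL" */
>
>   static void progress(void) {      /* stands in for opal_progress() */
>       if (pending) {
>           pending->complete = 1;
>           free(pending);            /* request freed in the recursion */
>           pending = NULL;
>       }
>   }
>
>   static void btl_send(struct request *req) {
>       pending = req;                /* buffer + early completion */
>       progress();                   /* the recursion happens here */
>   }
>
>   int main(void) {
>       struct request *req = calloc(1, sizeof *req);
>       btl_send(req);
>       /* req now dangles: touching req->complete is use-after-free */
>       printf("caller still holds freed request %p\n", (void *)req);
>       return 0;
>   }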

The same callback is called in both cases. In the case that you
describe, the callback is just called a little deeper into the
recursion, whereas in the "normal" case it gets called from the first
level of the recursion. Or maybe I'm missing something here ...

   george.

> This is *only* a problem for requests that are involved in the
> current top-level MPI call. Requests from prior calls to MPI functions
> (e.g., a request from a prior call to MPI_ISEND) are ok because a)
> we've already done the Right Things to ensure the safety of that
> request, and b) that request is not on the recursive stack anywhere to
> become stale as the recursion unwinds.
>
> Right?
>
> If so, Galen proposes the following:
>
> 1. in conjunction with the NOT_ON_WIRE proposal...
>
> 2. make a new PML request flag DONT_FREE_ME (or some better
> name :-) ).
>
> 3. blocking MPI_SEND/MPI_RECV calls will set this flag (or, more
> specifically, the top of the PML calls for blocking send/receive)
> right when the request is allocated (i.e., before calling btl_send()).
>
> 4. when the PML is called for completion on this request, it will do
> all the stuff that it needs to effect completion -- but then it will
> see the DONT_FREE_ME flag and not actually free the request.
> Obviously, if DONT_FREE_ME is *not* set, then the PML does what it
> does today: it frees the request.
>
> 5. the top-level PML call will eventually complete:
> 5a. For blocking PML calls (e.g., corresponding to MPI_SEND and
> MPI_RECV), the request can be unconditionally freed.
> 5b. For non-blocking PML calls (e.g., corresponding to MPI_ISEND),
> only free the request if it was completed.
>
> Note that with this scheme, it becomes irrelevant whether the
> PML completion call is invoked on the first descent into the BTL or
> recursively via opal_progress.
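>
> In code, the idea is something like the following toy sketch (flag
> and function names are made up for illustration; only the
> DONT_FREE_ME name comes from the proposal):
>
>   #include <stdbool.h>
>   #include <stdlib.h>
>
>   #define PML_REQ_DONT_FREE_ME 0x1        /* the proposed flag */
>
>   struct pml_request { int flags; bool complete; };
>
>   /* step 4: completion, which may run at any recursion depth */
>   static void pml_complete(struct pml_request *req) {
>       req->complete = true;
>       if (!(req->flags & PML_REQ_DONT_FREE_ME))
>           free(req);                      /* today's behavior */
>       /* else: the top-level caller owns the free */
>   }
>
>   /* steps 3 and 5a: the blocking send path */
>   static void pml_send_blocking(void) {
>       struct pml_request *req = calloc(1, sizeof *req);
>       req->flags |= PML_REQ_DONT_FREE_ME; /* step 3 */
>       pml_complete(req);  /* stand-in for btl_send() + progress */
>       /* step 5a: request survived the recursion; free it here */
>       free(req);
>   }
>
>   int main(void) { pml_send_blocking(); return 0; }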
>
> How does that sound?
>
> If that all works, it might be beneficial to put this back to the 1.2
> branch because there are definitely apps that would benefit from it.
>
>
>
> On Oct 23, 2007, at 10:19 AM, Gleb Natapov wrote:
>
>>> So this problem goes WAY back...
>>>
>>> The problem here is that the PML marks MPI completion just prior to
>>> calling
>>> btl_send and then returns to the user. This wouldn't be a problem
>>> if the BTL
>>> then did something, but in the case of OpenIB this fragment may not
>>> actually
>>> be on the wire (the joys of user-level flow control).
>>>
>>> One solution that we proposed was to allow btl_send to return either
>>> OMPI_SUCCESS or OMPI_NOT_ON_WIRE. OMPI_NOT_ON_WIRE would allow the
>>> PML to
>>> not mark MPI completion of the fragment, and then MPI_WAITALL and
>>> others will
>>> do their job properly.
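>>>
>>> A sketch of the intended PML logic (OMPI_NOT_ON_WIRE is the
>>> proposed return code; the surrounding names are illustrative):
>>>
>>>   rc = btl_send(btl, endpoint, frag);
>>>   if (OMPI_SUCCESS == rc) {
>>>       pml_mark_mpi_complete(sendreq);   /* really on the wire */
>>>   } else if (OMPI_NOT_ON_WIRE == rc) {
>>>       /* leave incomplete; MPI_WAITALL et al. will progress it */
>>>   }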
>> I even implemented this once, but there is a problem. Currently we
>> mark the request as completed at the MPI level and then do btl_send().
>> Whenever the IB completion happens, the request is marked as complete
>> at the PML level and freed. The fix requires changing the order like
>> this: call btl_send(), check the return value from the BTL, and mark
>> the request complete as necessary. The problem is that, because we
>> allow the BTL to call opal_progress() internally, the request may
>> already be completed at the MPI and PML levels and freed before the
>> return from the call to btl_send().
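>>
>> That is, the reordered version would look like this (illustrative):
>>
>>   rc = btl_send(btl, endpoint, frag);  /* may recurse into
>>                                           opal_progress() */
>>   if (OMPI_SUCCESS == rc)
>>       mark_complete(sendreq);          /* too late: sendreq may
>>                                           already be freed */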
>>
>> I did a code review to see how hard it would be to get rid of the
>> recursion in Open MPI, and I think this is doable. We have to disallow
>> calling progress() (or other functions that may call progress()
>> internally) from the BTL and from ULP callbacks that are called by the
>> BTL. There are not many places that break this rule. The main
>> offenders are calls to FREE_LIST_WAIT(), but those never actually call
>> progress if the list can grow without limit, and this is the most
>> common use of FREE_LIST_WAIT(), so they may safely be changed to
>> FREE_LIST_GET(). Once we solve the recursion problem, the fix for the
>> original problem will be a couple of lines of code.
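>>
>> For example, at allocation sites where the free list can grow
>> without limit (macro usage illustrative):
>>
>>   /* unsafe from BTL context: may spin in opal_progress() */
>>   OMPI_FREE_LIST_WAIT(&pml->send_requests, item, rc);
>>
>>   /* safe: never progresses; it can only fail when the list is
>>      capped, which cannot happen for an unbounded list */
>>   OMPI_FREE_LIST_GET(&pml->send_requests, item, rc);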
>>
>>>
>>> - Galen
>>>
>>>
>>>
>>> On 10/11/07 11:26 AM, "Gleb Natapov" <glebn_at_[hidden]> wrote:
>>>
>>>> On Fri, Oct 05, 2007 at 09:43:44AM +0200, Jeff Squyres wrote:
>>>>> David --
>>>>>
>>>>> Gleb and I just actively re-looked at this problem yesterday; we
>>>>> think it's related to https://svn.open-mpi.org/trac/ompi/ticket/1015.
>>>>> We previously thought this ticket was a different problem, but
>>>>> our analysis yesterday shows that it could be a real problem in
>>>>> the
>>>>> openib BTL or ob1 PML (kinda think it's the openib btl because it
>>>>> doesn't seem to happen on other networks, but who knows...).
>>>>>
>>>>> Gleb is investigating.
>>>> Here is the result of the investigation. The problem is different
>>>> from ticket #1015. What we have here is one rank calling isend() of
>>>> a small message and wait_all() in a loop, while another one calls
>>>> irecv(). The problem is that isend() usually doesn't call
>>>> opal_progress() anywhere, and wait_all() doesn't call progress if
>>>> all requests are already completed, so messages are never
>>>> progressed. We can force opal_progress() to be called by setting
>>>> btl_openib_free_list_max to 1000; then wait_all() will call progress
>>>> because not every request will be immediately completed by OB1. Or
>>>> we can limit the number of uncompleted requests that OB1 can
>>>> allocate by setting pml_ob1_free_list_max to 1000; then
>>>> opal_progress() will be called from free_list_wait() when the max is
>>>> reached. The second option works much faster for me.
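>>>>
>>>> For example, with David's reproducer from below:
>>>>
>>>>   mpirun --mca pml_ob1_free_list_max 1000 \
>>>>       --mca btl self,openib --npernode 1 --np 4 bcast-hang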
>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Oct 5, 2007, at 12:59 AM, David Daniel wrote:
>>>>>
>>>>>> Hi Folks,
>>>>>>
>>>>>> I have been seeing some nasty behaviour in collectives,
>>>>>> particularly bcast and reduce. Attached is a reproducer (for
>>>>>> bcast).
>>>>>>
>>>>>> The code will rapidly slow to a crawl (usually interpreted as a
>>>>>> hang in real applications) and sometimes gets killed with sigbus
>>>>>> or
>>>>>> sigterm.
>>>>>>
>>>>>> I see this with
>>>>>>
>>>>>> openmpi-1.2.3 or openmpi-1.2.4
>>>>>> ofed 1.2
>>>>>> linux 2.6.19 + patches
>>>>>> gcc (GCC) 3.4.5 20051201 (Red Hat 3.4.5-2)
>>>>>> 4 socket, dual core opterons
>>>>>>
>>>>>> run as
>>>>>>
>>>>>> mpirun --mca btl self,openib --npernode 1 --np 4 bcast-hang
>>>>>>
>>>>>> To my now uneducated eye it looks as if the root process is
>>>>>> rushing
>>>>>> ahead and not progressing earlier bcasts.
>>>>>>
>>>>>> Anyone else seeing similar? Any ideas for workarounds?
>>>>>>
>>>>>> As a point of reference, mvapich2 0.9.8 works fine.
>>>>>>
>>>>>> Thanks, David
>>>>>>
>>>>>>
>>>>>> <bcast-hang.c>
>>>>>
>>>>>
>>>>> --
>>>>> Jeff Squyres
>>>>> Cisco Systems
>>>>>
>>>>
>>>> --
>>>> Gleb.
>>>
>>
>> --
>> Gleb.
>
>
> --
> Jeff Squyres
> Cisco Systems
>


