Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] trac 1857: SM btl hangs when msg >=4k
From: George Bosilca (bosilca_at_[hidden])
Date: 2009-04-07 13:00:33


This is interesting. I cannot trigger this deadlock on my AMD cluster
even when I set the sm_min_size to zero. However, on an Intel cluster
this can be triggered pretty easily.

Anyway, I think I finally understood where the problem is coming from.
r20952 and r20953 are commits that, in addition to the ones from
yesterday, fix the problem. Please read the log on r20953 to see how
this happens.

Of course, please stress it before we move it to the 1.3 branch.

   george.

On Apr 7, 2009, at 09:16 , Jeff Squyres wrote:

> FWIW:
>
> - running osu_bw with mpool_sm_min_size==0 hangs for me on x86_64,
> RHEL4U4/6
> - running osu_bw with mpool_sm_min_size==<large> works for me on
> x86_64, RHEL4U4/6
>
> Tested this morning with trunk r
>
>
> On Apr 7, 2009, at 7:07 AM, Lenny Verkhovsky wrote:
>
>> r20948 still hangs, changing mpool_sm_min_size solves it.
>>
>> Lenny.
>>
>> On Tue, Apr 7, 2009 at 3:42 AM, Eugene Loh <Eugene.Loh_at_[hidden]>
>> wrote:
>> George Bosilca wrote:
>>
>> You're right, the sentence was messed up. My intent was to say that
>> I found the problem, made a fix, and once this fix was applied to
>> the trunk I was not able to reproduce the deadlock.
>>
>> But you were able to reproduce the deadlock before you made the fix?
>>
>> Anyhow, if I get fresh bits (through r20947) and I back out r20944
>> (either in the source code or simply by setting the
>> mpool_sm_min_size MCA parameter to 0), I get deadlock.
>>
>>
>> Based on your description of the bug I forced osu_bw to send 1024
>> non-blocking sends (instead of the default 64), and I still don't
>> get the deadlock. I'm thrilled ...
>>
>> Yes, that's a good test. You're sure you had mpool_sm_min_size set
>> to 0? I just don't have the same luck you do. I get the hang even
>> with your fixes.
>>
>>
>> On Apr 6, 2009, at 19:56 , Eugene Loh wrote:
>>
>> George Bosilca wrote:
>>
>> I got some free time (yeh haw) and took a look at the OB1 PML in
>> order to fix the issue. I think I found the problem, as I'm
>> unable to reproduce this error.
>>
>> Sorry, this sentence has me baffled. Are you unable to reproduce
>> the problem before the fixes or afterwards? The first step is to
>> reproduce the problem, right? To do so:
>>
>> A) Back out r20944. Easy way to do that is just
>>
>> % setenv OMPI_MCA_mpool_sm_min_size 0
>>
>> B) Check that osu_bw.c hangs when using sm and you reach
>> rendezvous message size.
>>
>> C) Introduce your changes and make sure that osu_bw.c runs to
>> completion.
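
[Editor's sketch] Steps A through C above could be automated roughly as follows. This is a minimal sketch, not anything from the thread: the helper names (mca_env, run_with_hang_check, osu_bw_command), the ./osu_bw path, the process count, and the timeout are all assumptions for illustration. The only pieces taken from the discussion are the mpool_sm_min_size parameter and the OMPI_MCA_ environment prefix. A run that exceeds the timeout is treated as a hang.

```python
import os
import subprocess


def mca_env(params, base_env=None):
    """Return a copy of the environment with OMPI_MCA_<name> set per parameter."""
    env = dict(os.environ if base_env is None else base_env)
    for name, value in params.items():
        # Same effect as `setenv OMPI_MCA_<name> <value>` in csh.
        env["OMPI_MCA_" + name] = str(value)
    return env


def run_with_hang_check(cmd, env, timeout_s):
    """Run cmd and classify the outcome as 'completed', 'failed', or 'hang'."""
    try:
        result = subprocess.run(cmd, env=env, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising on timeout.
        return "hang"
    return "completed" if result.returncode == 0 else "failed"


def osu_bw_command(binary="./osu_bw", nprocs=2):
    """Assumed launch line: two ranks on one node, restricted to the sm BTL."""
    return ["mpirun", "-np", str(nprocs), "--mca", "btl", "sm,self", binary]
```

For example, run_with_hang_check(osu_bw_command(), mca_env({"mpool_sm_min_size": 0}), 120) would be expected to return "hang" for step B and "completed" for step C if the fix works. Note that ranks started by mpirun may need separate cleanup after a timeout.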
>>
>> Can you please give it a try with 20946 and 20947 but without 20944?
>>
>> osu_bw.c hangs for me. The PML fix did not seem to work.
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>
>
> --
> Jeff Squyres
> Cisco Systems