Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] trac 1857: SM btl hangs when msg >=4k
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-04-07 09:16:51


FWIW:

- running osu_bw with mpool_sm_min_size==0 hangs for me on x86_64,
RHEL4U4/6
- running osu_bw with mpool_sm_min_size==<large> works for me on
x86_64, RHEL4U4/6

Tested this morning with trunk r
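
In case anyone wants to script the comparison, here is a rough sketch of the two runs above. It is not the exact command line I used. Assumptions: the osu_bw path, the "large" value, and the timeout are placeholders, and I am assuming two processes on one node restricted to the sm and self BTLs. It defaults to a dry run that only prints the commands; set DRY_RUN=0 to actually launch them, in which case a hang shows up as a timeout.

```shell
#!/bin/sh
# Sketch: compare osu_bw over the sm BTL with two mpool_sm_min_size settings.
# OSU_BW, LARGE and TIMEOUT are placeholders (assumptions), not values
# taken from this thread.
OSU_BW="${OSU_BW:-./osu_bw}"
LARGE="${LARGE:-134217728}"    # hypothetical "large" value (128 MiB)
TIMEOUT="${TIMEOUT:-300}"      # seconds to wait before calling it a hang

# Print the mpirun command line for a given mpool_sm_min_size value.
osu_bw_cmd() {
    echo "mpirun -np 2 --mca btl sm,self --mca mpool_sm_min_size $1 $OSU_BW"
}

# Run one case. With DRY_RUN=1 (the default) only print the command.
# Otherwise run it under timeout(1), which exits with status 124 if the
# command had to be killed because it did not finish in time.
run_case() {
    cmd=$(osu_bw_cmd "$1")
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$cmd"
    else
        timeout "$TIMEOUT" $cmd
        echo "mpool_sm_min_size=$1 exit status: $?"
    fi
}

run_case 0           # the case reported to hang
run_case "$LARGE"    # the case reported to work
```

Setting the parameter through the environment (OMPI_MCA_mpool_sm_min_size) should be equivalent to passing it with --mca on the command line.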

On Apr 7, 2009, at 7:07 AM, Lenny Verkhovsky wrote:

> r20948 still hangs, changing mpool_sm_min_size solves it.
>
> Lenny.
>
> On Tue, Apr 7, 2009 at 3:42 AM, Eugene Loh <Eugene.Loh_at_[hidden]> wrote:
> George Bosilca wrote:
>
> You're right, the sentence was messed up. My intent was to say that
> I found the problem, made a fix, and once this fix was applied to the
> trunk I was not able to reproduce the deadlock.
>
> But you were able to reproduce the deadlock before you made the fix?
>
> Anyhow, if I get fresh bits (through r20947) and I back out r20944
> (either in the source code or simply by setting the
> mpool_sm_min_size MCA parameter to 0), I get deadlock.
>
>
> Based on your description of the bug I forced osu_bw to send 1024
> non-blocking sends (instead of the default 64), and I still don't
> get the deadlock. I'm thrilled ...
>
> Yes, that's a good test. You're sure you had mpool_sm_min_size set
> to 0? I just don't have the same luck you do. I get the hang even
> with your fixes.
>
>
> On Apr 6, 2009, at 19:56 , Eugene Loh wrote:
>
> George Bosilca wrote:
>
> I got some free time (yeh haw) and took a look at the OB1 PML in
> order to fix the issue. I think I found the problem, as I'm unable
> to reproduce this error.
>
> Sorry, this sentence has me baffled. Are you unable to reproduce
> the problem before the fixes or afterwards? The first step is to
> reproduce the problem, right? To do so:
>
> A) Back out r20944. Easy way to do that is just
>
> % setenv OMPI_MCA_mpool_sm_min_size 0
>
> B) Check that osu_bw.c hangs when using sm and you reach
> rendezvous message size.
>
> C) Introduce your changes and make sure that osu_bw.c runs to
> completion.
>
> Can you please give it a try with 20946 and 20947 but without 20944?
>
> osu_bw.c hangs for me. The PML fix did not seem to work.
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Jeff Squyres
Cisco Systems