George Bosilca wrote:
> You're right, the sentence was messed up. My intent was to say that I
> found the problem, made a fix, and once this fix was applied to the trunk
> I was not able to reproduce the deadlock.
But you were able to reproduce the deadlock before you made the fix?
Anyhow, if I get fresh bits (through r20947) and I back out r20944
(either in the source code or simply by setting the mpool_sm_min_size
MCA parameter to 0), I get the deadlock.
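
A rough sketch of the parameter route, assuming a Bourne-style shell rather than csh. The two-process count, the sm,self BTL selection, and the ./osu_bw path are illustrative assumptions, not the exact command line used here:

```shell
# Back out r20944 at run time by zeroing the MCA parameter
# (Bourne-shell form of the csh setenv line quoted further down).
export OMPI_MCA_mpool_sm_min_size=0

# Illustrative run: 2 ranks on one node over the shared-memory BTL.
# The process count and the ./osu_bw path are assumptions.
if command -v mpirun >/dev/null 2>&1 && [ -x ./osu_bw ]; then
    mpirun -np 2 --mca btl sm,self ./osu_bw
else
    echo "mpirun or ./osu_bw not found; skipping run"
fi
```

The same setting can also be passed on the mpirun command line as "--mca mpool_sm_min_size 0" instead of through the environment.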
> Based on your description of the bug I forced osu_bw to send 1024 non-
> blocking sends (instead of the default 64), and I still don't get the
> deadlock. I'm thrilled ...
Yes, that's a good test. You're sure you had mpool_sm_min_size set to
0? I just don't have the same luck you do. I get the hang even with
your fixes.
> On Apr 6, 2009, at 19:56 , Eugene Loh wrote:
>
>> George Bosilca wrote:
>>
>>> I got some free time (yeh haw) and took a look at the OB1 PML in
>>> order to fix the issue. I think I found the problem, as I'm unable
>>> to reproduce this error.
>>
>> Sorry, this sentence has me baffled. Are you unable to reproduce
>> the problem before the fixes or afterwards? The first step is to
>> reproduce the problem, right? To do so:
>>
>> A) Back out r20944. Easy way to do that is just
>>
>> % setenv OMPI_MCA_mpool_sm_min_size 0
>>
>> B) Check that osu_bw.c hangs when using sm and you reach rendezvous
>> message size.
>>
>> C) Introduce your changes and make sure that osu_bw.c runs to
>> completion.
>>
>>> Can you please give it a try with 20946 and 20947 but without 20944?
>>
>> osu_bw.c hangs for me. The PML fix did not seem to work.
>