Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Dual quad core Opteron hangs on Bcast.
From: Louis Rossi (rossi_at_[hidden])
Date: 2010-01-05 11:33:30


Hi Eugene,

   I believe that r22335 did resolve the issue. The problem was
between my screen and my chair. Last night, I reset my paths, but the
new directory was appended to paths that still contained the old MPI
directory, so I think it was still linking with the old libraries. I'll try
it in a production run, but it passed the simpler tests that the old
library failed. I'll post another note if it fails anywhere, but I am
confident that the problem is resolved as you first thought.
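
For anyone who wants to check for the same mistake, here is a minimal sketch of why appending did not help. It assumes the usual first-match rule for colon-separated search paths, under which the first directory that contains the file wins. The function name `first_match` and the directory names are hypothetical, chosen only for illustration; they are not the actual paths on my machine.

```python
import os


def first_match(name, search_path, exists=os.path.isfile):
    """Return the first path on search_path that holds `name`, or None.

    `search_path` is a string of directories joined by os.pathsep,
    like PATH or LD_LIBRARY_PATH. `exists` is injectable so the lookup
    can be exercised without touching the real filesystem.
    """
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a trailing separator
        candidate = os.path.join(directory, name)
        if exists(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # Hypothetical install locations for the old and new builds.
    old_dir = os.path.join(os.sep, "opt", "openmpi-old", "bin")
    new_dir = os.path.join(os.sep, "opt", "openmpi-new", "bin")
    present = {os.path.join(old_dir, "mpirun"), os.path.join(new_dir, "mpirun")}

    appended = os.pathsep.join([old_dir, new_dir])   # what I did
    prepended = os.pathsep.join([new_dir, old_dir])  # what I should have done

    print("appended :", first_match("mpirun", appended, present.__contains__))
    print("prepended:", first_match("mpirun", prepended, present.__contains__))
```

With the new directory appended, the old installation is still found first; putting the new directory at the front (or removing the old entry) is what actually switches builds.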

   Best regards,
     Lou

On 01/05/2010 10:55 AM, Eugene Loh wrote:
> Hmm, perhaps not so excellent. It seems to me that
> openmpi-1.4a1r22335 does have the fixes to trac 2043. So, either the
> fixes are insufficient or you're experiencing a different
> problem. I'll see if I can reproduce your problem, but I'm not
> confident here.
>
> Louis Rossi wrote:
>> Hi Eugene,
>> Excellent! I could not find r22324. I found r22335 on the openmpi
>> download site (under nightly snapshots), but this did not solve the
>> problem. Any thoughts on where I can find it?
>>
>> On 01/04/2010 09:53 AM, Eugene Loh wrote:
>>> On 01/04/2010 01:23 AM, Eugene Loh wrote:
>>>> 1) What about "-mca coll_sync_barrier_before 100"? (The default
>>>> may be 1000. So, you can try various values less than 1000. I'm
>>>> suggesting 100.) Note that broadcast has somewhat one-way traffic
>>>> flow, which can have some undesirable flow control issues.
>>>>
>>>> 2) What about "-mca btl_sm_num_fifos 16"? Default is 1. If the
>>>> problem is trac ticket 2043, then this suggestion can help.
>>> Louis Rossi wrote:
>>>> Hi Eugene,
>>>>
>>>> Thank you for replying so quickly. You are right that there is a
>>>> memory leak. It's not the source of the problem, but I added a
>>>> free(pMessage) to remove the issue. (In my defense, I borrowed a
>>>> simple broadcast example off the web and wrapped it in a loop.)
>>>>
>>>> Anyway, the great news is that suggestion #2 solved the problem
>>>> for the example. (At least it has not failed yet. I'm exercising
>>>> the solution on the original larger problem now.) Suggestion #1
>>>> did not. Should I post the resolution to the mailing list or is
>>>> this a well known solution? I see this parameter listed under
>>>> performance tuning on the ompi site, but only in reference to
>>>> congestion. There is no comment that bcasts could hang.
>>> Louis Rossi wrote:
>>>> Hi Eugene,
>>>> OK. You nailed it with suggestion #2.
>>>> Many thanks,
>>>> Lou
>>> Great. Next time, go ahead and respond to the wider mail alias so
>>> that everyone learns that your particular problem was resolved.
>>>
>>> I will update the trac ticket to point to this as another instance
>>> of this problem.
>>>
>>> One signature of the problem is that GCC 4.4.0 or later exposes the
>>> problem, while earlier revs do not. I can't tell for sure, but it
>>> appears to me that this condition is met with Fedora 11.
>>>
>>> Our understanding of trac 2043 has recently improved immensely. It
>>> would be great if you could confirm the fix. The ticket is at
>>> https://svn.open-mpi.org/trac/ompi/ticket/2043 . r22324 should fix
>>> the problem. If you could get that version, build with GCC
>>> (presumably 4.4.0 or more recent), then the workaround should no
>>> longer be needed.

-- 
"Through nonaction, no action is left undone." --Lao Tzu
Louis F. Rossi				rossi_at_[hidden]
Department of Mathematical Sciences	http://www.math.udel.edu/~rossi
University of Delaware			(302) 831-1880 (voice)
Newark, DE 19716			(302) 831-4511 (fax)