On 01/04/2010 01:23 AM, Eugene Loh wrote:
1) What about "-mca coll_sync_barrier_before 100"? (The default may be
1000. So, you can try various values less than 1000. I'm suggesting
100.) Note that broadcast has somewhat one-way traffic flow, which can
have some undesirable flow control issues.
2) What about "-mca btl_sm_num_fifos 16"? Default is 1. If the
problem is trac ticket 2043, then this suggestion can help.
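
For reference, a minimal sketch of how the two suggestions would be passed on the mpirun command line. The process count (4) and the executable name (./bcast_loop) are placeholders and do not come from this thread; only the two -mca options are taken from the suggestions above. The script only prints the commands, so it can be read or run without launching anything.

```shell
#!/bin/sh
# Placeholders (assumptions, not from the thread): 4 processes, ./bcast_loop binary.
NP=4
APP=./bcast_loop

# Suggestion 1: set coll_sync_barrier_before to 100, below the possible default of 1000.
CMD1="mpirun -np $NP -mca coll_sync_barrier_before 100 $APP"

# Suggestion 2: raise btl_sm_num_fifos from its default of 1 to 16.
CMD2="mpirun -np $NP -mca btl_sm_num_fifos 16 $APP"

# Print the two command lines; paste one into a shell to actually launch the job.
echo "$CMD1"
echo "$CMD2"
```

Trying one option at a time, as sketched here, makes it easier to tell which of the two changes the behavior.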
Louis Rossi wrote:
Thank you for replying so quickly. You are right that there is a
memory leak. It's not the source of the problem, but I added a
free(pMessage) to remove the issue. (In my defense, I borrowed a
simple broadcast example off the web and wrapped it in a loop.)
Anyway, the great news is that suggestion #2 solved the problem for
the example. (At least it has not failed yet. I'm exercising the
solution on the original larger problem now.) Suggestion #1 did not.
Should I post the resolution to the mailing list or is this a well
known solution? I see this parameter listed under performance tuning
on the ompi site, but only in reference to congestion. There is no
comment that bcasts could hang.
Great. Next time, go ahead and respond to the wider mail alias so that
everyone learns that your particular problem was resolved.
OK. You nailed it with suggestion #2.
I will update the trac ticket to point to this as another instance of
One signature of the problem is that GCC 4.4.0 or later exposes the
problem, while earlier revs do not. I can't tell for sure, but it
appears to me that this condition is met with Fedora 11.
Our understanding of trac 2043 has recently improved immensely. It
would be great if you could confirm the fix. The ticket is at
https://svn.open-mpi.org/trac/ompi/ticket/2043 . r22324 should fix the
problem. If you could get that version, build with GCC (presumably
4.4.0 or more recent), then the workaround should no longer be needed.