Gleb and I took another look at this problem yesterday; we think it's
related to
https://svn.open-mpi.org/trac/ompi/ticket/1015
We previously thought this ticket was a different problem, but our
analysis yesterday shows that it could be a real problem in the
openib BTL or ob1 PML (I kinda think it's the openib BTL because it
doesn't seem to happen on other networks, but who knows...).
Gleb is investigating.
On Oct 5, 2007, at 12:59 AM, David Daniel wrote:
> Hi Folks,
> I have been seeing some nasty behaviour in collectives,
> particularly bcast and reduce. Attached is a reproducer (for bcast).
> The code will rapidly slow to a crawl (usually interpreted as a
> hang in real applications) and sometimes gets killed with sigbus or
> I see this with
>   openmpi-1.2.3 or openmpi-1.2.4
>   ofed 1.2
>   linux 2.6.19 + patches
>   gcc (GCC) 3.4.5 20051201 (Red Hat 3.4.5-2)
>   4 socket, dual core opterons
> run as
>   mpirun --mca btl self,openib --npernode 1 --np 4 bcast-hang
> To my now uneducated eye it looks as if the root process is rushing
> ahead and not progressing earlier bcasts.
> Anyone else seeing similar? Any ideas for workarounds?
> As a point of reference, mvapich2 0.9.8 works fine.
> Thanks, David