I have been seeing some nasty behaviour in collectives, particularly
bcast and reduce. Attached is a reproducer (for bcast).
The code rapidly slows to a crawl (usually interpreted as a hang
in real applications) and sometimes gets killed with SIGBUS or SIGTERM.
I see this with:
openmpi-1.2.3 or openmpi-1.2.4
linux 2.6.19 + patches
gcc (GCC) 3.4.5 20051201 (Red Hat 3.4.5-2)
4 socket, dual core opterons
mpirun --mca btl self,openib --npernode 1 --np 4 bcast-hang
To my untrained eye it looks as if the root process is rushing
ahead and not progressing earlier bcasts.
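In outline, the reproducer boils down to something like the following (a sketch rather than the exact attachment; the message size and iteration count here are guesses). The root just loops over MPI_Bcast with no synchronization, so on an eager transport it can run arbitrarily far ahead of the other ranks:

```c
/* bcast-hang.c -- sketch of the reproducer (message size and
 * iteration count are guesses, not the attachment's values).
 * The root rank loops over MPI_Bcast with nothing to throttle it,
 * so it can queue up bcasts faster than non-root ranks drain them.
 */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 100000;        /* guessed iteration count */
    const int count = 1024;          /* guessed message size */
    int i;
    double *buf;

    MPI_Init(&argc, &argv);
    buf = malloc(count * sizeof(double));

    for (i = 0; i < iters; i++) {
        MPI_Bcast(buf, count, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        /* One mitigation to try: force periodic synchronization so
         * the root cannot run unboundedly ahead, e.g.:
         *   if (i % 1000 == 0) MPI_Barrier(MPI_COMM_WORLD);
         */
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```

The commented-out barrier is only a guess at a workaround, not something I have verified fixes it.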
Anyone else seeing similar? Any ideas for workarounds?
As a point of reference, mvapich2 0.9.8 works fine.