
Open MPI Development Mailing List Archives


From: David Daniel (ddd_at_[hidden])
Date: 2007-10-04 18:59:01


Hi Folks,

I have been seeing some nasty behaviour in collectives, particularly
bcast and reduce. Attached is a reproducer (for bcast).

The code rapidly slows to a crawl (usually interpreted as a hang
in real applications) and sometimes gets killed with SIGBUS or SIGTERM.
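
In case the attachment does not come through, here is a minimal
sketch of the kind of reproducer I mean (buffer size and iteration
count are arbitrary, not the exact values from the attached code):

   /* bcast-hang.c: repeated broadcasts with no intervening
    * synchronization, so the root can run far ahead of the
    * other ranks. */
   #include <mpi.h>
   #include <stdio.h>
   #include <stdlib.h>

   int main(int argc, char **argv)
   {
       int rank, i;
       const int count = 1024;     /* arbitrary message size */
       const int iters = 100000;   /* arbitrary iteration count */
       int *buf;

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       buf = malloc(count * sizeof(int));

       for (i = 0; i < iters; i++) {
           if (rank == 0 && i % 1000 == 0)
               printf("iteration %d\n", i);
           MPI_Bcast(buf, count, MPI_INT, 0, MPI_COMM_WORLD);
       }

       free(buf);
       MPI_Finalize();
       return 0;
   }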

I see this with

   openmpi-1.2.3 or openmpi-1.2.4
   OFED 1.2
   Linux 2.6.19 + patches
   gcc (GCC) 3.4.5 20051201 (Red Hat 3.4.5-2)
   4-socket, dual-core Opterons

run as

   mpirun --mca btl self,openib --npernode 1 --np 4 bcast-hang

To my now-uneducated eye, it looks as if the root process is rushing
ahead and not progressing earlier bcasts.

Anyone else seeing similar behaviour? Any ideas for workarounds?
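
If that theory is right, one untested idea would be to throttle the
loop with a periodic barrier, so the root can never get more than a
bounded number of bcasts ahead. For example, replacing the loop in
the sketch above with:

   /* Untested: a barrier every 100 iterations (the interval is
    * arbitrary) bounds how far ahead the root can run. */
   for (i = 0; i < iters; i++) {
       MPI_Bcast(buf, count, MPI_INT, 0, MPI_COMM_WORLD);
       if (i % 100 == 99)
           MPI_Barrier(MPI_COMM_WORLD);
   }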

As a point of reference, MVAPICH2 0.9.8 works fine.

Thanks, David