Subject: [OMPI users] Dual quad core Opteron hangs on Bcast.
From: Louis Rossi (rossi_at_[hidden])
Date: 2010-01-04 01:04:32

I am having a problem with Bcast hanging on a dual quad-core Opteron
(2382, 2.6GHz, Quad Core, 4 x 512KB L2, 6MB L3 Cache) system running
FC11 with openmpi-1.4. The LD_LIBRARY_PATH and PATH variables are
correctly set. I have tried both the FC11 rpm distribution of openmpi
and a locally built openmpi-1.4, with the same results. The problem was
first observed in a larger, reliable CFD code, but I can reproduce it
with a simple demo code (attached). The code attempts to execute 2000
pairs of broadcasts.

The hostfile contains a single line:
<machinename> slots=8

If I run it with 4 cores or fewer, the code will run fine.

If I run it with 5 cores or more, it will hang some of the time after
successfully executing several hundred broadcasts. The number of
completed broadcasts varies from run to run. The code usually finishes
with 5 cores. The probability of hanging seems to increase with the
number of processes. The command I use is simple:

mpiexec -machinefile hostfile -np 5 bcast_example
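
Because the hang is intermittent, rerunning the command under a timeout
is one way to put a number on how often it occurs at each process count.
The following is a minimal Python sketch for anyone trying to reproduce
this; it is not part of the attached demo. The helper names (run_once,
hang_rate, mpiexec_cmd), the timeout, and the trial count are all
illustrative assumptions, and a run that is merely slow would be
miscounted as hung if the timeout is too short.

```python
import subprocess


def run_once(cmd, timeout_s):
    """Run cmd once; return 'ok', 'failed' (non-zero exit) or 'hung' (timed out)."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the process it launched before re-raising.
        # Only the launcher itself is killed, so stray MPI ranks may need
        # to be cleaned up by hand afterwards.
        return "hung"
    return "ok" if result.returncode == 0 else "failed"


def hang_rate(cmd, trials, timeout_s):
    """Fraction of `trials` runs of cmd that exceeded the timeout."""
    outcomes = [run_once(cmd, timeout_s) for _ in range(trials)]
    return outcomes.count("hung") / trials


def mpiexec_cmd(np, hostfile="hostfile", program="bcast_example"):
    """Build the command line quoted above for a given process count."""
    return ["mpiexec", "-machinefile", hostfile, "-np", str(np), program]


def sweep(np_values, trials=20, timeout_s=60.0):
    """Print the observed hang rate for each process count (assumed defaults)."""
    for np in np_values:
        rate = hang_rate(mpiexec_cmd(np), trials, timeout_s)
        print(f"-np {np}: hung in {rate:.0%} of {trials} runs")
```

Calling sweep(range(4, 9)) on the affected machine would report the
hang rate for 4 through 8 processes.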

There was some discussion of a similar problem on the user list, but I
could not find a resolution. I have tried setting the processor
affinity (--mca mpi_paffinity_alone 1). I have tried varying the
broadcast algorithm (--mca coll_tuned_bcast_algorithm 1-6). I have also
tried excluding (-mca oob_tcp_if_exclude) my eth1 interface (see
ifconfig.txt attached), which is not connected to anything. None of
these changed the outcome.
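
The variations listed above can be written out as explicit command
lines, which makes it easier to rerun each one the same way. This is a
hedged Python sketch: the helper with_mca and the names BASE and
VARIANTS are hypothetical. One assumption goes beyond the commands
quoted in this message: as far as I know, a forced
coll_tuned_bcast_algorithm value is only honoured when
coll_tuned_use_dynamic_rules is also set to 1, so the sketch adds that
parameter.

```python
def with_mca(base_cmd, params):
    """Insert '--mca name value' pairs right after the launcher name."""
    extra = []
    for name, value in params:
        extra += ["--mca", name, str(value)]
    return base_cmd[:1] + extra + base_cmd[1:]


# Baseline command from this message.
BASE = ["mpiexec", "-machinefile", "hostfile", "-np", "5", "bcast_example"]

# One entry per variation that was tried.
VARIANTS = {
    "affinity": [("mpi_paffinity_alone", 1)],
    "exclude_eth1": [("oob_tcp_if_exclude", "eth1")],
}
for alg in range(1, 7):
    VARIANTS[f"bcast_alg_{alg}"] = [
        # Assumption, not in the original commands: forced algorithms
        # are believed to need dynamic rules enabled to take effect.
        ("coll_tuned_use_dynamic_rules", 1),
        ("coll_tuned_bcast_algorithm", alg),
    ]

# Full command line for every variation, keyed by a short label.
COMMANDS = {label: with_mca(BASE, params) for label, params in VARIANTS.items()}
```

Printing " ".join(COMMANDS[label]) for each label gives command lines
that can be pasted into a shell.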

Any thoughts or suggestions would be appreciated.

