Open MPI User's Mailing List Archives

Subject: [OMPI users] Dual quad core Opteron hangs on Bcast.
From: Louis Rossi (rossi_at_[hidden])
Date: 2010-01-04 01:04:32

I am having a problem with MPI_Bcast hanging on a dual quad-core Opteron
(2382, 2.6GHz, Quad Core, 4 x 512KB L2, 6MB L3 Cache) system running
FC11 with openmpi-1.4. The LD_LIBRARY_PATH and PATH variables are
correctly set. I have used the FC11 rpm distribution of openmpi and
built openmpi-1.4 locally, with the same results. The problem was first
observed in a larger, otherwise reliable CFD code, but I can reproduce it
with a simple demo code (attached). The code attempts to execute 2000
pairs of broadcasts.

The hostfile contains a single line:
<machinename> slots=8

If I run it with 4 cores or fewer, the code will run fine.

If I run it with 5 cores or more, it will hang some of the time after
successfully executing several hundred broadcasts. The number of
broadcasts completed before the hang varies from run to run. The code
usually finishes with 5 cores. The probability of hanging seems to
increase with the number of processes. The command I use is simple:

mpiexec -machinefile hostfile -np 5 bcast_example
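
Because the hang is intermittent, one way to quantify it is to repeat the run under a time limit and count how many runs get killed. The following is only a sketch, not something from the original report: the helper name `run_trials` and the trial count and time limit in the usage line are hypothetical, and it assumes GNU coreutils `timeout` is installed.

```shell
# run_trials N SECONDS CMD...
# Run CMD N times, each under `timeout`, and print how many runs
# were killed for exceeding SECONDS (treated here as "hung").
run_trials() {
    trials=$1
    limit=$2
    shift 2
    hung=0
    i=0
    while [ "$i" -lt "$trials" ]; do
        status=0
        timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
        # GNU timeout exits with status 124 when the time limit is hit.
        if [ "$status" -eq 124 ]; then
            hung=$((hung + 1))
        fi
        i=$((i + 1))
    done
    echo "$hung"
}
```

For example, `run_trials 20 60 mpiexec -machinefile hostfile -np 5 bcast_example` would report how many of 20 runs failed to finish within 60 seconds; the limit should be chosen well above the normal run time.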

There was some discussion of a similar problem on the user list, but I
could not find a resolution. I have tried setting the processor
affinity (--mca mpi_paffinity_alone 1). I have tried varying the
broadcast algorithm (--mca coll_tuned_bcast_algorithm 1-6). I have also
tried excluding (--mca oob_tcp_if_exclude) my eth1 interface (see
ifconfig.txt attached) which is not connected to anything. None of
these changed the outcome.
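
The broadcast-algorithm sweep described above can be sketched as a small shell loop. This is an illustration, not the exact commands that were run: the helper name `bcast_cmd` is hypothetical, and the extra `coll_tuned_use_dynamic_rules 1` setting is an assumption, based on the tuned collective component normally honoring a forced algorithm only when dynamic rules are enabled. The script only prints the command lines.

```shell
# Print one mpiexec command line per forced broadcast algorithm.
# Assumption: coll_tuned_use_dynamic_rules=1 is required for the
# forced algorithm choice to take effect.
bcast_cmd() {
    # $1 = algorithm number, $2 = number of processes
    printf 'mpiexec -machinefile hostfile -np %s --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm %s bcast_example\n' "$2" "$1"
}

# Dry run: list the six commands for a 5-process job.
for alg in 1 2 3 4 5 6; do
    bcast_cmd "$alg" 5
done
```

Piping the output to `sh` would execute the commands one after another; `ompi_info --param coll tuned` lists the parameters that the installed version actually supports.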

Any thoughts or suggestions would be appreciated.

"Through nonaction, no action is left undone." --Lao Tzu
Louis F. Rossi				rossi_at_[hidden]
Department of Mathematical Sciences
University of Delaware			(302) 831-1880 (voice)
Newark, DE 19716			(302) 831-4511 (fax)