I am having a problem with BCast hanging on a dual quad core Opteron
(2382, 2.6GHz, Quad Core, 4 x 512KB L2, 6MB L3 Cache) system running
FC11 with openmpi-1.4. The LD_LIBRARY_PATH and PATH variables are
correctly set. I have tried both the FC11 rpm distribution of openmpi and
a locally built openmpi-1.4, with the same results. The problem was first
observed in a larger, reliable CFD code, but I can reproduce it
with a simple demo code (attached). The code attempts to execute 2000
pairs of broadcasts.
The hostfile contains a single line
If I run it with 4 cores or fewer, the code runs fine.
If I run it with 5 cores or more, it hangs some of the time after
successfully executing several hundred broadcasts. The number of
completed broadcasts varies from run to run. The code usually finishes
with 5 cores. The probability of hanging seems to increase with the
number of cores. The command I use is simple:
mpiexec -machinefile hostfile -np 5 bcast_example
There was some discussion of a similar problem on the user list, but I
could not find a resolution. I have tried setting the processor
affinity (--mca mpi_paffinity_alone 1). I have tried varying the
broadcast algorithm (--mca coll_tuned_bcast_algorithm 1-6). I have also
tried excluding (--mca oob_tcp_if_exclude) my eth1 interface (see
ifconfig.txt attached), which is not connected to anything. None of
these changed the outcome.
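For anyone who wants to repeat the algorithm sweep, it can be scripted
roughly as below. This is only a sketch: the function name sweep_bcast,
the NP, LIMIT and DRY_RUN variables, and the use of timeout(1) to flag a
hang are illustrative assumptions, not the exact procedure that was
followed. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch: run bcast_example once per forced broadcast algorithm (1-6).
# Hypothetical names (not from the original runs): sweep_bcast, NP,
# LIMIT, DRY_RUN. A run is treated as hung if timeout(1) has to kill it.

sweep_bcast() {
    np="${NP:-5}"          # number of processes, as in the -np 5 example
    limit="${LIMIT:-120}"  # seconds to wait before declaring a hang (assumed value)
    for alg in 1 2 3 4 5 6; do
        cmd="mpiexec --mca coll_tuned_bcast_algorithm $alg -machinefile hostfile -np $np bcast_example"
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run: show the command without executing it.
            echo "$cmd"
        else
            # $cmd is intentionally unquoted so it splits into words.
            timeout "$limit" $cmd
            status=$?
            if [ "$status" -eq 124 ]; then
                echo "algorithm $alg: no exit within ${limit}s (possible hang)"
            else
                echo "algorithm $alg: exit status $status"
            fi
        fi
    done
}

sweep_bcast
```

Setting DRY_RUN=0 in the environment executes the commands instead of
printing them. Since the hang is intermittent, each setting would likely
need to be repeated several times before drawing any conclusion.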
Any thoughts or suggestions would be appreciated.
"Through nonaction, no action is left undone." --Lao Tzu
Louis F. Rossi rossi_at_[hidden]
Department of Mathematical Sciences http://www.math.udel.edu/~rossi
University of Delaware (302) 831-1880 (voice)
Newark, DE 19716 (302) 831-4511 (fax)