
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Dual quad core Opteron hangs on Bcast.
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2010-01-04 01:23:32

If you're willing to try some stuff:

1) What about "-mca coll_sync_barrier_before 100"?  (The default may be 1000, so you can try various values less than 1000.  I'm suggesting 100.)  Note that a broadcast has largely one-way traffic flow, so the root can run well ahead of the receivers, which can cause undesirable flow-control problems.

2) What about "-mca btl_sm_num_fifos 16"?  The default is 1.  If the problem is Trac ticket 2043, then this suggestion can help.
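
To make those two experiments concrete, here is a rough sketch that reuses the mpiexec line from the original post. The extra barrier values (10 and 500), the DRY_RUN switch, and the run helper are my own illustrative choices, not anything Open MPI defines. By default the script only prints the commands; set DRY_RUN=0 to actually launch them.

```shell
#!/bin/sh
# Sketch only: try each suggestion separately against the demo program.
# DRY_RUN and run() are hypothetical helpers for this example.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"      # just show the command line
    else
        "$@"           # actually launch it
    fi
}

# Suggestion 1: insert a barrier more often than the (possible) default of 1000.
# 100 is the value suggested above; 10 and 500 are arbitrary extra test points.
for n in 10 100 500; do
    run mpiexec -machinefile hostfile -np 5 \
        -mca coll_sync_barrier_before "$n" bcast_example
done

# Suggestion 2: use more shared-memory FIFOs than the default of 1.
run mpiexec -machinefile hostfile -np 5 \
    -mca btl_sm_num_fifos 16 bcast_example
```

Since the hang is intermittent, each setting probably needs several runs before you can conclude anything from it.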

P.S.  There's a memory leak, right?  The receive buffer is being allocated over and over again.  Might not be that closely related to the problem you see here, but at a minimum it's bad style.

Louis Rossi wrote:
I am having a problem with Bcast hanging on a dual quad core Opteron (2382, 2.6GHz, Quad Core, 4 x 512KB L2, 6MB L3 Cache) system running FC11 with openmpi-1.4.  The LD_LIBRARY_PATH and PATH variables are correctly set.  I have used the FC11 rpm distribution of openmpi and also built openmpi-1.4 locally, with the same results.  The problem was first observed in a larger, otherwise reliable CFD code, but I can reproduce it with a simple demo code (attached).  The code attempts to execute 2000 pairs of broadcasts.

The hostfile contains a single line:
<machinename> slots=8

If I run it with 4 cores or fewer, the code will run fine.

If I run it with 5 cores or more, it will hang some of the time after successfully executing several hundred broadcasts.  The number varies from run to run.  The code usually finishes with 5 cores.  The probability of hanging seems to increase with the number of processes.  The command line I use is simple:

mpiexec -machinefile hostfile -np 5 bcast_example

There was some discussion of a similar problem on the user list, but I could not find a resolution.  I have tried setting the processor affinity (--mca mpi_paffinity_alone 1).  I have tried varying the broadcast algorithm (--mca coll_tuned_bcast_algorithm 1-6).  I have also tried excluding (-mca oob_tcp_if_exclude) my eth1 interface (see ifconfig.txt attached) which is not connected to anything.  None of these changed the outcome.

Any thoughts or suggestions would be appreciated.
