Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Dual quad core Opteron hangs on Bcast.
From: Lenny Verkhovsky (lenny.verkhovsky_at_[hidden])
Date: 2010-01-04 05:17:33


Have you tried the IMB benchmark with Bcast? I think the problem is in the app.
All ranks in the communicator must enter Bcast, and since you have an
if (rank == 0) / else structure, they don't all go through the same flow:
  if (iRank == 0)
  {
      iLength = sizeof (acMessage);
      MPI_Bcast (&iLength, 1, MPI_INT, 0, MPI_COMM_WORLD);
      MPI_Bcast (acMessage, iLength, MPI_CHAR, 0, MPI_COMM_WORLD);
      printf ("Process 0: Message sent\n");
  }
  else
  {
      MPI_Bcast (&iLength, 1, MPI_INT, 0, MPI_COMM_WORLD);
      pMessage = (char *) malloc (iLength);
      MPI_Bcast (pMessage, iLength, MPI_CHAR, 0, MPI_COMM_WORLD);
      printf ("Process %d: %s\n", iRank, pMessage);
  }
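
For comparison, here is a minimal, untested sketch of the demo restructured so
that every rank issues exactly the same sequence of collectives. The message
text, loop count, and variable names are just placeholders modeled on your
snippet, and the receive buffer is freed each iteration (which also addresses
the repeated malloc Eugene mentions below):

#include <mpi.h>
#include <stdlib.h>

int main (int argc, char *argv[])
{
    char  acMessage[] = "Hello from rank 0";   /* placeholder payload */
    char *pMessage    = NULL;
    int   iRank, iLength = 0, i;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &iRank);

    for (i = 0; i < 2000; i++)                 /* 2000 pairs of broadcasts */
    {
        if (iRank == 0)
            iLength = sizeof (acMessage);

        /* every rank makes the same two collective calls, in the same order */
        MPI_Bcast (&iLength, 1, MPI_INT, 0, MPI_COMM_WORLD);

        if (iRank == 0)
        {
            MPI_Bcast (acMessage, iLength, MPI_CHAR, 0, MPI_COMM_WORLD);
        }
        else
        {
            pMessage = (char *) malloc (iLength);
            MPI_Bcast (pMessage, iLength, MPI_CHAR, 0, MPI_COMM_WORLD);
            free (pMessage);                   /* no buffer leak across iterations */
        }
    }

    MPI_Finalize ();
    return 0;
}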

Lenny.

On Mon, Jan 4, 2010 at 8:23 AM, Eugene Loh <Eugene.Loh_at_[hidden]> wrote:

> If you're willing to try some stuff:
>
> 1) What about "-mca coll_sync_barrier_before 100"? (The default may be
> 1000, so you can try various values below that; I'm suggesting 100.)
> Note that broadcast traffic flows largely one way, which can run into
> undesirable flow-control issues.
>
> 2) What about "-mca btl_sm_num_fifos 16"? Default is 1. If the problem is
> trac ticket 2043, then this suggestion can help.
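>
> For example (untested), both parameters could be combined on your existing
> mpiexec line:
>
>   mpiexec -machinefile hostfile -np 5 -mca coll_sync_barrier_before 100 \
>       -mca btl_sm_num_fifos 16 bcast_example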
>
> P.S. There's a memory leak, right? The receive buffer is being allocated
> over and over again. Might not be that closely related to the problem you
> see here, but at a minimum it's bad style.
>
> Louis Rossi wrote:
>
> I am having a problem with Bcast hanging on a dual quad-core Opteron (2382,
> 2.6GHz, Quad Core, 4 x 512KB L2, 6MB L3 cache) system running FC11 with
> openmpi-1.4. The LD_LIBRARY_PATH and PATH variables are correctly set. I
> have used the FC11 rpm distribution of openmpi and built openmpi-1.4 locally,
> with the same results. The problem was first observed in a larger, reliable
> CFD code, but I can reproduce it with a simple demo code (attached).
> The code attempts to execute 2000 pairs of broadcasts.
>
> The hostfile contains a single line
> <machinename> slots=8
>
> If I run it with 4 cores or fewer, the code will run fine.
>
> If I run it with 5 cores or more, it will hang some of the time after
> successfully executing several hundred broadcasts. The number varies from
> run to run. The code usually finishes with 5 cores. The probability of
> hanging seems to increase with the number of nodes. The syntax I use is
> simple.
>
> mpiexec -machinefile hostfile -np 5 bcast_example
>
> There was some discussion of a similar problem on the user list, but I
> could not find a resolution. I have tried setting the processor affinity
> (--mca mpi_paffinity_alone 1). I have tried varying the broadcast algorithm
> (--mca coll_tuned_bcast_algorithm 1-6). I have also tried excluding (-mca
> oob_tcp_if_exclude) my eth1 interface (see ifconfig.txt attached) which is
> not connected to anything. None of these changed the outcome.
>
> Any thoughts or suggestions would be appreciated.
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>