Subject: Re: [OMPI users] Random hangs using btl sm with OpenMPI 1.3.2/1.3.3 + gcc4.4?
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-09-22 20:52:46

Jonathan --

Sorry for the delay in replying; thanks for posting again.

I'm actually unable to replicate your problem. :-( I have a new
Intel 8-core X5570 box; I'm running with -np 6 and -np 8 on both Open MPI
1.3.2 and 1.3.3 and am not seeing the hangs you describe. I even
made your sample program worse -- I made a and b 100,000-element
real arrays (increasing the count args in MPI_SENDRECV to 100,000 as
well), and increased nsteps to 150,000,000. No hangs. :-\
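
If it helps to put a number on "hangs maybe 10% of the time", here is a
minimal sketch of a retry loop. It is not something from your setup or
mine: the function name count_hangs is made up for illustration, it
assumes GNU coreutils timeout(1) is available, and it simply treats any
run that exceeds the time limit as hung.

```shell
#!/bin/sh
# count_hangs LIMIT_SECONDS TRIALS COMMAND [ARGS...]
# (hypothetical helper) Runs COMMAND TRIALS times under timeout(1) and
# reports how many runs had to be killed for exceeding LIMIT_SECONDS.
count_hangs() {
    limit=$1
    trials=$2
    shift 2
    hangs=0
    i=0
    while [ "$i" -lt "$trials" ]; do
        # GNU timeout exits with status 124 when it killed the command.
        timeout "$limit" "$@" >/dev/null 2>&1
        if [ $? -eq 124 ]; then
            hangs=$((hangs + 1))
        fi
        i=$((i + 1))
    done
    echo "$hangs/$trials runs exceeded ${limit}s"
}

# Possible usage (the 600 s limit and 20 trials are assumptions; pick a
# limit comfortably longer than a normal run):
#   count_hangs 600 20 mpirun -np 6 -mca btl sm,self ./diffusion-mpi
```

Running the same loop with and without a given MCA parameter gives a
rough side-by-side hang rate rather than an impression from a few runs.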

The version of the compiler *usually* isn't significant, so gcc 4.x
should be fine.

Yes, the sm flow control issue was a significant fix, but the blocking
nature of MPI_SENDRECV means that you shouldn't have run into the
problems that were fixed (the main issues had to do with fast senders
exhausting resources of slow receivers -- but MPI_SENDRECV is
synchronous so the senders should always be matching the speed of the
receivers).

Just for giggles, what happens if you change

       if (leftneighbour .eq. -1) then
          leftneighbour = nprocs-1
       endif
       if (rightneighbour .eq. nprocs) then
          rightneighbour = 0
       endif

to

       if (leftneighbour .eq. -1) then
          leftneighbour = MPI_PROC_NULL
       endif
       if (rightneighbour .eq. nprocs) then
          rightneighbour = MPI_PROC_NULL
       endif

On Sep 21, 2009, at 5:09 PM, Jonathan Dursi wrote:

> Continuing the conversation with myself:
>
> Google pointed me to Trac ticket #1944, which spoke of deadlocks in
> looped collective operations; there is no collective operation
> anywhere in this sample code, but trying one of the suggested
> workarounds/clues: that is, setting btl_sm_num_fifos to at least
> (np-1) seems to make things work quite reliably, for both OpenMPI
> 1.3.2 and 1.3.3; that is, while this
>
> mpirun -np 6 -mca btl sm,self ./diffusion-mpi
>
> invariably hangs (at random-seeming numbers of iterations) with
> OpenMPI 1.3.2 and sometimes hangs (maybe 10% of the time, again
> seemingly randomly) with 1.3.3,
>
> mpirun -np 6 -mca btl tcp,self ./diffusion-mpi
>
> or
>
> mpirun -np 6 -mca btl_sm_num_fifos 5 -mca btl sm,self ./diffusion-mpi
>
> always succeeds, with (as one might guess) the second being much
> faster...
>
> Jonathan

Jeff Squyres