Subject: [OMPI users] Random-ish hangs using btl sm with OpenMPI 1.3.2 + gcc4.4?
From: Jonathan Dursi (ljdursi_at_[hidden])
Date: 2009-09-15 14:17:05

We have installed a couple of builds of OpenMPI 1.3.2 here, and
we are having real problems with single-node jobs randomly hanging when
using the shared memory BTL, particularly (but perhaps not only) when
using the version compiled with gcc 4.4.0.

The very trivial attached program, which just does a series of SENDRECVs
rightwards through MPI_COMM_WORLD, hangs extremely reliably when run
like so on an 8 core box:

mpirun -np 6 -mca btl self,sm ./diffusion-mpi

(The test example is based on a simple Fortran example of parallelizing
the 1D diffusion equation with MPI.) The hang always seems to occur
within the first 500 or so iterations, sometimes between the 10th and
20th and sometimes not until the late 400s. It occurs both on a new
dual-socket quad-core Nehalem box and on an older Harpertown machine.

Running without sm, however, seems to work fine:

mpirun -np 6 -mca btl self,tcp ./diffusion-mpi

never gives any problems.
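
For anyone trying to reproduce this, here is a rough sketch of how the
two runs could be scripted with a time limit, so that a hang shows up as
a result rather than a stuck terminal. It assumes GNU coreutils `timeout'
is installed; the helper name run_or_hang and the 120 second limit in the
commented example are arbitrary choices for illustration, not something
we have tuned.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): run a command under a time
# limit and report whether it finished, failed, or had to be killed.
# Assumes GNU coreutils `timeout`, which exits with status 124 when the
# limit expires; a timed-out run is treated here as a hang.
run_or_hang() {
    limit=$1
    shift
    timeout "$limit" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "hang"
    elif [ "$status" -eq 0 ]; then
        echo "ok"
    else
        echo "fail"
    fi
}

# Example invocations (commented out; they need mpirun and the test binary):
# run_or_hang 120 mpirun -np 6 -mca btl self,sm  ./diffusion-mpi
# run_or_hang 120 mpirun -np 6 -mca btl self,tcp ./diffusion-mpi
```

The program's own "Step =" output is left visible, so the last step
printed before a "hang" line shows roughly where the run stalled. The
limit has to be longer than a healthy run takes on the machine in
question, otherwise a slow run would be misreported as a hang.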

Any suggestions? I notice a mention of `improved flow control in sm' in
the ChangeLog to 1.3.3; is that a significant bugfix?

        - Jonathan

Jonathan Dursi     <ljdursi_at_[hidden]>

       program diffuse
       implicit none
       include "mpif.h"
       integer nsteps
       parameter (nsteps = 150000)
       integer step

       real a,b

       integer ierr
       integer mpistatus(MPI_STATUS_SIZE)
       integer nprocs,rank
       integer leftneighbour, rightneighbour
       integer tag

       call MPI_INIT(ierr)
       call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)
       call MPI_COMM_SIZE(MPI_COMM_WORLD,nprocs,ierr)

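       ! Neighbours on a periodic ring of ranks
       leftneighbour = rank-1
       if (leftneighbour .eq. -1) then
          leftneighbour = nprocs-1
       endif
       rightneighbour = rank+1
       if (rightneighbour .eq. nprocs) then
          rightneighbour = 0
       endif

       tag = 1
       ! Payload value is arbitrary; set only so the send buffer is defined
       a = real(rank)

       ! Each rank sends one real to the right and receives one from the left
       do step=1, nsteps
           call MPI_SENDRECV(a,1,MPI_REAL,rightneighbour, &
     & tag, &
     & b, 1, MPI_REAL, leftneighbour, &
     & tag, &
     & MPI_COMM_WORLD, mpistatus, ierr)
           if ((rank .eq. 0) .and. (mod(step,10) .eq. 1)) then
                   print *, 'Step = ', step
           endif
       enddo

       call MPI_FINALIZE(ierr)
       end program diffuse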