Jonathan Dursi wrote:
> We have a couple of installations of OpenMPI 1.3.2 here, and
> we are having real problems with single-node jobs randomly hanging
> when using the shared memory BTL, particularly (but perhaps not only)
> when using the version compiled with gcc 4.4.0.
> The very trivial attached program, which just does a series of
> SENDRECVs rightwards through MPI_COMM_WORLD, hangs extremely
> reliably when run like so on an 8-core box:
> mpirun -np 6 -mca btl self,sm ./diffusion-mpi
> (the test example was based on a simple Fortran example of MPIing the
> 1D diffusion equation). The hanging seems to always occur within the
> first 500 or so iterations, but sometimes between the 10th and 20th
> and sometimes not until the late 400s. The hanging occurs both on a
> new dual-socket quad-core Nehalem box and an older Harpertown machine.
> Running without sm, however, seems to work fine:
> mpirun -np 6 -mca btl self,tcp ./diffusion-mpi
> never gives any problems.
> Any suggestions? I notice a mention of `improved flow control in sm'
> in the ChangeLog to 1.3.3; is that a significant bugfix?
I filed a Trac ticket on this.
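For anyone trying to reproduce this without the attachment: a minimal sketch of the kind of rightward MPI_SENDRECV ring the report describes might look like the following. This is a hypothetical reconstruction, not the actual diffusion-mpi program; the iteration count, message size, and datatype are assumptions.

```fortran
! Hypothetical sketch of a rightward SENDRECV ring over MPI_COMM_WORLD,
! in the spirit of the test case described above (not the attached code).
program sendrecv_ring
  use mpi
  implicit none
  integer :: ierr, rank, nprocs, right, left, iter
  double precision :: sendval, recvval

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! Neighbours in a periodic ring: send to the right, receive from the left.
  right = mod(rank + 1, nprocs)
  left  = mod(rank - 1 + nprocs, nprocs)

  ! Iteration count is arbitrary; the reported hangs appeared within
  ! the first ~500 iterations.
  do iter = 1, 1000
     sendval = dble(rank + iter)
     call MPI_Sendrecv(sendval, 1, MPI_DOUBLE_PRECISION, right, 0, &
                       recvval, 1, MPI_DOUBLE_PRECISION, left,  0, &
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
  end do

  call MPI_Finalize(ierr)
end program sendrecv_ring
```

Run with the sm BTL to try to reproduce the hang (`mpirun -np 6 -mca btl self,sm ./sendrecv_ring`), or with `self,tcp` as the working comparison, as in the report.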