Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Random hangs using btl sm with OpenMPI 1.3.2/1.3.3 + gcc4.4?
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2009-09-23 15:05:22


Jonathan Dursi wrote:

> Continuing the conversation with myself:

Sorry to interrupt... :^)

Okay, I managed to reproduce the hang. I'll try to look at this.

>
> Google pointed me to Trac ticket #1944, which describes deadlocks in
> looped collective operations. There is no collective operation
> anywhere in this sample code, but one of the suggested workarounds,
> setting btl_sm_num_fifos to at least (np-1), seems to make things
> work quite reliably for both OpenMPI 1.3.2 and 1.3.3. That is, while
> this
>
> mpirun -np 6 -mca btl sm,self ./diffusion-mpi
>
> invariably hangs (at random-seeming numbers of iterations) with
> OpenMPI 1.3.2 and sometimes hangs (maybe 10% of the time, again
> seemingly randomly) with 1.3.3,
>
> mpirun -np 6 -mca btl tcp,self ./diffusion-mpi
>
> or
>
> mpirun -np 6 -mca btl_sm_num_fifos 5 -mca btl sm,self ./diffusion-mpi
>
> always succeeds, with (as one might guess) the second (sm with the
> extra FIFOs) being much faster...
>
> Jonathan
>
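
For anyone scripting the workaround quoted above, here is a minimal sketch that derives btl_sm_num_fifos from the process count as (np-1). The function name sm_workaround_cmd is hypothetical, and clamping to at least 1 FIFO for np=1 is my own assumption. It only prints the command line and does not run it.

```shell
# Hypothetical helper: build the mpirun command line from the thread,
# with btl_sm_num_fifos set to (np - 1) as the workaround suggests.
sm_workaround_cmd() {
    np="$1"
    shift
    fifos=$((np - 1))
    # Assumption: never ask for fewer than one FIFO.
    if [ "$fifos" -lt 1 ]; then
        fifos=1
    fi
    echo "mpirun -np $np -mca btl_sm_num_fifos $fifos -mca btl sm,self $*"
}

sm_workaround_cmd 6 ./diffusion-mpi
```

For np=6 this reproduces the third command in the quoted message. To actually launch the job, the printed line could be passed to the shell, for example with eval.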