Jonathan Dursi wrote:
> Continuing the conversation with myself:
> Google pointed me to Trac ticket #1944, which spoke of deadlocks in
> looped collective operations; there is no collective operation
> anywhere in this sample code, but trying one of the suggested
> workarounds/clues: that is, setting btl_sm_num_fifos to at least
> (np-1) seems to make things work quite reliably, for both OpenMPI
> 1.3.2 and 1.3.3; that is, while this
> mpirun -np 6 -mca btl sm,self ./diffusion-mpi
> invariably hangs (at random-seeming numbers of iterations) with
> OpenMPI 1.3.2 and sometimes hangs (maybe 10% of the time, again
> seemingly randomly) with 1.3.3,
> mpirun -np 6 -mca btl tcp,self ./diffusion-mpi
> mpirun -np 6 -mca btl_sm_num_fifos 5 -mca btl sm,self ./diffusion-mpi
> always succeeds, with (as one might guess) the second being much faster.
The btl_sm_num_fifos thing doesn't, on the surface, make much sense to
me. That parameter presumably controls the number of receive FIFOs per
process. The default became 1, which could conceivably change behavior
if multiple senders all write to the same FIFO. But your sample program
has only one-to-one connections: each receiver has a single sender. So
the number of FIFOs shouldn't matter; bumping the number up only means
you allocate some FIFOs that are never used.
Hmm. Continuing the conversation with myself, maybe that's not entirely
true. Whatever fragments a process sends must eventually be returned to
it by the receiver. So a process receives not only messages from its
left neighbor but also returned fragments from its right. Still, why
would np-1 FIFOs be needed? Why not just 2?
And, as Jeff points out, everyone should be staying in pretty good sync
with the Sendrecv pattern. So, how could there be a problem at all?
Like Jeff, my attempts so far to reproduce the problem (with
hardware/software conveniently accessible to me) have come up empty.