Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Hang in collectives involving shared memory
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2009-06-12 15:11:03

Sylvain Jeaugey wrote:

> Hi Ralph,
> I managed to get a deadlock after a whole night, but not the same one
> you have: after a quick analysis, process 0 seems to be blocked in the
> very first send through shared memory. Still maybe a bug, but not the
> same as yours IMO.

Yes, that's the one Terry and I have tried to hunt down. Kind of
elusive. Apparently, there is a race condition in sm start-up. It
*appears* as though a process (the lowest rank on a node?) computes
offsets into shared memory using bad values and ends up with a FIFO
pointer to the wrong spot. Up through 1.3.1, this meant that OMPI would
fail in add_procs()... Jeff and Terry have seen a couple of these. With
changes to sm in 1.3.2, the failure expresses itself differently... not
until the first send (namely, first use of a remote FIFO). At least
that's my understanding. George added some sync to the code to make it
bulletproof. But that doesn't seem to have fixed the problem. Sigh.
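
To make the suspected failure mode concrete, here is a minimal sketch of the general pattern, not the actual sm BTL code. One side publishes an offset into a shared segment, and the other must not compute a FIFO pointer from that offset until it has been published. Everything here is an assumption made for illustration: the names (seg_header_t, fifo_offset, owner, peer, run_demo) are hypothetical, two threads stand in for two processes on a node, and a static struct stands in for the mmap'ed region.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

#define DATA_SIZE 4096

/* Hypothetical header at the start of the shared segment. */
typedef struct {
    atomic_int ready;     /* 0 until fifo_offset has been published */
    size_t fifo_offset;   /* offset of the FIFO within data[] */
} seg_header_t;

/* Stand-in for the shared-memory region (zero-initialized). */
static struct {
    seg_header_t hdr;
    unsigned char data[DATA_SIZE];
} seg;

/* Owner: set up the FIFO, then publish its offset with release semantics. */
static void *owner(void *arg)
{
    (void)arg;
    size_t off = 256;
    seg.data[off] = 42;                 /* pretend FIFO contents */
    seg.hdr.fifo_offset = off;
    atomic_store_explicit(&seg.hdr.ready, 1, memory_order_release);
    return NULL;
}

/* Peer: wait for the offset to be published before using it.
 * Without this wait, the peer could read a stale or zero offset and
 * end up with a FIFO pointer to the wrong spot. */
static void *peer(void *arg)
{
    while (atomic_load_explicit(&seg.hdr.ready, memory_order_acquire) == 0)
        sched_yield();
    size_t off = seg.hdr.fifo_offset;
    assert(off < DATA_SIZE);            /* sanity-check the offset */
    *(int *)arg = seg.data[off];
    return NULL;
}

/* Runs the peer first so it really has to wait; returns what it read. */
int run_demo(void)
{
    pthread_t a, b;
    int seen = -1;
    atomic_store(&seg.hdr.ready, 0);
    pthread_create(&b, NULL, peer, &seen);
    pthread_create(&a, NULL, owner, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return seen;
}
```

The release store paired with the acquire load is what guarantees the peer sees the offset and the FIFO contents written before the flag was set. Whether the real start-up race has this shape is exactly what is not yet understood, so treat this only as an illustration of the class of bug.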

Anyhow, I think you ran into a different but known yet not understood