Sylvain Jeaugey wrote:
> Hi Ralph,
> I managed to get a deadlock after a whole night, but not the same one
> you have: after a quick analysis, process 0 seems to be blocked in the
> very first send through shared memory. Still maybe a bug, but not the
> same as yours IMO.
Yes, that's the one Terry and I have tried to hunt down. Kind of
elusive. Apparently, there is a race condition in sm start-up. It
*appears* as though a process (the lowest rank on a node?) computes
offsets into shared memory using bad values and ends up with a FIFO
pointer to the wrong spot. Up through 1.3.1, this meant that OMPI would
fail in add_procs()... Jeff and Terry have seen a couple of these. With
changes to sm in 1.3.2, the failure expresses itself differently... not
until the first send (namely, first use of a remote FIFO). At least
that's my understanding. George added some sync to the code to make it
bulletproof, but that doesn't seem to have fixed the problem. Sigh.
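
To make the suspected failure mode concrete, here is a minimal sketch of the kind of start-up handshake I have in mind. It is NOT the actual sm BTL code: the segment layout, the names (segment_t, owner_publish, peer_attach, demo_startup) and the use of C11 atomics are all assumptions made for illustration. One process publishes a FIFO offset into a shared segment and only then raises a "ready" flag; the peer must not compute its FIFO pointer until it has seen that flag, otherwise it reads a garbage offset and points at the wrong spot.

```c
#define _DEFAULT_SOURCE            /* for MAP_ANONYMOUS under strict -std modes */
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical segment layout; not Open MPI's real data structures. */
typedef struct {
    atomic_int ready;        /* 0 until fifo_offset has been published */
    size_t     fifo_offset;  /* byte offset of the FIFO from the segment base */
} seg_header_t;

typedef struct {
    int head;
    int tail;
    int slots[8];
} fifo_t;

typedef struct {
    seg_header_t hdr;
    fifo_t       fifo;
} segment_t;

/* Owner side: fill in the FIFO and its offset, THEN raise the flag.
 * The release store orders the earlier plain writes before the flag. */
static void owner_publish(segment_t *seg)
{
    seg->fifo.head = 0;
    seg->fifo.tail = 1;
    seg->fifo.slots[0] = 42;
    seg->hdr.fifo_offset = offsetof(segment_t, fifo);
    atomic_store_explicit(&seg->hdr.ready, 1, memory_order_release);
}

/* Peer side: wait for the flag before touching fifo_offset. Skipping this
 * wait is the race: the offset may still be zero or stale. */
static fifo_t *peer_attach(segment_t *seg)
{
    while (atomic_load_explicit(&seg->hdr.ready, memory_order_acquire) == 0) {
        sched_yield();
    }
    /* Sanity check: a valid FIFO cannot overlap the header. */
    assert(seg->hdr.fifo_offset >= sizeof(seg_header_t));
    return (fifo_t *)((char *)seg + seg->hdr.fifo_offset);
}

/* Returns 1 if the peer ends up with the right FIFO pointer, 0 otherwise. */
int demo_startup(void)
{
    segment_t *seg = mmap(NULL, sizeof *seg, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (seg == MAP_FAILED) {
        return 0;
    }
    atomic_init(&seg->hdr.ready, 0);

    pid_t pid = fork();
    if (pid < 0) {
        munmap(seg, sizeof *seg);
        return 0;
    }
    if (pid == 0) {              /* child plays the owner of the segment */
        owner_publish(seg);
        _exit(0);
    }

    fifo_t *fifo = peer_attach(seg);   /* parent plays the remote peer */
    int ok = (fifo == &seg->fifo) && (fifo->slots[0] == 42);

    waitpid(pid, NULL, 0);
    munmap(seg, sizeof *seg);
    return ok;
}
```

Calling demo_startup() from a main() should return 1 on a platform where atomic_int is lock-free. Note that the sketch spins forever if the owner dies before publishing; real code would need a timeout or some other way out.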
Anyhow, I think you ran into a different but known yet not understood