Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2009-03-11 12:19:41


Ralph Castain wrote:

> Could be nobody is saying anything...but I would be surprised if
> -nobody- barked at a segfault during startup.

Well, if it segfaulted during startup, someone's first reaction would
probably be, "Oh really?" They would try again, have success, attribute
it to cosmic rays, and move on. But, yes, it is presumably rare
(reasonably measured in parts per million), and the failure is early and
obvious. And it is in code that is due to change very soon.

I don't understand what's going on, but I guess each process calls
sm_btl_first_time_init(), during which it initializes its own shm_bases
value, its FIFOs, and its shm_fifo pointer. If a remote process saw those
memory operations in that order, then initialization of the shm_fifo
pointer would be a reliable indicator that the rest of the data
structures had been initialized. But there are no memory barriers
between those operations to enforce that order, so testing the shm_fifo
pointer may not really mean much. I don't know enough about memory
coherency to say for sure.
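
To make the hazard concrete, here is a stripped-down sketch of the
publish/poll pattern as I understand it. This is illustrative only, not
the actual btl_sm code; the struct, field, and function names merely
mirror the real ones:

    /* Illustrative sketch only -- not the actual btl_sm code. */

    struct fifo_t;                          /* FIFOs live in the segment */

    struct sm_slot {
        char                    *shm_base;  /* this process's mapping base */
        struct fifo_t           *fifo;      /* its FIFOs                   */
        struct fifo_t *volatile  shm_fifo;  /* stored last; peers poll it  */
    };

    /* Writer: roughly what each process does in first-time init. */
    void publish(struct sm_slot *me, char *base, struct fifo_t *fifos)
    {
        me->shm_base = base;                /* store (1) */
        me->fifo     = fifos;               /* store (2) */
        /* ... no write barrier here ... */
        me->shm_fifo = fifos;               /* store (3): "I'm ready" */
    }

    /* Reader: a remote process polling for its peer. */
    void wait_for_peer(struct sm_slot *peer)
    {
        while (peer->shm_fifo == NULL)
            ;                               /* spin until (3) is visible */
        /* Nothing guarantees stores (1) and (2) are visible yet:
         * peer->shm_base and peer->fifo may still read stale. */
    }

Even on hardware that happens to preserve store order, nothing stops the
compiler from reordering stores (1) through (3).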

I think Terry has seen
https://svn.open-mpi.org/source/xref/ompi_1.3/ompi/mca/btl/sm/btl_sm.c?r=20298#520
produce a wild "diff" value (between the local and remote "bases"), even
though it was supposed to be 0. I could see this happening if a process
saw its peer's updates to the bases and shm_fifo values in the "wrong"
order.
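
If that is really what is happening, the usual fix would be a write
barrier between the field stores and the publishing store, plus a
matching read barrier after the poll. Here is a sketch of the shape of
that fix, written with C11 atomics purely for illustration (in-tree,
something like opal_atomic_wmb()/opal_atomic_rmb() around the plain
stores would play the same role):

    /* Sketch of the barrier fix; illustrative only. */
    #include <stdatomic.h>

    struct fifo_t;

    struct sm_slot {
        char                   *shm_base;
        struct fifo_t          *fifo;
        struct fifo_t *_Atomic  shm_fifo;   /* published last */
    };

    void publish(struct sm_slot *me, char *base, struct fifo_t *fifos)
    {
        me->shm_base = base;
        me->fifo     = fifos;
        /* release: the stores above become visible before shm_fifo */
        atomic_store_explicit(&me->shm_fifo, fifos, memory_order_release);
    }

    void wait_for_peer(struct sm_slot *peer)
    {
        /* acquire: once shm_fifo is seen non-NULL, the peer's earlier
         * stores to shm_base and fifo are guaranteed visible as well */
        while (atomic_load_explicit(&peer->shm_fifo,
                                    memory_order_acquire) == NULL)
            ;
    }

With ordering like that in place, the diff computation could no longer
observe a published shm_fifo paired with a stale base.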

Jeff said he saw a problem at
https://svn.open-mpi.org/source/xref/ompi_1.3/ompi/mca/btl/sm/btl_sm.c?r=20298#529.
He says he sees reasonable values for .fifo[j][...], so this would be
harder to explain.