Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] SM init failures
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2009-03-30 14:59:56


Patrick Geoffray wrote:

> Jeff Squyres wrote:
>
>> Why not? The "owning" process can do the touch; then it'll be
>> affinity'ed properly. Right?
>
> Yes, that's what I meant by forcing allocation. From the thread, it
> looked like nobody touched the pages of the mapped file. If it's
> already done, no need to write in the whole file.

The shared area is used for two kinds of data structures: FIFOs and
fragments. Fragments are first touched (written) by their senders.
FIFOs are more complicated: up to 1.3.1, they were mapped all over the
place -- parts local to the sender and parts local to the receiver.
Receivers would touch their parts. Once senders believed the receivers
had set their stuff up, the senders would initialize their own parts.

The failures that Jeff and Terry saw "0.01%" of the time looked to me
like a memory race condition. That is, a receiver would initialize
some memory and then publish a pointer. A sender, upon seeing the
pointer, would assume the corresponding memory was initialized. But
there weren't a whole lot of memory barriers anywhere, and I've
wondered whether the sender might see the memory as it was before
initialization. I just don't know.
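To make the suspected race concrete, here is a minimal sketch of the
initialize-then-publish pattern with explicit ordering, written with C11
atomics. It is an illustration only, not the sm BTL's code: fifo_t, published,
receiver, sender_wait_for_fifo and run_handoff are hypothetical names, and two
threads in one process stand in for two processes sharing a mapped file.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a FIFO control block. */
typedef struct {
    int head;
    int tail;
    int size;
} fifo_t;

static fifo_t fifo_storage;                 /* stands in for shared memory */
static _Atomic(fifo_t *) published = NULL;  /* pointer the receiver publishes */

/* Receiver: initialize the FIFO, then publish the pointer. The release
 * store keeps the initializing writes from being reordered after it. */
static void *receiver(void *arg) {
    (void)arg;
    fifo_storage.head = 0;
    fifo_storage.tail = 0;
    fifo_storage.size = 128;
    atomic_store_explicit(&published, &fifo_storage, memory_order_release);
    return NULL;
}

/* Sender: spin until the pointer appears. The acquire load pairs with the
 * release store, so the initialized fields are visible once the pointer is. */
static int sender_wait_for_fifo(void) {
    fifo_t *f;
    while ((f = atomic_load_explicit(&published, memory_order_acquire)) == NULL) {
        /* busy-wait; a real implementation would make progress elsewhere */
    }
    return f->size;
}

/* Run one handoff and return the size the sender observed (-1 on error). */
static int run_handoff(void) {
    pthread_t t;
    atomic_store(&published, NULL);
    if (pthread_create(&t, NULL, receiver, NULL) != 0) {
        return -1;
    }
    int size = sender_wait_for_fifo();
    pthread_join(t, NULL);
    return size;
}
```

Without the release/acquire pairing (or equivalent barriers), nothing stops
the sender from observing the pointer before it observes the initialized
fields, which is the kind of failure speculated about above.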

The failures that occur "1%" of the time (e.g., in the MTT logs Ralph
noted recently) might be something else.

Anyhow, the first touch should all be happening properly from an
affinity point of view, and the reason we want zerofill is so that the
sender/receiver coordination happens properly (though there may be
other ways of addressing that). And, most of all, lots of mysteries
remain.