Jeff Squyres wrote:
> FWIW, George found what looks like a race condition in the sm init
> code today -- it looks like we don't call maffinity anywhere in the
> sm btl startup, so we're not actually guaranteed that the memory is
> local to any particular process(or) (!). This race shouldn't cause
> segvs, though; it should only mean that memory is potentially farther
> away than we intended.
Is this that business that came up recently on one of these mailing lists
about setting the memory node to -1 rather than using the value we know
it should be? In mca_mpool_sm_alloc(), I do see a call to
> The central question is: does "first touch" mean both read and
> write? I.e., is the first process that either reads *or* writes to a
> given location considered "first touch"? Or is it only the first write?
So, maybe the strategy is to create the shared area, have each process
initialize its portion (FIFOs and free lists), have all processes sync,
and then move on. That way, you know all memory will be written by the
appropriate owner before it's read by anyone else. First-touch
ownership will then be correct, and we won't depend on zero-filled pages.
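
To make the ordering concrete, here is a minimal sketch of that strategy. It is not the sm BTL code: the function name first_touch_demo, the one-page-per-process layout, and the counter used as a barrier are all made up for illustration. It also assumes fork() and an anonymous shared mmap() in place of the real shared-memory file. It does no processor binding and says nothing about where pages actually land; it only shows that every slice is written by its owner before any peer reads it, and that the sync counter is set explicitly rather than trusted to be zero.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              /* for MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo, not Open MPI code.
 * Layout: page 0 holds the sync counter, page (rank + 1) is rank's slice
 * (standing in for that process's FIFOs and free lists).
 * Returns 0 if every process saw every slice initialized by its owner. */
int first_touch_demo(int nprocs)
{
    assert(nprocs > 0 && nprocs < 255);

    size_t page  = (size_t)sysconf(_SC_PAGESIZE);
    size_t total = page * ((size_t)nprocs + 1);

    /* Step 1: create the shared area. */
    unsigned char *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    /* Set the counter explicitly; do not rely on zero-filled pages. */
    atomic_int *ready = (atomic_int *)base;
    atomic_init(ready, 0);

    int spawned = 0, failed = 0;
    for (int rank = 0; rank < nprocs; rank++) {
        pid_t pid = fork();
        if (pid < 0) {
            atomic_store(ready, nprocs);   /* release any waiting children */
            failed = 1;
            break;
        }
        if (pid == 0) {
            /* Step 2: the owner writes its own slice before anyone reads it. */
            memset(base + page * (size_t)(rank + 1), rank + 1, page);

            /* Step 3: sync. Nobody proceeds until all slices are written. */
            atomic_fetch_add(ready, 1);
            while (atomic_load(ready) < nprocs)
                sched_yield();

            /* Step 4: move on. Reading peers' slices is now safe. */
            int ok = 1;
            for (int peer = 0; peer < nprocs; peer++) {
                const unsigned char *p = base + page * (size_t)(peer + 1);
                if (p[0] != peer + 1 || p[page - 1] != peer + 1)
                    ok = 0;
            }
            _exit(ok ? 0 : 1);
        }
        spawned++;
    }

    /* Parent: reap the children and collect their verdicts. */
    for (int i = 0; i < spawned; i++) {
        int status;
        if (wait(&status) < 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            failed = 1;
    }

    munmap(base, total);
    return failed ? -1 : 0;
}
```

Calling first_touch_demo(4) from a main() should return 0 when all four children see fully initialized slices. In the real code each process would be bound to its processor before the write in step 2; the sketch leaves that out.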
The big remaining problem, in my mind, is that we don't seem to know how to
reproduce the failure (segv) that we're trying to fix. I, personally,
am reluctant to stick fixes into the code for problems I can't observe.