Richard Graham wrote:
Re: [OMPI devel] shared-memory allocations
It does not make a difference who allocates
it, what makes a difference is who touches it first.
Fair enough, but the process that allocates the memory immediately
starts initializing it. So, each circular buffer is set up (allocated
and initialized/touched) by the sender.
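To make the first-touch point concrete, here is a minimal sketch (the name `setup_cb` is hypothetical, not OMPI code): under Linux's default NUMA policy it is the first write, not the mmap call itself, that binds each page to the toucher's node.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical sketch: the sender maps the circular buffer and
 * immediately zeroes it.  Anonymous pages are zero-filled anyway,
 * so the memset exists purely to *touch* every page -- under the
 * first-touch policy that write is what places the pages on the
 * sender's NUMA node. */
static void *setup_cb(size_t len)
{
    void *cb = mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (cb == MAP_FAILED)
        return NULL;
    memset(cb, 0, len);   /* first touch: pages land on this node */
    return cb;
}
```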
> There is support for multiple circular
buffers per FIFO.
I think there is support for multiple CBs (circular buffers) per FIFO.
This is why there was that recent bug about sm hanging on
unidirectional messaging after so many iterations. The sender would
keep allocating room for the eager free list and for the outbound FIFO
until the shared-memory area was filled. Both the eager free list and
the FIFO could grow "unbounded" (until the shared-memory area was
filled).
The code is there, but I believe Gleb disabled the use of multiple
FIFOs and added a list to hold pending messages, so now we are paying
two overheads ... I could be wrong here, but am pretty sure I am not. I
don't know if George has touched the code since.
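For reference, the single-writer/single-reader circular buffer under discussion can be sketched roughly as below (illustrative names and sizes, not the actual sm BTL code). When `cb_push` fails because the CB is full and growing is disabled, the sender falls back to queueing on the pending-message list — the second of the two overheads mentioned.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical single-producer/single-consumer circular buffer.
 * One slot is sacrificed to distinguish full from empty. */
#define CB_SLOTS 8   /* illustrative size, not OMPI's */

typedef struct {
    void * volatile slot[CB_SLOTS];
    volatile int head;   /* written only by the sender   */
    volatile int tail;   /* written only by the receiver */
} cb_t;

static bool cb_push(cb_t *cb, void *msg)
{
    int next = (cb->head + 1) % CB_SLOTS;
    if (next == cb->tail)
        return false;        /* full: caller queues on a pending list */
    cb->slot[cb->head] = msg;
    cb->head = next;
    return true;
}

static void *cb_pop(cb_t *cb)
{
    if (cb->tail == cb->head)
        return NULL;         /* empty */
    void *msg = cb->slot[cb->tail];
    cb->tail = (cb->tail + 1) % CB_SLOTS;
    return msg;
}
```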
The cost per process is linear in the total
number of processes, so overall the cost scales as the number of
processes squared. This was designed for small SMPs, to reduce
coordination costs between processes, and for settings where memory
costs are not large. One can go to very simple schemes that are
constant with respect to memory footprint, but then pay the cost of
multiple writers to a single queue - this is what LA-MPI did.
The point was that there are these O(3n^2) allocations -- sometimes
just 12 or 64 bytes apiece -- that are taking up an entire page each
due to page alignment. I understand we're choosing to have O(n^2)
FIFOs. I'm just saying that by aggregating these numerous tiny
allocations, we can make them take up 100x less space.
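The arithmetic behind that claim can be sketched with hypothetical helpers and illustrative numbers: with n = 16 processes, roughly 3·n² = 768 tiny control structures at one 4 KiB page apiece cost about 3 MiB, whereas packing them back-to-back at 64-byte alignment costs only 12 pages — a 64x reduction for these numbers, the same order of magnitude as the 100x cited above.

```c
#include <stddef.h>

/* Hypothetical back-of-envelope helpers: compare one-page-per-allocation
 * against packing allocations contiguously at cache-line alignment. */
static size_t paged_bytes(size_t nalloc, size_t page)
{
    return nalloc * page;                      /* one full page each */
}

static size_t packed_bytes(size_t nalloc, size_t each,
                           size_t align, size_t page)
{
    size_t rounded = (each + align - 1) / align * align;
    size_t total   = nalloc * rounded;         /* packed contiguously */
    return (total + page - 1) / page * page;   /* whole pages consumed */
}
```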
Patrick Geoffray wrote:
Thanks for all the comments. I think I follow all the reasoning, but
what I was trying to figure out was if the design were based solely on
such reasoning, or also on performance measurements. Again, I tried
some experiments. I had two processes pingpong via shared memory and I
moved the processes and the memory around -- local to sender, local to
receiver, remote from both, etc. I found the pingpong time depended
only on the relative positions of the sender and the receiver. It was
unrelated to the position of the shared memory backing the shared
variables. E.g., if the sender and receiver were collocated, I got
best performance -- even if the shared memory was remote to both of
them! I don't know how general this result is, but it's at least one
data point suggesting that the design may be based on reasoning that
might be incomplete.
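For what it's worth, the mechanism of such a pingpong can be sketched as two processes bouncing a shared counter over an anonymous shared mapping (a simplified reconstruction, not the actual benchmark; the process and memory placement that the experiment varied would be controlled externally, e.g. with numactl, and timing is omitted here):

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

enum { ITERS = 1000 };

/* Spin until it is our turn (even counter = parent, odd = child),
 * then bump the counter to hand the "ball" back. */
static void bounce(volatile int *ball, int parity)
{
    for (int i = 0; i < ITERS; i++) {
        while (*ball % 2 != parity)
            ;                        /* poll the shared counter */
        (*ball)++;
    }
}

static int run_pingpong(void)
{
    volatile int *ball = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (ball == MAP_FAILED)
        return -1;
    *ball = 0;
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {                  /* child takes the odd turns */
        bounce(ball, 1);
        _exit(0);
    }
    bounce(ball, 0);                 /* parent takes the even turns */
    waitpid(pid, NULL, 0);
    return *ball;                    /* 2 * ITERS after a clean run */
}
```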
Yes - it is polling volatile memory, so has
to load from memory on every read.
Actually, it will poll in cache, and only load from memory when the
cache coherency protocol invalidates the cache line. The volatile
qualifier only prevents compiler optimizations.
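A minimal illustration of the loop shape in question (hypothetical names): `volatile` obliges the compiler to emit a load on every iteration, but nothing about that load forces a trip to DRAM — it is served from the local cache until the writer's store invalidates the line.

```c
/* `flag` is re-read on every iteration because it is volatile, yet
 * each of those reads is normally a cache hit; only a store by the
 * writer invalidates the line and forces a refill. */
static volatile int flag;

static int wait_for_flag(void)
{
    int polls = 0;
    while (!flag)
        polls++;          /* compiler may not hoist the load out */
    return polls;
}
```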
It does not matter much where the pages are (closer to reader or
writer) on NUMA systems, as long as they are equally distributed among
all sockets (i.e., the choice is consistent). Cache prefetching is
slightly more efficient on the local socket, so closer to the reader
may be a bit better.
No big deal, but I just wanted to understand the motivation and
rationale for what I see in the code.