Richard Graham wrote:
Re: [OMPI devel] shared-memory allocations
The memory allocation is intended to take
into account that two separate procs may be touching the same memory,
so as to reduce cache conflicts (false sharing)
Got it. I'm totally fine with that. Separate cachelines.
and put the memory close to the process that
is using it.
Problematic concept, but ... okay, I'll read on.
When the code first went in, there was no
explicit memory affinity implemented, so first-touch was relied on to
get the memory in the “correct” location.
If I remember correctly, the head and the
tail are each written by a different process, and are where the
pointers and counters used to manage the fifo are maintained. They
need to be close to the writer, and on separate cache lines, to avoid
false sharing.
Why close to the writer (versus reader)?
Anyhow, so far as I can tell, the 2d structure ompi_fifo_t
fifo[receiver][sender] is organized by receiver. That is, the main
ompi_fifo_t FIFO data structures are local to receivers.
But then, each FIFO is initialized (that is, circular buffers and
associated allocations) by senders. E.g.,
In the call to ompi_fifo_init(), all the circular buffer (CB) data
structures are allocated by the sender. On different cachelines --
even different pages -- but all by the sender.
Specifically, one accesses the FIFO on the receiver side and then
follows pointers to the sender's side. It doesn't matter whether
you're talking about the head, the tail, or the queue.
The queue itself is accessed most often by the reader,
You mean because the reader polls often, but the writer writes only once?
so it should be closer to the reader.
Are there measurements to substantiate this? Seems to me that in a
cache-based system, a reader could poll on a remote location all it
wanted and there'd be traffic only if the cached copy were
invalidated. Conceivably, a transfer could go cache-to-cache and not
hit memory at all. I tried some measurements and found no difference
for any location -- close to writer, close to reader, or far from both.
I honestly don’t remember much about the
wrapper – I would have to go back to the code to look at it. If we no
longer allow multiple fifos per pair, the wrapper layer can go away – it
is there to manage multiple fifos per pair.
There is support for multiple circular buffers per FIFO.
As far as granularity of allocation goes – it
needs to be large enough to accommodate the smallest level of the
shared-memory hierarchy, so I suppose in the most general case this may
be the tertiary (L3) cache?
I don't get this. I understand how certain things should be on
separate cachelines. Beyond that, we just figure out what should be
local to a process and allocate all those things together. That takes
us from 3*n*n allocations (and pages) to just n of them.
No reason not to allocate objects that need
to be associated with the same process on the same page, as long as one
avoids false sharing.
So it seems like each process could have all of
its receive fifos on the same page, and these could also share the page
with either the heads or the tails of each queue.
I will propose some specifics and run them by y'all. I think I know
enough to get started. Thanks for the comments.