> On 12/12/08 8:21 PM, "Eugene Loh" <Eugene.Loh@sun.com> wrote:
> Richard Graham wrote:
> Re: [OMPI devel] shared-memory allocations
> The memory allocation is intended to take into account that two separate procs may be touching the same memory, so the intent is to reduce cache conflicts (false sharing)
> Got it. I'm totally fine with that. Separate cachelines.
> and put the memory close to the process that is using it.
> Problematic concept, but ... okay, I'll read on.
> When the code first went in, there was no explicit memory affinity implemented, so first-touch was relied on to get the memory in the "correct" location.
> If I remember correctly, the head and the tail are each written to by a different process, and are where the pointers and counters used to manage the fifo are maintained. They need to be close to the writer, and on separate cache lines, to avoid false sharing.
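A minimal C11 sketch of the "separate cache lines" idea for head and tail. The names fifo_ctl_t and on_separate_lines and the 64-byte line size are assumptions made for illustration; this is not the actual ompi_fifo_t layout.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache-line size; real hardware may differ (often 64 or 128 bytes). */
#define CACHE_LINE 64

/* Hypothetical control block, not the real ompi_fifo_t. */
typedef struct {
    /* Written by one process only (e.g. the enqueuing side). */
    alignas(CACHE_LINE) volatile uint32_t head;
    /* Written by the other process only (e.g. the dequeuing side). */
    alignas(CACHE_LINE) volatile uint32_t tail;
} fifo_ctl_t;

/* Compile-time check that the two counters are at least one line apart. */
static_assert(offsetof(fifo_ctl_t, tail) - offsetof(fifo_ctl_t, head) >= CACHE_LINE,
              "head and tail must not share a cache line");

/* Returns 1 if the two addresses fall on different cache lines. */
static int on_separate_lines(const volatile void *a, const volatile void *b)
{
    return ((uintptr_t)a / CACHE_LINE) != ((uintptr_t)b / CACHE_LINE);
}
```

With this layout a store to head by one process does not invalidate the line holding tail in the other process's cache, at the cost of one mostly empty cache line per counter.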
> Why close to the writer (versus reader)?
> Anyhow, so far as I can tell, the 2d structure ompi_fifo_t fifo[receiver][sender] is organized by receiver. That is, the main ompi_fifo_t FIFO data structures are local to receivers.
> But then, each FIFO is initialized (that is, circular buffers and associated allocations) by senders. E.g., https://svn.open-mpi.org/trac/ompi/browser/branches/v1.3/ompi/mca/btl/Smylers/btl_sm.c?version=19785#L537
> In the call to ompi_fifo_init(), all the circular buffer (CB) data structures are allocated by the sender. On different cachelines -- even different pages -- but all by the sender.
It does not make a difference who allocates it, what makes a difference is who touches it first.
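To make the allocate-versus-touch distinction concrete, here is a hedged C sketch. It assumes a POSIX system with anonymous shared mappings and a first-touch placement policy (the default on Linux, as far as I know); map_shared and first_touch are hypothetical helper names, not Open MPI functions.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict -std=c11 on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map `len` bytes of shared anonymous memory. Mapping alone normally
 * assigns no physical pages; that happens on first access to each page. */
static void *map_shared(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Write one byte per page so that the calling process is the first to
 * touch the range. Returns the number of pages written. */
static size_t first_touch(void *base, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    volatile unsigned char *b = base;
    size_t touched = 0;
    for (size_t off = 0; off < len; off += (size_t)page) {
        b[off] = 0;
        touched++;
    }
    return touched;
}
```

Under that assumption, the sender could call map_shared() and hand the region to the receiver, and the pages would still land near the receiver as long as the receiver is the one that calls first_touch() on them.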
> Specifically, one accesses the FIFO on the receiver side, then follows pointers to the sender's side. Doesn't matter if you're talking head, tail, or queue.
> The queue itself is accessed most often by the reader,
> You mean because it's polling often, but writer writes only once?
Yes - it is polling volatile memory, so has to load from memory on every read.
> so it should be closer to the reader.
> Are there measurements to substantiate this? Seems to me that in a cache-based system, a reader could poll on a remote location all it wanted and there'd be traffic only if the cached copy were invalidated. Conceivably, a transfer could go cache-to-cache and not hit memory at all. I tried some measurements and found no difference for any location -- close to writer, close to reader, or far from both.
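The polling pattern under discussion can be sketched as below, using C11 atomics instead of volatile. The one-slot mailbox_t and the functions post and poll_once are hypothetical and much simpler than the real circular buffers. On a cache-coherent machine the repeated load in poll_once is normally served from the reader's own cache until the writer's store invalidates that line, which is the behavior Eugene describes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical one-slot mailbox; the value 0 means "empty". */
typedef struct {
    _Atomic uintptr_t slot;
} mailbox_t;

/* Writer side: publish a nonzero value with release ordering. */
static void post(mailbox_t *m, uintptr_t v)
{
    assert(v != 0);
    atomic_store_explicit(&m->slot, v, memory_order_release);
}

/* Reader side: a single polling attempt. Returns the posted value and
 * marks the slot empty again, or returns 0 if nothing has been posted. */
static uintptr_t poll_once(mailbox_t *m)
{
    uintptr_t v = atomic_load_explicit(&m->slot, memory_order_acquire);
    if (v != 0)
        atomic_store_explicit(&m->slot, 0, memory_order_release);
    return v;
}
```

A reader would call poll_once() in a loop. Whether the slot's home memory is near the reader or the writer then matters mainly on the miss that follows each post, which may be why the measurements mentioned above showed no difference.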
> I honestly don't remember much about the wrapper - would have to go back to the code to look at it. If we no longer allow multiple fifos per pair, the wrapper layer can go away - it is there to manage multiple fifos per pair.
> There is support for multiple circular buffers per FIFO.
The code is there, but I believe Gleb disabled using multiple fifo's, and added a list to hold pending
messages, so now we are paying two overheads ... I could be wrong here, but am pretty sure I am not.
I don't know if George has touched the code since.
> As far as granularity of allocation - it needs to be large enough to accommodate the smallest shared memory hierarchy, so I suppose in the most general case this may be the tertiary cache?
> I don't get this. I understand how certain things should be on separate cachelines. Beyond that, we just figure out what should be local to a process and allocate all those things together. That takes us from 3*n*n allocations (and pages) to just n of them.
Not sure what your point is here. The cost per process is linear in the total number of processes, so
overall the cost scales as the number of procs squared. This was designed for small smp's, to reduce
coordination costs between processes, and where memory costs are not large. One can go to very simple
schemes that are constant with respect to memory footprint, but then pay the cost of multiple writers
to a single queue - this is what LA-MPI did.
> No reason not to allocate objects that need to be associated with the same process on the same page, as long as one avoids false sharing.
> Got it.
> So seems like each process could have all of its receive fifos on the same page, and these could share the page also with either the heads, or the tails of each queue.
Yes, this makes sense.
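One possible shape for that layout, as a rough C sketch: each receiver gets a single block that holds one cache-line-padded FIFO entry per sender, and the block is rounded up to whole pages so that different receivers never share a page. The function names and the entry size are assumptions for illustration, not a proposal for the actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Round x up to the next multiple of a (a must be nonzero). */
static size_t round_up(size_t x, size_t a)
{
    assert(a != 0);
    return (x + a - 1) / a * a;
}

/* Byte offset of the FIFO entry for `sender` inside one receiver's block.
 * Each entry is padded to a whole number of cache lines, so two entries
 * never share a line. */
static size_t fifo_offset(size_t sender, size_t entry_size, size_t line)
{
    return sender * round_up(entry_size, line);
}

/* Size of one receiver's block holding n_procs entries, rounded up to
 * whole pages so the receiver can first-touch it as a unit. */
static size_t recv_block_size(size_t n_procs, size_t entry_size,
                              size_t line, size_t page)
{
    return round_up(fifo_offset(n_procs, entry_size, line), page);
}
```

For example, with 8 processes, 40-byte entries, 64-byte lines and 4096-byte pages, each receiver's block fits in one page, so the node needs n such blocks instead of on the order of 3*n*n separate allocations.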
> I will propose some specifics and run them by y'all. I think I know enough to get started. Thanks for the comments.