Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] shared-memory allocations
From: Richard Graham (rlgraham_at_[hidden])
Date: 2008-12-12 14:48:06

It has been a long time since I wrote the original code, and things have
changed a fair amount since that time, so bear this in mind.

The memory allocation is intended to take into account that two separate
procs may be touching the same memory, so the intent is to reduce cache
conflicts (false sharing) and put the memory close to the process that is
using it. When the code first went in, there was no explicit memory
affinity implemented, so first-touch was relied on to get the memory in the
"correct" location.

If I remember correctly, the head and the tail are each written to by a
different process, and are where the pointers and counters used to manage the
fifo are maintained. They need to be close to the writer, and on separate
cache lines, to avoid false sharing. The queue itself is accessed most
often by the reader, so it should be closer to the reader. I honestly don't
remember much about the wrapper -- I would have to go back to the code to look
at it. If we no longer allow multiple fifos per pair, the wrapper layer can
go away -- it is there to manage multiple fifos per pair.
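As a rough illustration of the "separate cache lines" point, here is a minimal
C11 sketch. The type names (fifo_ctl_t, fifo_hdr_t), the helper
same_cache_line, and the 64-byte line size are assumptions made for this
example; they are not taken from the Open MPI source.

```c
#include <assert.h>    /* static_assert (C11), assert */
#include <stdalign.h>  /* alignas */
#include <stddef.h>    /* offsetof */
#include <stdint.h>    /* uint32_t, uintptr_t */

/* Assumed cache-line size; real hardware varies. */
#define CACHE_LINE 64

/* Hypothetical control block: the index and counter that one side updates. */
typedef struct {
    uint32_t index;
    uint32_t count;
} fifo_ctl_t;

/* Hypothetical FIFO header: head and tail each start on their own cache
 * line, so the process updating one does not invalidate the line that
 * holds the other. */
typedef struct {
    alignas(CACHE_LINE) fifo_ctl_t head;
    alignas(CACHE_LINE) fifo_ctl_t tail;
} fifo_hdr_t;

/* Compile-time checks that the layout really separates the two blocks. */
static_assert(offsetof(fifo_hdr_t, tail) - offsetof(fifo_hdr_t, head) >= CACHE_LINE,
              "head and tail must not share a cache line");
static_assert(sizeof(fifo_hdr_t) % CACHE_LINE == 0,
              "header must occupy whole cache lines");

/* Returns nonzero when two addresses fall on the same cache line. */
static int same_cache_line(const void *a, const void *b)
{
    return (uintptr_t)a / CACHE_LINE == (uintptr_t)b / CACHE_LINE;
}
```

Note that this only addresses false sharing. Placing the head and tail
physically near their writers (first touch or explicit affinity) is a separate
question that alignment alone does not settle.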

As far as granularity of allocation -- it needs to be large enough to
accommodate the smallest shared memory hierarchy, so I suppose in the most
general case this may be the tertiary cache?

No reason not to allocate objects that need to be associated with the same
process on the same page, as long as one avoids false sharing. So it seems
like each process could have all of its receive fifos on the same page,
and these could also share the page with either the heads or the tails of each
fifo.
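One way to picture this aggregation is a per-process region in which each
receive fifo gets a cache-line-padded slot (its queue plus one control block),
with only the region as a whole rounded up to a page. The sketch below is
hypothetical: the function names, the 64-byte cache line and the 4 KB page are
assumptions, and the sizes used afterwards are the rough figures from the
quoted message.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64    /* assumed cache-line size */
#define PAGE_SIZE  4096  /* assumed page size */

/* Round n up to the next multiple of align (align must be > 0). */
static size_t round_up(size_t n, size_t align)
{
    return (n + align - 1) / align * align;
}

/* Bytes one fifo occupies in the region: the queue plus one control block
 * (head or tail), each padded to a cache-line boundary. */
static size_t fifo_slot_bytes(size_t queue_bytes, size_t ctl_bytes)
{
    return round_up(queue_bytes, CACHE_LINE) + round_up(ctl_bytes, CACHE_LINE);
}

/* Byte offset of the i-th receive fifo within a process's region. */
static size_t fifo_offset(size_t i, size_t queue_bytes, size_t ctl_bytes)
{
    return i * fifo_slot_bytes(queue_bytes, ctl_bytes);
}

/* Page-aligned size of a region holding nfifos receive fifos. */
static size_t region_bytes(size_t nfifos, size_t queue_bytes, size_t ctl_bytes)
{
    return round_up(nfifos * fifo_slot_bytes(queue_bytes, ctl_bytes), PAGE_SIZE);
}
```

With a 1024-byte queue and a 12-byte control block, each slot is 1088 bytes, so
a process with 511 peers would need 557,056 bytes (136 pages) for all of its
receive fifos under these assumptions.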

Make sense ?

On 12/10/08 1:11 PM, "Eugene Loh" <Eugene.Loh_at_[hidden]> wrote:

> For shared memory communications, each on-node connection (non-self,
> sender-receiver pair) gets a circular buffer during MPI_Init(). Each CB
> requires the following allocations:
> *) ompi_cb_fifo_wrapper_t (roughly 64 bytes)
> *) ompi_cb_fifo_ctl_t head (roughly 12 bytes)
> *) ompi_cb_fifo_ctl_t tail (roughly 12 bytes)
> *) queue (roughly 1024 bytes)
> Importantly, the current code lays these four allocations out on three
> separate pages. (The tail and queue are aggregated together.) So, for
> example, that "head" allocation (12 bytes) ends up consuming a full page.
> As one goes to more and more on-node processes -- say, for a large SMP
> or a multicore system -- the number of non-self connections grows as
> n*(n-1). So, these circular-buffer allocations end up consuming a lot
> of shared memory.
> For example, for a 4K pagesize and n=512 on-node processes, the circular
> buffers consume 3 Gbyte of memory -- 90% of which is empty and simply
> used for page alignment.
> I'd like to aggregate more of these allocations so that:
> *) shared-memory consumption is reduced
> *) the number of allocations (and hence the degree of lock contention)
> during MPI_Init is reduced
> Any comments?
> I'd like to understand the original rationale for these page
> alignments. I expect this is related to memory placement of pages. So,
> I imagine three scenarios. Which is it?
> A) There really is a good reason for each allocation to have its own
> page and any attempt to aggregate is doomed.
> B) There is actual benefit for placing things carefully in memory, but
> substantial aggregation is still possible. That is, for n processes, we
> need at most n different allocations -- not 3*n*(n-1).
> C) There is no actual justification for having everything on different
> pages. That is, allowing different parts of a FIFO CB to be mapped
> differently to physical memory sounded to someone like a good idea at
> the time, but no one really did any performance measurements to justify
> this. Or, if they did, it was only on one platform and we have no
> evidence that the same behavior exists on all platforms. Personally,
> I've played with some simple experiments on one (or more?) platforms and
> found no performance variations due to placement of shared variables
> that two processes use for communication. I guess it's possible that
> data is moving cache-to-cache and doesn't care where the backing memory is.
> Note that I only want to reduce the number of page-aligned allocations.
> I'd preserve cacheline alignment. So, no worry about false sharing due
> to a sender thrashing on one end of a FIFO and a receiver on the other.
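The 3 Gbyte figure in the quoted message can be checked with a few lines of
arithmetic. This sketch assumes three 4 KB pages per circular buffer and sums
the "roughly" sizes listed there (64 + 12 + 12 + 1024 = 1112 bytes) as the
useful payload; the function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Total bytes consumed when each of the n*(n-1) connections takes
 * pages_per_cb whole pages. */
static uint64_t cb_total_bytes(uint64_t n, uint64_t pages_per_cb,
                               uint64_t page_size)
{
    return n * (n - 1) * pages_per_cb * page_size;
}

/* Bytes that actually hold circular-buffer data across all connections. */
static uint64_t cb_used_bytes(uint64_t n, uint64_t bytes_per_cb)
{
    return n * (n - 1) * bytes_per_cb;
}
```

For n = 512, cb_total_bytes(512, 3, 4096) gives 3,214,934,016 bytes (about
3 Gbyte), while cb_used_bytes(512, 1112) gives 290,934,784 bytes, roughly 9% of
the total. That agrees with the statement that about 90% of the space is
padding for page alignment.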