Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: Fragmented sm Allocations
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-01-14 09:50:32


Whoa, this analysis rocks. :-) I'm going through trying to grok it
all...

Just wanted to say: kudos for this.

On Jan 14, 2009, at 1:14 AM, Eugene Loh wrote:

>
> RFC: Fragmented sm Allocations
> WHAT: Dealing with the fragmented allocations of sm BTL FIFO
> circular buffers (CB) during MPI_Init().
>
> Also:
>
> • Improve handling of error codes.
> • Automate the sizing of the mmap file.
> WHY: To reduce consumption of shared memory, make job startup more
> robust, and possibly improve the scalability of startup.
>
> WHERE: In mca_btl_sm_add_procs(), there is a loop over calls to
> ompi_fifo_init(). This is where CBs are initialized one at a time, with
> the components of each CB allocated individually. Changes can be seen
> in ssh://www.open-mpi.org/~eugene/hg/sm-allocation.
>
> WHEN: Upon acceptance.
>
> TIMEOUT: January 30, 2009.
>
> WHY (details)
> The sm BTL establishes a FIFO for each non-self, on-node connection.
> Each FIFO is initialized during MPI_Init() with a circular buffer
> (CB). (More CBs can be added later in program execution if a FIFO
> runs out of room.)
>
> A CB has different components that are used in different ways:
>
> • The "wrapper" is read by both sender and receiver, but is rarely
> written.
> • The "queue" (FIFO entries) is accessed by both the sender and
> receiver.
> • The "head" is accessed by the sender.
> • The "tail" is accessed by the receiver.
> For performance reasons, a CB is not allocated as one large data
> structure. Rather, these components are laid out separately in
> memory and the wrapper has pointers to the various locations.
> Performance considerations include:
>
> • false sharing: a component used by one process should not share a
> cacheline with another component that is modified by another process
> • NUMA: certain components should perhaps be mapped preferentially
> to memory pages that are close to the processes that access these
> components
> Currently, the sm BTL handles these issues by allocating each
> component of each CB its own page. (Actually, it couples tails and
> queues together.)
>
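> To make this concrete, here is a much-simplified, purely illustrative
> sketch of the pieces involved (the field names are made up; the real
> declarations are ompi_cb_fifo_wrapper_t and friends in
> ompi/class/ompi_circular_buffer_fifo.h):
>
>     /* Illustrative sketch only -- not the actual OMPI declaration.
>      * The wrapper does not embed its components; it points at them,
>      * and today each pointer refers to a separate page-aligned
>      * allocation in the shared-memory segment. */
>     typedef struct cb_sketch {
>         void *volatile *queue; /* FIFO entries: written by sender, read by receiver */
>         volatile int   *head;  /* write index: touched only by the sender           */
>         volatile int   *tail;  /* read index: touched only by the receiver          */
>         int             size;  /* number of entries in the queue                    */
>     } cb_sketch_t;
>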
> As the number of on-node processes grows, however, the shared-memory
> allocation skyrockets. E.g., let's say there are n processes on-node.
> There are therefore n(n-1) = O(n²) FIFOs, each with 3 allocations
> (wrapper, head, and tail/queue). The shared-memory allocation for CBs
> becomes 3n² pages. For large n, this dominates the shared-memory
> consumption, even though most of the CB allocation is unused. E.g., a
> 12-byte "head" ends up consuming a full memory page!
>
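> As a back-of-the-envelope check of the scale involved (a sketch only,
> not the byte-accurate model presented below; the page and cacheline
> sizes are assumed, and a queue can actually span several cachelines):
>
>     /* Rough estimate of CB overhead for n on-node processes. */
>     #include <stdio.h>
>
>     int main(void)
>     {
>         const long long n = 1024;                 /* on-node processes          */
>         const long long page = 4096;              /* assumed page size          */
>         const long long cacheline = 128;          /* assumed cacheline size     */
>         const long long allocs = 3 * n * (n - 1); /* wrapper, head, tail/queue
>                                                      per FIFO                   */
>
>         printf("page-aligned:      %lld MB\n", allocs * page      / 1000000);
>         printf("cacheline-aligned: %lld MB\n", allocs * cacheline / 1000000);
>         return 0;
>     }
>
> With n = 1024 and 4K pages this comes to roughly 12.9 GB of CB
> allocations, in the same ballpark as the measured numbers below.
>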
> Not only is the 3n²-page allocation large, but it is also not tunable
> via any MCA parameters.
>
> Large shared-memory consumption has led to a number of start-up and
> other user problems; see, e.g., the e-mail thread at
> http://www.open-mpi.org/community/lists/devel/2008/11/4882.php
>
> WHAT (details)
> Several actions are recommended here.
>
> 1. Cacheline Rather than Pagesize Alignment
> The first set of changes relaxes the alignment from pagesize to
> cacheline. Though mapping to pages is motivated by NUMA locality, note:
>
> • The code already has NUMA locality optimizations (maffinity and
> mpools) anyhow.
> • There is no data that I'm aware of substantiating the benefits of
> locality optimizations in this context.
> More to the point, I've tried some such experiments myself. I had
> two processes communicating via shared memory on a large SMP that
> had a large difference between remote and local memory access times.
> I timed the roundtrip latency for pingpongs between the processes.
> That time was correlated to the relative separation between the two
> processes, and not at all to the placement of the physical memory
> backing the shared variables. It did not matter if the memory was
> local to the sender or receiver or remote from both! (E.g., co-located
> processes showed fast timings even if the shared memory were remote to
> both processes.)
>
> My results do not prove that all NUMA platforms behave in the same
> way. My point is only that, though I understand the logic behind
> locality optimizations for FIFO placement, the only data I am aware
> of does not substantiate that logic.
>
> The changes are:
>
> • File: ompi/mca/mpool/sm/mpool_sm_module.c
> Function: mca_mpool_sm_alloc()
> Use the alignment requested by the caller rather than additionally
> imposing pagesize alignment.
>
> • File: ompi/class/ompi_fifo.h
> Function: ompi_fifo_init() and ompi_fifo_write_to_head()
> Align ompi_cb_fifo_wrapper_t structure on cacheline rather than page.
>
> • File: ompi/class/ompi_circular_buffer_fifo.h
> Function: ompi_cb_fifo_init()
> Align the two calls to mpool_alloc on cacheline rather than page.
>
> 2. Aggregated Allocation
> Another option is to lay out all the CBs at once and aggregate their
> allocations.
>
> This may have the added benefit of reducing lock contention during
> MPI_Init(). On the one hand, the 3n² CB allocations during MPI_Init()
> contend for a single mca_common_sm_mmap->map_seg->seg_lock lock. On the
> other hand, I know so far of no data showing that this lock contention
> is impairing start-up scalability.
>
> The objectives here would be to consolidate many CB components
> together subject to:
>
> • queues should be local to receivers
> • heads should be local to senders
> • tails should be local to receivers
> • wrappers should not share cachelines with heads or tails
> In sum, for process myrank, the FIFO allocation in shared memory
> during MPI_Init() looks something like this:
>
> ompi_fifo_t from 0 to myrank
> ompi_fifo_t from 1 to myrank
> ompi_fifo_t from 2 to myrank
> ...
> ompi_fifo_t from n-1 to myrank
> --- cacheline boundary ---
> queue of pointers, for CB from 0 to myrank
> queue of pointers, for CB from 1 to myrank
> queue of pointers, for CB from 2 to myrank
> ...
> queue of pointers, for CB from n-1 to myrank
> --- cacheline boundary ---
> head for CB from myrank to 0
> tail for CB from 0 to myrank
> head for CB from myrank to 1
> tail for CB from 1 to myrank
> head for CB from myrank to 2
> tail for CB from 2 to myrank
> ...
> head for CB from myrank to n-1
> tail for CB from n-1 to myrank
> --- cacheline boundary ---
> wrapper, CB from 0 to myrank
> wrapper, CB from 1 to myrank
> wrapper, CB from 2 to myrank
> ...
> wrapper, CB from n-1 to myrank
>
> Note that out-bound heads and in-bound tails are interwoven. There
> should be no false sharing, however, even if multiple heads and
> tails share the same cacheline, since they're all accessed by the
> same process.
> For multi-threaded processes, however, it is conceivable that
> different threads will be passing many messages to different peers.
> So, for multi-threaded jobs, we spread heads and tails out onto
> their own cachelines. Consuming the extra shared memory in the multi-
> threaded case is probably tolerable since the on-node-process count
> will be lower.
>
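> A sketch of how the aggregated per-process region above might be sized
> (hypothetical helper; the authoritative version is the proposed
> compute_initial_cb_fifo_space() in btl_sm.c, and the cacheline size and
> component sizes here are simply parameters):
>
>     #include <stddef.h>
>
>     #define CACHELINE 128   /* assumed cacheline size */
>     #define ALIGN_UP(x) (((x) + CACHELINE - 1) & ~(size_t)(CACHELINE - 1))
>
>     /* Hypothetical sketch: bytes of shared memory needed by one process
>      * for its n in-bound FIFOs, laid out as in the diagram above.  (For
>      * threaded builds, each head/tail would instead get its own
>      * cacheline.) */
>     size_t initial_cb_space_sketch(size_t n,            /* on-node processes     */
>                                    size_t fifo_size,    /* sizeof(ompi_fifo_t)   */
>                                    size_t queue_bytes,  /* queue entries * ptr   */
>                                    size_t head_tail,    /* size of a head/tail   */
>                                    size_t wrapper_size) /* CB wrapper size       */
>     {
>         size_t space = 0;
>         space += ALIGN_UP(n * fifo_size);       /* ompi_fifo_t, peers 0..n-1     */
>         space += ALIGN_UP(n * queue_bytes);     /* queues of pointers            */
>         space += ALIGN_UP(n * 2 * head_tail);   /* interleaved heads and tails   */
>         space += ALIGN_UP(n * wrapper_size);    /* CB wrappers                   */
>         return space;
>     }
>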
> The changes are:
>
> • File: ompi/class/ompi_circular_buffer_fifo.h
> Function: ompi_cb_fifo_init()
> Instead of having ompi_cb_fifo_init() allocate each component of a CB
> separately, change the interface to allow the caller to pass in an
> array of addresses. These addresses are assumed to be already
> allocated and are used for the CB components.
>
> If the "array of addresses" is NULL, then we allocate the CB
> components as before.
>
> Here is pseudocode to illustrate this change. We replace
>
>     fifo->head = mpool_alloc(...);          /* allocate head */
>     if ( NULL == fifo->head )
>         return OMPI_ERR_OUT_OF_RESOURCE;
>
> with
>
>     if ( NULL != cb_addr ) {
>         fifo->head = cb_addr[1];            /* use supplied address,
>                                                if available */
>     } else {
>         fifo->head = mpool_alloc(...);      /* allocate head */
>         if ( NULL == fifo->head )
>             return OMPI_ERR_OUT_OF_RESOURCE;
>     }
>
> • File: ompi/class/ompi_fifo.h
> Function: ompi_fifo_init()
> Change the interface to ompi_fifo_init() to allow the caller to
> supply addresses to use for CB components rather than having callees
> allocate memory for them.
>
> Use such a supplied address, if available, when allocating the FIFO
> (not CB) head.
>
> Function: ompi_fifo_write_to_head()
> Change its call to ompi_cb_fifo_init() to reflect the new interface,
> passing NULL as the new argument.
>
> • File: ompi/mca/btl/sm/btl_sm.c
> Function: compute_initial_cb_fifo_space() and
> compute_initial_cb_fifo_addrs()
> Add these two functions to compute how much space will be needed for
> the aggregated CB allocation and to compute addresses for individual
> CB components for a particular sender and receiver.
>
> Function: sm_btl_first_time_init()
>
> Increase the allocation of FIFOs (call to mpool_calloc) to include
> room for the CBs.
>
> Function: mca_btl_sm_add_procs()
>
> Compute the addresses where CB components should be and pass them
> into ompi_fifo_init().
>
> These changes impact only the allocation of CBs during MPI_Init().
> If FIFOs are grown later during program execution, they will
> continue to have components allocated in a fragmented manner.
>
> 3. Free List Return Codes
> This is unrelated to FIFOs, but is related to more robust handling
> of shared-memory allocation.
>
> The function sm_btl_first_time_init() should test the return values
> when it allocates free lists. It currently does not test return
> values, proceeding without a hiccup even if those allocations
> indicate an error. The proposed change is:
>
> • File: ompi/mca/btl/sm/btl_sm.c
> Function: sm_btl_first_time_init()
> Test the return codes from the calls to:
>
> ompi_free_list_init_new()
> ompi_free_list_init_new()
> opal_free_list_init()
>
> and return a non-successful status if any of them fails.
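>
> A sketch of the intended pattern only (not the real btl_sm code; the
> three init_*() helpers below are hypothetical stand-ins for the two
> ompi_free_list_init_new() calls and the opal_free_list_init() call):
>
>     #define SM_SUCCESS 0   /* stands in for OMPI_SUCCESS */
>
>     static int init_eager_frags(void)   { return SM_SUCCESS; } /* stand-in */
>     static int init_max_frags(void)     { return SM_SUCCESS; } /* stand-in */
>     static int init_pending_sends(void) { return SM_SUCCESS; } /* stand-in */
>
>     static int sm_first_time_init_sketch(void)
>     {
>         int rc;
>         /* Previously the return values were discarded; now each one is
>          * tested and propagated to the caller on failure. */
>         if (SM_SUCCESS != (rc = init_eager_frags()))   return rc;
>         if (SM_SUCCESS != (rc = init_max_frags()))     return rc;
>         if (SM_SUCCESS != (rc = init_pending_sends())) return rc;
>         return SM_SUCCESS;
>     }
>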
> 4. Better Automatic Sizing of mmap File
> Currently, the size of the file to be mmaped is governed by three
> MCA parameters:
>
> • mpool_sm_max_size
> • mpool_sm_min_size (default 128 Mbytes)
> • mpool_sm_per_peer_size (default 32 Mbytes)
> Specifically, the file size is
>
>     min(mpool_sm_max_size,
>         max(mpool_sm_min_size,
>             n * mpool_sm_per_peer_size))
> This file size is a poor approximation of the amount of shared memory
> an application actually needs during MPI_Init(). E.g., at n=2, the
> file is 128M even though less than 1M is needed. At large n, however,
> the file is too small.
>
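> For reference, here is a sketch of the current sizing rule in C (the
> helper name is made up, and treating mpool_sm_max_size of 0 as "no
> limit" is an assumption made for this sketch):
>
>     #include <stddef.h>
>
>     /* Sketch of the current rule: min(max, max(min, n * per_peer)). */
>     static size_t sm_file_size_sketch(size_t n,        /* on-node processes      */
>                                       size_t max_size, /* mpool_sm_max_size      */
>                                       size_t min_size, /* mpool_sm_min_size      */
>                                       size_t per_peer) /* mpool_sm_per_peer_size */
>     {
>         size_t size = n * per_peer;
>         if (size < min_size) {
>             size = min_size;
>         }
>         if (0 != max_size && size > max_size) {
>             size = max_size;
>         }
>         return size;
>     }
>
> With the defaults, n=2 yields the full 128M even though well under 1M
> is needed, while at large n the file can end up smaller than what
> MPI_Init() actually consumes.
>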
> Instead, we should add code that produces a better estimate of how
> much shared memory will be needed during MPI_Init().
>
> Regarding the MCA parameters:
>
> • mpool_sm_max_size: default should be 0 (no limit)
> • mpool_sm_min_size: default should be 0 (no limit)
> • mpool_sm_per_peer_size: should be eliminated
> More accurate sizing could help reduce the problems users see
> starting sm jobs with large on-node-process counts.
>
> One problem is that the size of the shared file is set by mpool_sm,
> but information about how much shared memory needs to be allocated
> during MPI_Init() is in btl_sm. Since OMPI doesn't easily allow
> components to call one another, we're stuck.
>
> Supporting Data
>
> Memory Consumption
> Memory consumption can be measured or modeled. (I have a byte-
> accurate model.)
>
> Here are some comparisons for the case of:
>
> • 1024 on-node processes
> • 8-byte pointers (LP64 execution model)
> • 4K eager limit
> • 32K chunk size
> • 128-byte cacheline size
> • 4K (Linux) or 8K (Solaris) page size
> Here are breakdowns of the shared-memory allocations during MPI_Init()
> in units of 10⁶ bytes:
>
>                       pagesize alignment     cacheline
>     description        8K pages   4K pages   alignment
>     ===============    ========   ========   =========
>     CB wrappers           8,682      4,391         235
>     CB queues+tails       9,822      5,531       1,374
>     CB heads              8,632      4,341         184
>     eager freelists       9,171      9,032       8,898
>     other                   370        362         355
>     ---------------    --------   --------   ---------
>     total                36,677     23,658      11,046
> That is, with pagesize alignment, the CB allocations consume over 3n²
> pages and dominate, even though most of that space is unused.
>
> The next biggest contributor is the eager freelists. There are 2n²
> eager fragments, each 4K (the eager limit), costing (approximately) 8
> Gbytes.
>
> With cacheline alignment:
>
> • Overall consumption drops by over 2× compared to 4K pages and
> over 3× compared to 8K pages.
> • The remaining leading category (eager freelists) can now be
> scaled down by adjusting an existing MCA parameter btl_sm_eager_limit.
> • For that matter, the second leading category (CB queues) can also
> be scaled down by adjusting an existing MCA parameter
> btl_sm_size_of_cb_queue.
> Here are results when we not only drop from page-size to cacheline
> alignment, but we also aggregate CB allocations:
>
>     10⁶ bytes   description
>     =========   ===============
>         1,250   FIFOs and CBs
>         8,899   eager freelists
>           270   max freelists
>     ---------   ---------------
>        10,418   total
> With no more pagesize dependence and little more cacheline
> dependence, one could really start to shoehorn big jobs into a small
> shared-memory area. E.g., consider bumping the eager limit down to
> 256 bytes, the size of a CB queue to 16 entries, and the chunk size
> to 8K. Then, shared-memory consumption for 1024 processes looks like
> this:
>
>     10⁶ bytes   description
>     =========   ===============
>           311   FIFOs and CBs
>           544   eager freelists
>            68   max freelists
>     ---------   ---------------
>           924   total
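>
> (As a usage note, and only as an assumption about how this would be
> driven: the two parameters named above could presumably be set on the
> command line in the usual way, e.g. "mpirun --mca btl_sm_eager_limit
> 256 --mca btl_sm_size_of_cb_queue 16 ..."; the chunk-size parameter is
> not named here, so it is omitted from the example.)
>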
> Ping-Pong Latency
> We can also look at performance. Here are OSU latency results for
> short messages on a Sun v20z. The absolute numbers are less
> important than the relative difference between the two sets:
>
> bytes before after
>
> 0 0.85 0.84 µsec
> 1 0.97 0.99
> 2 0.97 0.98
> 4 0.97 0.98
> 8 0.97 0.99
> There is a penalty for non-null messages due to OMPI "data
> convertors". Importantly, to within the reproducibility of the
> measurements, it is unclear if there is any slowdown that one can
> attribute to the changes. (Results are the median of 5 measurements.
> The values look smooth, but the error bars, which are difficult to
> characterize, are probably greater than the 0.01-0.02 µsec
> differences seen here.)
>
> Other Considerations
> Simply going from pagesize alignment to cacheline alignment should
> be a relatively unintrusive code change and effect most of the
> reduction in shared-memory allocation.
>
> Aggregating the allocations as well is more intrusive, but has a few
> more advantages, including:
>
> • Slight further reduction in the amount of shared memory allocated.
> • Less lock contention, as the number of allocations is reduced from
> O(n²) to O(n), possibly improving start-up performance.
> • Can provide memory locality even on systems that don't have
> maffinity support.
> It would be nice to size the mmap file automatically and more
> accurately than is done today, but (as noted) I haven't yet figured
> out how to make the btl_sm and mpool_sm components talk to each other.
>
> My proposed code changes need more testing, especially on systems with
> multiple memory (NUMA) nodes per host.
>
> It remains unclear to me if error codes are being treated properly
> in the mca_btl_sm_add_procs() code. E.g., if one process is unable
> to allocate memory in the shared area, should all processes fail?
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>

-- 
Jeff Squyres
Cisco Systems