
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] RFC: Fragmented sm Allocations
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-01-14 08:54:18


I haven't reviewed the code either, but really do appreciate someone
taking the time for such a thorough analysis of the problems we have
all observed for some time!

Thanks Eugene!!

On Jan 14, 2009, at 5:05 AM, Tim Mattox wrote:

> Great analysis and suggested changes! I've not had a chance yet
> to look at your hg branch, so this isn't a code review... Barring a
> bad code review, I'd say these changes should all go in the trunk
> for inclusion in 1.4.
>
> 2009/1/14 Eugene Loh <Eugene.Loh_at_[hidden]>:
>>
>>
>> RFC: Fragmented sm Allocations
>>
>> WHAT: Dealing with the fragmented allocations of sm BTL FIFO circular
>> buffers (CB) during MPI_Init().
>>
>> Also:
>>
>> Improve handling of error codes.
>> Automate the sizing of the mmap file.
>>
>> WHY: To reduce consumption of shared memory, making job startup
>> more robust,
>> and possibly improving the scalability of startup.
>>
>> WHERE: In mca_btl_sm_add_procs(), there is a loop over calls to
>> ompi_fifo_init(). This is where CBs are initialized one at a time, with
>> the components of each CB allocated individually. Changes can be seen in
>> ssh://www.open-mpi.org/~eugene/hg/sm-allocation.
>>
>> WHEN: Upon acceptance.
>>
>> TIMEOUT: January 30, 2009.
>>
>> ________________________________
>>
>> WHY (details)
>>
>> The sm BTL establishes a FIFO for each non-self, on-node
>> connection. Each
>> FIFO is initialized during MPI_Init() with a circular buffer (CB).
>> (More CBs
>> can be added later in program execution if a FIFO runs out of room.)
>>
>> A CB has different components that are used in different ways:
>>
>> The "wrapper" is read by both sender and receiver, but is rarely
>> written.
>> The "queue" (FIFO entries) is accessed by both the sender and
>> receiver.
>> The "head" is accessed by the sender.
>> The "tail" is accessed by the receiver.
>>
>> For performance reasons, a CB is not allocated as one large data
>> structure.
>> Rather, these components are laid out separately in memory and the
>> wrapper
>> has pointers to the various locations. Performance considerations
>> include:
>>
>> false sharing: a component used by one process should not share a
>> cacheline
>> with another component that is modified by another process
>> NUMA: certain components should perhaps be mapped preferentially to
>> memory
>> pages that are close to the processes that access these components
>>
>> Currently, the sm BTL handles these issues by allocating each
>> component of
>> each CB its own page. (Actually, it couples tails and queues
>> together.)
>>
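>> As a concrete illustration (a simplified sketch only, not the actual
>> ompi_cb_fifo_wrapper_t definition in ompi_circular_buffer_fifo.h), the
>> wrapper essentially holds pointers to components that are allocated
>> elsewhere:
>>
>>     /* Illustrative only; the real structures differ in detail. */
>>     struct example_cb_fifo {
>>         void          **queue; /* FIFO entries: written by the sender,
>>                                 * read by the receiver                 */
>>         volatile int   *head;  /* next slot to write; sender only      */
>>         volatile int   *tail;  /* next slot to read; receiver only     */
>>         int             size;  /* number of entries in the queue       */
>>     };
>>     /* Today the mpool gives the queue (with the tail) and the head
>>      * each their own page-aligned allocation, which is what makes a
>>      * single CB so expensive in shared memory. */
>>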
>> As the number of on-node processes grows, however, the shared-memory
>> allocation skyrockets. E.g., let's say there are n processes on-node.
>> There are therefore n(n-1) = O(n²) FIFOs, each with 3 allocations
>> (wrapper, head, and tail/queue). The shared-memory allocation for CBs
>> becomes 3n² pages. For large n, this dominates the shared-memory
>> consumption, even though most of the CB allocation is unused. E.g., a
>> 12-byte "head" ends up consuming a full memory page!
>>
>> Not only is the 3n²-page allocation large, but it is also not tunable
>> via any MCA parameters.
>>
>> Large shared-memory consumption has led to a number of start-up and
>> other user problems; see, e.g., the e-mail thread at
>> http://www.open-mpi.org/community/lists/devel/2008/11/4882.php.
>>
>> WHAT (details)
>>
>> Several actions are recommended here.
>>
>> 1. Cacheline Rather than Pagesize Alignment
>>
>> The first set of changes reduces alignment from pagesize to cacheline.
>> Though mapping each component to its own page is motivated by NUMA
>> locality, note:
>>
>> The code already has NUMA locality optimizations (maffinity and
>> mpools)
>> anyhow.
>> There is no data that I'm aware of substantiating the benefits of
>> locality
>> optimizations in this context.
>>
>> More to the point, I've tried some such experiments myself. I had two
>> processes communicating via shared memory on a large SMP that had a
>> large
>> difference between remote and local memory access times. I timed the
>> roundtrip latency for pingpongs between the processes. That time was
>> correlated to the relative separation between the two processes,
>> and not at
>> all to the placement of the physical memory backing the shared
>> variables. It
>> did not matter if the memory was local to the sender or receiver or
>> remote
>> from both! (E.g., colocal processes showed fast timings even if the
>> shared
>> memory were remote to both processes.)
>>
>> My results do not prove that all NUMA platforms behave in the same
>> way. My
>> point is only that, though I understand the logic behind locality
>> optimizations for FIFO placement, the only data I am aware of does
>> not
>> substantiate that logic.
>>
>> The changes are:
>>
>> File: ompi/mca/mpool/sm/mpool_sm_module.c
>> Function: mca_mpool_sm_alloc()
>>
>> Use the alignment requested by the caller rather than adding
>> additional
>> pagesize alignment as well.
>>
>> File: ompi/class/ompi_fifo.h
>> Function: ompi_fifo_init() and ompi_fifo_write_to_head()
>>
>> Align ompi_cb_fifo_wrapper_t structure on cacheline rather than page.
>>
>> File: ompi/class/ompi_circular_buffer_fifo.h
>> Function: ompi_cb_fifo_init()
>>
>> Align the two calls to mpool_alloc on cacheline rather than page.
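>>
>> As a sketch of the spirit of the change (the two-argument mpool_alloc()
>> form below is shorthand, as in the pseudocode later in this RFC, and
>> CACHE_LINE_SIZE stands for whatever cacheline-size constant the code
>> uses):
>>
>>     /* before (sketch): each component is padded out to its own page */
>>     fifo->head = mpool_alloc(sizeof(*fifo->head), getpagesize());
>>
>>     /* after (sketch): cacheline alignment is enough; NUMA placement is
>>      * already handled by maffinity/mpools as noted above */
>>     fifo->head = mpool_alloc(sizeof(*fifo->head), CACHE_LINE_SIZE);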
>>
>> 2. Aggregated Allocation
>>
>> Another option is to lay out all the CBs at once and aggregate their
>> allocations.
>>
>> This may have the added benefit of reducing lock contention during
>> MPI_Init(). On the one hand, the 3n² CB allocations during MPI_Init()
>> contend for a single mca_common_sm_mmap->map_seg->seg_lock lock. On the
>> other hand, so far I know of no data showing that this lock contention
>> impairs start-up scalability.
>>
>> The objectives here would be to consolidate many CB components
>> together
>> subject to:
>>
>> queues should be local to receivers
>> heads should be local to senders
>> tails should be local to receivers
>> wrappers should not share cachelines with heads or tails
>>
>> In sum, for process myrank, the FIFO allocation in shared memory
>> during
>> MPI_Init() looks something like this:
>>
>> ompi_fifo_t from 0 to myrank
>> ompi_fifo_t from 1 to myrank
>> ompi_fifo_t from 2 to myrank
>> ...
>> ompi_fifo_t from n-1 to myrank
>> --- cacheline boundary ---
>> queue of pointers, for CB from 0 to myrank
>> queue of pointers, for CB from 1 to myrank
>> queue of pointers, for CB from 2 to myrank
>> ...
>> queue of pointers, for CB from n-1 to myrank
>> --- cacheline boundary ---
>> head for CB from myrank to 0
>> tail for CB from 0 to myrank
>> head for CB from myrank to 1
>> tail for CB from 1 to myrank
>> head for CB from myrank to 2
>> tail for CB from 2 to myrank
>> ...
>> head for CB from myrank to n-1
>> tail for CB from n-1 to myrank
>> --- cacheline boundary ---
>> wrapper, CB from 0 to myrank
>> wrapper, CB from 1 to myrank
>> wrapper, CB from 2 to myrank
>> ...
>> wrapper, CB from n-1 to myrank
>>
>> Note that out-bound heads and in-bound tails are interwoven. There
>> should be
>> no false sharing, however, even if multiple heads and tails share
>> the same
>> cacheline, since they're all accessed by the same process.
>>
>> For multi-threaded processes, however, it is conceivable that
>> different
>> threads will be passing many messages to different peers. So, for
>> multi-threaded jobs, we spread heads and tails out onto their own
>> cachelines. Consuming the extra shared memory in the multi-threaded
>> case is
>> probably tolerable since the on-node-process count will be lower.
>>
>> The changes are:
>>
>> File: ompi/class/ompi_circular_buffer_fifo.h
>> Function: ompi_cb_fifo_init()
>>
>> Instead of having ompi_cb_fifo_init() allocate each component of a CB
>> separately, change the interface to allow the caller function to
>> pass an
>> array of addresses in. These addresses will be assumed to be
>> allocated and
>> should be used for the CB components.
>>
>> If the "array of addresses" is NULL, then we allocate the CB
>> components as
>> before.
>>
>> Here is pseudocode to illustrate this change. We replace
>>
>>     fifo->head = mpool_alloc(...);    /* allocate head */
>>     if ( NULL == fifo->head )
>>         return OMPI_ERR_OUT_OF_RESOURCE;
>>
>> with
>>
>>     if ( NULL != cb_addr ) {
>>         fifo->head = cb_addr[1];          /* use supplied address, if available */
>>     } else {
>>         fifo->head = mpool_alloc(...);    /* allocate head */
>>         if ( NULL == fifo->head )
>>             return OMPI_ERR_OUT_OF_RESOURCE;
>>     }
>>
>>
>> File: ompi/class/ompi_fifo.h
>> Function: ompi_fifo_init()
>>
>> Change the interface to ompi_fifo_init() to allow the caller to
>> supply
>> addresses to use for CB components rather than having callees
>> allocate
>> memory for them.
>>
>> Use such a supplied address, if available, when allocating the FIFO
>> (not CB)
>> head.
>>
>> Function: ompi_fifo_write_to_head()
>>
>> Change the call in ompi_fifo_write_to_head() to ompi_cb_fifo_init()
>> to
>> reflect the new interface, passing NULL as the new argument.
>>
>> File: ompi/mca/btl/sm/btl_sm.c
>> Function: compute_initial_cb_fifo_space() and
>> compute_initial_cb_fifo_addrs()
>>
>> Add these two functions to compute how much space will be needed for
>> the aggregated CB allocation and to compute addresses for individual
>> CB components for a particular sender and receiver (see the sketch
>> after this list of changes).
>>
>> Function: sm_btl_first_time_init()
>>
>> Increase the allocation of FIFOs (call to mpool_calloc) to include
>> room for
>> the CBs.
>>
>> Function: mca_btl_sm_add_procs()
>>
>> Compute the addresses where CB components should be and pass them
>> into
>> ompi_fifo_init().
>>
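>> A minimal sketch of what the space computation might look like (the
>> function name and rounding details below are mine, not necessarily what
>> the hg branch does; the layout follows the diagram above):
>>
>>     /* Shared-memory space needed for one receiver's share of the
>>      * aggregated CB allocation: n queues, n interwoven head/tail pairs,
>>      * and n wrappers, each region padded to a cacheline boundary.
>>      * Assumes cacheline is a power of two.  The enlarged mpool_calloc
>>      * call in sm_btl_first_time_init() would add this on top of the
>>      * space for the ompi_fifo_t array itself. */
>>     static size_t example_cb_fifo_space(size_t n, size_t queue_bytes,
>>                                         size_t headtail_bytes,
>>                                         size_t wrapper_bytes,
>>                                         size_t cacheline)
>>     {
>>         size_t space = 0;
>>         space += n * queue_bytes;          /* queues of pointers      */
>>         space  = (space + cacheline - 1) & ~(cacheline - 1);
>>         space += n * 2 * headtail_bytes;   /* heads and tails, packed */
>>         space  = (space + cacheline - 1) & ~(cacheline - 1);
>>         space += n * wrapper_bytes;        /* wrappers                */
>>         return space;
>>     }
>>
>> compute_initial_cb_fifo_addrs() would then hand out addresses at the
>> corresponding offsets within this region for a given sender/receiver
>> pair (padding each head/tail to its own cacheline in the multi-threaded
>> case).
>>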
>> These changes impact only the allocation of CBs during MPI_Init().
>> If FIFOs
>> are grown later during program execution, they will continue to have
>> components allocated in a fragmented manner.
>>
>> 3. Free List Return Codes
>>
>> This is unrelated to FIFOs, but is related to more robust handling of
>> shared-memory allocation.
>>
>> The function sm_btl_first_time_init() should test the return values
>> when it
>> allocates free lists. It currently does not test return values,
>> proceeding
>> without a hiccup even if those allocations indicate an error. The
>> proposed
>> change is:
>>
>> File: ompi/mca/btl/sm/btl_sm.c
>> Function: sm_btl_first_time_init()
>>
>> Test the return codes from the calls to:
>>
>> ompi_free_list_init_new()
>> ompi_free_list_init_new()
>> opal_free_list_init()
>>
>>
>> returning non-successful error status in case of error.
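>>
>> A minimal sketch of the intended pattern (argument lists are elided
>> here; the actual calls and their arguments in sm_btl_first_time_init()
>> stay as they are, only their results become checked):
>>
>>     int rc;
>>
>>     rc = ompi_free_list_init_new( /* ...arguments unchanged... */ );
>>     if (OMPI_SUCCESS != rc) {
>>         return rc;    /* today this value is silently dropped */
>>     }
>>
>>     rc = opal_free_list_init( /* ...arguments unchanged... */ );
>>     if (OPAL_SUCCESS != rc) {
>>         return rc;
>>     }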
>>
>> 4. Better Automatic Sizing of mmap File
>>
>> Currently, the size of the file to be mmaped is governed by three MCA
>> parameters:
>>
>> mpool_sm_max_size
>> mpool_sm_min_size (default 128 Mbytes)
>> mpool_sm_per_peer_size (default 32 Mbytes)
>>
>> Specifically, the file size is
>>
>> min(mpool_sm_max_size,
>> max(mpool_sm_min_size,
>> n * mpool_sm_per_peer_size))
>>
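>> Or, restated as code (a sketch only; the actual mpool_sm code differs
>> in detail):
>>
>>     size_t file_size = n * mpool_sm_per_peer_size;
>>     if (file_size < mpool_sm_min_size) file_size = mpool_sm_min_size;
>>     if (file_size > mpool_sm_max_size) file_size = mpool_sm_max_size;
>>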
>> This file size is a poor approximation of the actual amount of shared
>> memory needed by an application during MPI_Init(). E.g., at n=2, the
>> file is 128M even though less than 1M is needed. At large n, however,
>> the file is too small.
>>
>> Instead, we should add code that produces a better estimate of how
>> much
>> shared memory will be needed during MPI_Init().
>>
>> Regarding the MCA parameters:
>>
>> mpool_sm_max_size: default should be 0 (no limit)
>> mpool_sm_min_size: default should be 0 (no limit)
>> mpool_sm_per_peer_size: should be eliminated
>>
>> More accurate sizing could help reduce the problems users see
>> starting sm
>> jobs with large on-node-process counts.
>>
>> One problem is that the size of the shared file is set by mpool_sm,
>> but
>> information about how much shared memory needs to be allocated during
>> MPI_Init() is in btl_sm. Since OMPI doesn't easily allow components
>> to call
>> one another, we're stuck.
>>
>> Supporting Data
>>
>> Memory Consumption
>>
>> Memory consumption can be measured or modeled. (I have a byte-
>> accurate
>> model.)
>>
>> Here are some comparisons for the case of:
>>
>> 1024 on-node processes
>> 8-byte pointers (LP64 execution model)
>> 4K eager limit
>> 32K chunk size
>> 128-byte cacheline size
>> 4K (Linux) or 8K (Solaris) page size
>>
>> Here are breakdowns of the shared-memory allocations during MPI_Init()
>> in units of 10⁶ bytes:
>>
>>                       pagesize alignment    cacheline
>>                      --------------------   alignment
>> description            8K pages   4K pages
>> ===============
>> CB wrappers               8,682      4,391        235
>> CB queues+tails           9,822      5,531      1,374
>> CB heads                  8,632      4,341        184
>> eager freelists           9,171      9,032      8,898
>> other                       370        362        355
>> ---------------
>> total                    36,677     23,658     11,046
>>
>> That is, with pagesize alignment, the CB allocations consume over 3n²
>> pages and dominate, even though most of that space is unused.
>>
>> The next biggest contributor is the eager freelists. There are 2n²
>> eager fragments, each 4K (the eager limit), costing approximately
>> 8 Gbytes (2 × 1024² × 4K ≈ 8.6 × 10⁹ bytes).
>>
>> With cacheline alignment:
>>
>> Overall consumption drops by over 2× compared to 4K pages and over 3×
>> compared to 8K pages.
>> The remaining leading category (eager freelists) can now be scaled
>> down by
>> adjusting an existing MCA parameter btl_sm_eager_limit.
>> For that matter, the second leading category (CB queues) can also
>> be scaled
>> down by adjusting an existing MCA parameter btl_sm_size_of_cb_queue.
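>>
>> For example, both knobs could be turned down at run time with something
>> like this (the values here are only illustrative):
>>
>>     mpirun --mca btl_sm_eager_limit 2048 \
>>            --mca btl_sm_size_of_cb_queue 64 \
>>            -np 1024 ./a.out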
>>
>> Here are results when we not only drop from page-size to cacheline
>> alignment, but we also aggregate CB allocations:
>>
>> 10⁶ bytes   description
>> =========   ===============
>>     1,250   FIFOs and CBs
>>     8,899   eager freelists
>>       270   max freelists
>> ---------   ---------------
>>    10,418   total
>>
>> With no remaining pagesize dependence and only a little cacheline
>> dependence, one could really start to shoehorn big jobs into a small
>> shared-memory area.
>> E.g., consider bumping the eager limit down to 256 bytes, the size
>> of a CB
>> queue to 16 entries, and the chunk size to 8K. Then, shared-memory
>> consumption for 1024 processes looks like this:
>>
>> 10⁶ bytes   description
>> =========   ===============
>>       311   FIFOs and CBs
>>       544   eager freelists
>>        68   max freelists
>> ---------   ---------------
>>       924   total
>>
>> Ping-Pong Latency
>>
>> We can also look at performance. Here are OSU latency results for
>> short
>> messages on a Sun v20z. The absolute numbers are less important
>> than the
>> relative difference between the two sets:
>>
>> bytes before after
>>
>> 0 0.85 0.84 µsec
>> 1 0.97 0.99
>> 2 0.97 0.98
>> 4 0.97 0.98
>> 8 0.97 0.99
>>
>> There is a penalty for non-null messages due to OMPI "data
>> convertors".
>> Importantly, to within the reproducibility of the measurements, it is
>> unclear if there is any slowdown that one can attribute to the
>> changes.
>> (Results are the median of 5 measurements. The values look smooth,
>> but the
>> error bars, which are difficult to characterize, are probably
>> greater than
>> the 0.01-0.02 µsec differences seen here.)
>>
>> Other Considerations
>>
>> Simply going from pagesize alignment to cacheline alignment should
>> be a
>> relatively unintrusive code change and effect most of the reduction
>> in
>> shared-memory allocation.
>>
>> Also aggregating allocations is more intrusive, but has a few more
>> advantages, including:
>>
>> Slight further reduction in the amount of shared memory allocated.
>> Less lock contention, as the number of allocations is reduced from
>> O(n²) to O(n), possibly improving start-up performance.
>> Can provide memory locality even on systems that don't have maffinity
>> support.
>>
>> It would be nice to size the mmap file automatically and more
>> accurately than is done today, but (as noted) I haven't yet figured out
>> how to make the btl_sm and mpool_sm components talk to each other.
>>
>> My proposed code changes need more testing, especially in the case of
>> multiple memory (NUMA) nodes per host.
>>
>> It remains unclear to me if error codes are being treated properly
>> in the
>> mca_btl_sm_add_procs() code. E.g., if one process is unable to
>> allocate
>> memory in the shared area, should all processes fail?
>>
>
>
>
> --
> Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
> tmattox_at_[hidden] || timattox_at_[hidden]
> I'm a bright... http://www.the-brights.net/
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel