RFC: Fragmented sm Allocations
RFC: Fragmented sm Allocations
WHAT: Dealing with the fragmented allocations of sm BTL FIFO
circular buffers (CB) during MPI_Init().
Also:
- Improve handling of error codes.
- Automate the sizing of the mmap file.
WHY: To reduce consumption of shared memory, making job startup
more robust, and possibly improving the scalability of startup.
WHERE: In mca_btl_sm_add_procs(), there is a loop
over calls to ompi_fifo_init(). This is where CBs are
initialized one at a time, components of a CB allocated individually.
Changes can be seen in
ssh://www.open-mpi.org/~eugene/hg/sm-allocation.
WHEN: Upon acceptance.
TIMEOUT: January 30, 2009.
WHY (details)
The sm BTL establishes a FIFO for each non-self, on-node
connection. Each FIFO is initialized during MPI_Init()
with a circular buffer (CB). (More CBs can be added later in
program execution if a FIFO runs out of room.)
A CB has different components that are used in different ways:
- The "wrapper" is read by both sender and receiver,
but is rarely written.
- The "queue" (FIFO entries) is accessed by both the
sender and receiver.
- The "head" is accessed by the sender.
- The "tail" is accessed by the receiver.
For performance reasons, a CB is not allocated as one large data structure.
Rather, these components are laid out separately in memory and the
wrapper has pointers to the various locations. Performance
considerations include:
- false sharing: a component used by one process should
not share a cacheline with another component that is
modified by another process
- NUMA: certain components should perhaps be mapped
preferentially to memory pages that are close to the
processes that access these components
Currently, the sm BTL handles these issues by allocating
each component of each CB its own page. (Actually, it couples
tails and queues together.)
As the number of on-node processes grows, however, the shared-memory
allocation skyrockets. E.g., let's say there are n processes
on-node. There are therefore n(n-1) = O(n2)
FIFOs, each with 3 allocations (wrapper, head, and tail/queue). The
shared-memory allocation for CBs becomes 3n2 pages.
For large n, this dominates the shared-memory consumption,
even though most of the CB allocation is unused. E.g., a 12-byte
"head" ends up consuming a full memory page!
Not only is the 3n2-page allocation large, but it
is also not tunable via any MCA parameters.
Large shared-memory consumption has led to some number of
start-up and other user problems. E.g., the e-mail thread at
http://www.open-mpi.org/community/lists/devel/2008/11/4882.php.
WHAT (details)
Several actions are recommended here.
1. Cacheline Rather than Pagesize Alignment
The first set of changes reduces pagesize to cacheline alignment.
Though mapping to pages is motivated by NUMA locality, note:
- The code already has NUMA locality optimizations
(maffinity and mpools) anyhow.
- There is no data that I'm aware of substantiating the
benefits of locality optimizations in this context.
|
More to the point, I've tried some such experiments myself.
I had two processes communicating via shared memory on a
large SMP that had a large difference between remote and local
memory access times. I timed the roundtrip latency for
pingpongs between the processes. That time was correlated
to the relative separation between the two processes, and
not at all to the placement of the physical memory backing
the shared variables. It did not matter if the memory was
local to the sender or receiver or remote from both! (E.g.,
colocal processes showed fast timings even if the shared
memory were remote to both processes.)
My results do not prove that all NUMA platforms behave in the
same way. My point is only that, though I understand the
logic behind locality optimizations for FIFO placement, the
only data I am aware of does not substantiate that logic.
|
The changes are:
- File: ompi/mca/mpool/sm/mpool_sm_module.c
Function: mca_mpool_sm_alloc()
Use the alignment requested by the caller rather than
adding additional pagesize alignment as well.
- File: ompi/class/ompi_fifo.h
Function: ompi_fifo_init()
and ompi_fifo_write_to_head()
Align ompi_cb_fifo_wrapper_t structure on cacheline rather than page.
- File: ompi/class/ompi_circular_buffer_fifo.h
Function: ompi_cb_fifo_init()
Align the two calls to mpool_alloc on cacheline rather than page.
2. Aggregated Allocation
Another option is to lay out all the CBs at once and aggregate
their allocations.
This may have the added benefit of reducing lock contention during
MPI_Init(). On the one hand, the 3n2 CB
allocations during MPI_Init() contend for a single
mca_common_sm_mmap->map_seg->seg_lock lock. On the other
hand, I know so far of no data showing that this lock contention
is impairing start-up scalability.
The objectives here would be to consolidate many CB components together
subject to:
- queues should be local to receivers
- heads should be local to senders
- tails should be local to receivers
- wrappers should not share cachelines with heads or tails
In sum, for process myrank, the FIFO allocation in shared
memory during MPI_Init() looks something like this:
ompi_fifo_t from 0 to myrank
ompi_fifo_t from 1 to myrank
ompi_fifo_t from 2 to myrank
...
ompi_fifo_t from n-1 to myrank
--- cacheline boundary ---
queue of pointers, for CB from 0 to myrank
queue of pointers, for CB from 1 to myrank
queue of pointers, for CB from 2 to myrank
...
queue of pointers, for CB from n-1 to myrank
--- cacheline boundary ---
head for CB from myrank to 0
tail for CB from 0 to myrank
head for CB from myrank to 1
tail for CB from 1 to myrank
head for CB from myrank to 2
tail for CB from 2 to myrank
...
head for CB from myrank to n-1
tail for CB from n-1 to myrank
--- cacheline boundary ---
wrapper, CB from 0 to myrank
wrapper, CB from 1 to myrank
wrapper, CB from 2 to myrank
...
wrapper, CB from n-1 to myrank
|
Note that out-bound heads and in-bound tails are interwoven. There
should be no false sharing, however, even if multiple heads and tails
share the same cacheline, since they're all accessed by the same process.
For multi-threaded processes, however, it is conceivable that different
threads will be passing many messages to different peers. So, for
multi-threaded jobs, we spread heads and tails out onto their own
cachelines. Consuming the extra shared memory in the multi-threaded
case is probably tolerable since the on-node-process count will be
lower.
|
The changes are:
- File: ompi/class/ompi_circular_buffer_fifo.h
Function: ompi_cb_fifo_init()
Instead of having ompi_cb_fifo_init() allocate each
component of a CB separately, change the interface to allow
the caller function to pass an array of addresses in. These
addresses will be assumed to be allocated and should be used
for the CB components.
If the "array of addresses" is NULL, then we allocate
the CB components as before.
Here is pseudocode to illustrate this change. We replace
fifo->head = mpool_alloc(...); /* allocate head */
if ( NULL == fifo->head )
return OMPI_ERR_OUT_OF_RESOURCE;
with
if ( NULL != cb_addr ) {
fifo->head = cb_addr[1]; /* use supplied address, if available */
} else {
fifo->head = mpool_alloc(...); /* allocate head */
if ( NULL == fifo->head )
return OMPI_ERR_OUT_OF_RESOURCE;
}
- File: ompi/class/ompi_fifo.h
Function: ompi_fifo_init()
Change the interface to ompi_fifo_init() to allow
the caller to supply addresses to use for CB components
rather than having callees allocate memory for them.
Use such a supplied address, if available, when allocating
the FIFO (not CB) head.
Function: ompi_fifo_init()
Change the call in ompi_fifo_write_to_head()
to ompi_cb_fifo_init() to reflect the new
interface, passing NULL as the new argument.
- File: ompi/mca/btl/sm/btl_sm.c
Function: compute_initial_cb_fifo_space() and
compute_initial_cb_fifo_addrs()
Add these two functions to compute how much space will be
needed for the aggregated CB allocation and to compute
addresses for individual CB components for a particular
sender and receiver.
Function: sm_btl_first_time_init()
Increase the allocation of FIFOs (call to mpool_calloc)
to include room for the CBs.
Function: mca_btl_sm_add_procs()
Compute the addresses where CB components should be and
pass them into ompi_fifo_init().
These changes impact only the allocation of CBs during MPI_Init().
If FIFOs are grown later during program execution, they will continue
to have components allocated in a fragmented manner.
3. Free List Return Codes
This is unrelated to FIFOs, but is related to more robust handling
of shared-memory allocation.
The function sm_btl_first_time_init() should test the return
values when it allocates free lists. It currently does not test return
values, proceeding without a hiccup even if those allocations indicate
an error. The proposed change is:
4. Better Automatic Sizing of mmap File
Currently, the size of the file to be mmaped is governed
by three MCA parameters:
- mpool_sm_max_size
- mpool_sm_min_size
(default 128 Mbytes)
- mpool_sm_per_peer_size
(default 32 Mbytes)
Specifically, the file size is
min(mpool_sm_max_size,
max(mpool_sm_min_size,
n * mpool_sm_per_peer_size))
This file size is a poor approximation for the actual amount of
shared memory needed by an application during MPI_Init().
E.g., at n=2, the file is 128M even though less than 1M is needed.
At large n, however, the file is insufficiently small.
Instead, we should add code that produces a better estimate of how
much shared memory will be needed during MPI_Init().
Regarding the MCA parameters:
- mpool_sm_max_size: default should be 0 (no limit)
- mpool_sm_min_size: default should be 0 (no limit)
- mpool_sm_per_peer_size: should be eliminated
More accurate sizing could help reduce the problems users see
starting sm jobs with large on-node-process counts.
One problem is that the size of the shared file is set by
mpool_sm, but information about how much shared memory
needs to be allocated during MPI_Init() is in btl_sm.
Since OMPI doesn't easily allow components to call one another,
we're stuck.
Supporting Data
Memory Consumption
Memory consumption can be measured or modeled.
(I have a byte-accurate model.)
Here are some comparisons for the case of:
- 1024 on-node processes
- 8-byte pointers (LP64 execution model)
- 4K eager limit
- 32K chunk size
- 128-byte cacheline size
- 4K (Linux) or 8K (Solaris) page size
Here are breakdowns of the shared-memory allocations during MPI_Init()
in units of 106 bytes:
pagesize alignment cacheline
------------------ alignment
description 8K pages 4K pages
===============
CB wrappers 8,682 4,391 235
CB queues+tails 9,822 5,531 1,374
CB heads 8,632 4,341 184
eager freelists 9,171 9,032 8,898
other 370 362 355
---------------
total 36,677 23,658 11,046
That is, with pagesize alignment, the CB allocations consume over
3n2 pages and dominate, even though most of that
space is unused.
The next biggest contributor is the eager freelists. There are
2n2 eager fragments, each 4K (the eager limit),
costing (approximately) 8 Gbytes.
With cacheline alignment:
- Overall consumption drops by over 2× compared to
4K pages and over 3× compared to 8K pages.
- The remaining leading category (eager freelists) can
now be scaled down by adjusting an existing MCA parameter
btl_sm_eager_limit.
- For that matter, the second leading category (CB queues)
can also be scaled down by adjusting an existing MCA
parameter btl_sm_size_of_cb_queue.
Here are results when we not only drop from page-size to
cacheline alignment, but we also aggregate CB allocations:
106 bytes description
========= ===============
1,250 FIFOs and CBs
8,899 eager freelists
270 max freelists
------ ---------------
10,418 total
With no more pagesize dependence and little more cacheline
dependence, one could really start to shoehorn big jobs into
a small shared-memory area. E.g., consider bumping the eager
limit down to 256 bytes, the size of a CB queue to 16 entries,
and the chunk size to 8K. Then, shared-memory consumption for
1024 processes looks like this:
106 bytes description
========= ===============
311 FIFOs and CBs
544 eager freelists
68 max freelists
------ ---------------
924 total
Ping-Pong Latency
We can also look at performance. Here are OSU latency results
for short messages on a Sun v20z. The absolute numbers are
less important than the relative difference between the two
sets:
bytes before after
0 0.85 0.84 µsec
1 0.97 0.99
2 0.97 0.98
4 0.97 0.98
8 0.97 0.99
There is a penalty for non-null messages due to OMPI "data convertors".
Importantly, to within the reproducibility of the measurements, it
is unclear if there is any slowdown that one can attribute to the
changes. (Results are the median of 5 measurements. The values
look smooth, but the error bars, which are difficult to characterize,
are probably greater than the 0.01-0.02 µsec differences seen
here.)
Other Considerations
Simply going from pagesize alignment to cacheline alignment
should be a relatively unintrusive code change and effect most
of the reduction in shared-memory allocation.
Also aggregating allocations is more intrusive, but has a few more
advantages, including:
- Slight further reduction in the amount of shared memory
allocated.
- Less lock contention as the number of allocation is
reduced from O(n2) to O(n),
possibly improving start-up performance.
- Can provide memory locality even on systems that
don't have maffinity support.
It would be nice to size the mmap file automatically
better than what is done today, but (as noted) I haven't yet
figured out how to make the btl_sm and mpool_sm
components talk to each other.
My proposed code changes need more testing, especially in the
case of multiple memory nodes per node.
It remains unclear to me if error codes are being treated
properly in the mca_btl_sm_add_procs() code. E.g.,
if one process is unable to allocate memory in the shared
area, should all processes fail?
|