FAQ:
Tuning the run-time characteristics of MPI shared memory communications

This FAQ is for Open MPI v4.x and earlier.
If you are looking for documentation for Open MPI v5.x and later, please visit docs.open-mpi.org.

Table of contents:

What is the vader BTL?
What is the sm BTL?
How do I specify use of sm for MPI messages?
How does the sm BTL work?
Why does my MPI job no longer start when there are too many processes on one node?
How do I know what MCA parameters are available for tuning MPI performance?
How can I tune these parameters to improve performance?
Where is the file that sm will mmap in?
Why am I seeing incredibly poor performance with the sm BTL?
Can I use SysV instead of mmap?
How much shared memory will my job use?
How much shared memory do I need?
How can I decrease my shared-memory usage?

1. What is the vader BTL?

The vader BTL is a low-latency, high-bandwidth mechanism for transferring data between two processes via shared memory. This BTL can only be used between processes executing on the same node.

Beginning with the v1.8 series, the vader BTL replaces the sm BTL unless the local system lacks the required support or the user specifically requests the latter. At this time, vader requires CMA (Cross-Memory Attach) support, which is typically found in more recent kernels. Thus, systems based on older kernels may default to the slower sm BTL.

2. What is the sm BTL?

The sm BTL (shared-memory Byte Transfer Layer) is a low-latency, high-bandwidth mechanism for transferring data between two processes via shared memory. This BTL can only be used between processes executing on the same node.

The sm BTL has high exclusivity. That is, if one process can reach another process via sm, then no other BTL will be considered for that connection.

Note that with Open MPI v1.3.2, the sm so-called "FIFOs" were reimplemented and the sizing of the shared-memory area was changed. So, much of this FAQ will distinguish between releases up to Open MPI v1.3.1 and releases starting with Open MPI v1.3.2.

3. How do I specify use of sm for MPI messages?

Typically, it is unnecessary to do so; OMPI will use the best BTL available for each communication.

Nevertheless, you may use the MCA parameter btl. You should also specify the self BTL for communications between a process and itself. Furthermore, if not all processes in your job will run on the same, single node, then you also need to specify a BTL for internode communications. For example:

shell$ mpirun --mca btl self,sm,tcp -np 16 ./a.out
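MCA parameters can also be supplied through the environment: Open MPI reads any variable named OMPI_MCA_<parameter>. The following is a minimal sketch of the same BTL selection without the command-line flag; the echo is only there to show the value that was set.

```shell
# Same selection as "--mca btl self,sm,tcp", set through the environment.
export OMPI_MCA_btl=self,sm,tcp

# Any mpirun started from this shell now inherits the setting, e.g.:
#   mpirun -np 16 ./a.out
echo "OMPI_MCA_btl=${OMPI_MCA_btl}"
```

A value given on the mpirun command line takes precedence over the environment variable, so the two forms can be mixed when experimenting.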

4. How does the sm BTL work?

A point-to-point user message is broken up by the PML into fragments. The sm BTL only has to transfer individual fragments. The steps are:

The sender pulls a shared-memory fragment out of one of its free lists. Each process has one free list for smaller (e.g., 4Kbyte) eager fragments and another free list for larger (e.g., 32Kbyte) max fragments.
The sender packs the user-message fragment into this shared-memory fragment, including any header information.
The sender posts a pointer to this shared fragment into the appropriate FIFO (first-in-first-out) queue of the receiver.
The receiver polls its FIFO(s). When it finds a new fragment pointer, it unpacks data out of the shared-memory fragment and notifies the sender that the shared fragment is ready for reuse (to be returned to the sender's free list).

On each node where an MPI job has two or more processes running, the job creates a file that each process mmaps into its address space. Shared-memory resources that the job needs — such as FIFOs and fragment free lists — are allocated from this shared-memory area.

5. Why does my MPI job no longer start when there are too many processes on one node?

If you are using Open MPI v1.3.1 or earlier, it is possible that the shared-memory area set aside for your job was not created large enough. Make sure you're running in 64-bit mode (compiled with -m64) and set the MCA parameter mpool_sm_max_size to be very large — even several Gbytes. Exactly how large is discussed further below.

Regardless of which OMPI release you're using, make sure that there is sufficient space for a large file to back the shared memory — typically in /tmp.
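A quick way to check the available space is sketched below. It assumes the session directory lives under /tmp and uses roughly 128 Mbytes as the per-node requirement, which is only the rough figure quoted elsewhere in this FAQ; the free_kbytes helper name is made up for this example.

```shell
# Free space, in Kbytes, on the filesystem holding a directory (POSIX df).
free_kbytes() { df -Pk "$1" | awk 'NR == 2 { print $4 }'; }

need_kb=$(( 128 * 1024 ))   # ~128 Mbytes: a rough per-node figure, not a guarantee
have_kb=$(free_kbytes /tmp)

if [ "$have_kb" -lt "$need_kb" ]; then
    echo "/tmp has only ${have_kb} KB free; the shared-memory file may not fit"
else
    echo "/tmp has ${have_kb} KB free"
fi
```

Adjust the directory if your session directory has been moved elsewhere.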

6. How do I know what MCA parameters are available for tuning MPI performance?

The ompi_info command can display all the parameters available for the sm BTL and sm mpool:

shell$ ompi_info --param btl sm
shell$ ompi_info --param mpool sm

7. How can I tune these parameters to improve performance?

For the most part, the default values of the MCA parameters have already been chosen to give good performance. Improving performance further is something of an art, and sometimes it is a matter of trading performance against memory.

btl_sm_eager_limit: If message data plus header information fits within this limit, the message is sent "eagerly" — that is, a sender attempts to write its entire message to shared buffers without waiting for a receiver to be ready. Above this size, a sender will only write the first part of a message, then wait for the receiver to acknowledge its readiness before continuing. Eager sends can improve performance by decoupling senders from receivers.

btl_sm_max_send_size: Large messages are sent in fragments of this size. Larger fragments can lead to greater efficiencies, though they could perhaps also inhibit pipelining between sender and receiver.

btl_sm_num_fifos: Starting in Open MPI v1.3.2, this is the number of FIFOs per receiving process. By default, there is only one FIFO per process. Conceivably, if many senders are all sending to the same process and contending for a single FIFO, there will be congestion. If there are many FIFOs, however, the receiver must poll more FIFOs to find incoming messages. Therefore, you might try increasing this parameter slightly if you have many processes (at least dozens) all sending to the same process. For example, if 100 senders are all contending for a single FIFO for a particular receiver, it may suffice to increase btl_sm_num_fifos from 1 to 2.

btl_sm_fifo_size: Starting in Open MPI v1.3.2, FIFOs could no longer grow. If you believe the FIFO is getting congested because a process falls far behind in reading incoming message fragments, increase this size manually.

btl_sm_free_list_num: This is the initial number of fragments on each (eager and max) free list. The free lists can grow in response to resource congestion, but you can increase this parameter to pre-reserve space for more fragments.

mpool_sm_min_size: You can reserve headroom for the shared-memory area to grow by increasing this parameter.
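As an illustration of the many-senders case described under btl_sm_num_fifos, the sketch below assembles an mpirun command line with two FIFOs per receiver and a larger FIFO (both parameters apply to Open MPI v1.3.2 and later). The values, the process count, and the MCA_ARGS variable name are assumptions made for this example, not recommended settings. The leading echo only prints the command.

```shell
# Illustrative values only; check the current settings with ompi_info first.
MCA_ARGS="--mca btl_sm_num_fifos 2 --mca btl_sm_fifo_size 8192"

# Print the command line; remove "echo" to launch the job for real.
echo mpirun $MCA_ARGS -np 100 ./a.out
```

Change one parameter at a time and re-measure, since a gain in one place (less FIFO contention) can cost something elsewhere (more FIFOs to poll, more memory).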

8. Where is the file that sm will mmap in?

The file will be in the OMPI session directory, which is typically something like /tmp/openmpi-sessions-myusername@mynodename/*. The file itself will have the name shared_mem_pool.mynodename. For example, the full path could be /tmp/openmpi-sessions-myusername@node0_0/1543/1/shared_mem_pool.node0.

To place the session directory in a non-default location, use the MCA parameter orte_tmpdir_base.

9. Why am I seeing incredibly poor performance with the sm BTL?

The most common problem with the shared memory BTL arises when the Open MPI session directory is placed on a network filesystem (e.g., if /tmp is not on a local disk). The shared memory BTL places a memory-mapped file in the Open MPI session directory (see "Where is the file that sm will mmap in?" above for details), so if the session directory is located on a network filesystem, the shared memory BTL latency will be extremely high.

Try not mounting /tmp as a network filesystem, and/or moving the Open MPI session directory to a local filesystem.
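One way to check what kind of filesystem backs /tmp is sketched below. It assumes GNU stat is available; the fs_class helper and its list of network filesystem names are made up for this example and are not exhaustive.

```shell
# Classify a filesystem type name as "network", "local", or "unknown".
# The list of network types is illustrative, not exhaustive.
fs_class() {
    case "$1" in
        nfs*|cifs|smb*|lustre|gpfs|afs|ceph*) echo network ;;
        unknown) echo unknown ;;
        *) echo local ;;
    esac
}

dir=/tmp
# GNU stat: "-f -c %T" prints the filesystem type in human-readable form.
fstype=$(stat -f -c %T "$dir" 2>/dev/null || echo unknown)
echo "$dir is on a $fstype filesystem ($(fs_class "$fstype"))"
```

If the result is "network", the session directory can be relocated to a local filesystem with the MCA parameter orte_tmpdir_base.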

Some users have reported success and possible performance optimizations with having /tmp mounted as a "tmpfs" filesystem (i.e., a RAM-based filesystem). However, before configuring your system this way, be aware of a few items:

Open MPI writes a few small metadata files into /tmp and may therefore consume some extra memory that could otherwise have been used for application instructions or data.
If you use the "filem" system in Open MPI for moving executables between nodes, these files are stored under /tmp.
Open MPI's checkpoint / restart files can also be saved under /tmp.
If the Open MPI job is terminated abnormally, there are some circumstances where files (including memory-mapped shared memory files) can be left in /tmp. This can happen, for example, when a resource manager forcibly kills an Open MPI job and does not give it the chance to clean up /tmp files and directories.

Some users have reported success with configuring their resource manager to run a script between jobs to forcibly empty the /tmp directory.

10. Can I use SysV instead of mmap?

In the v1.3 and v1.4 Open MPI series, shared memory is established via mmap. In future releases, there may be an option for using SysV shared memory.

11. How much shared memory will my job use?

Your job will create a shared-memory area on each node where it has two or more processes. This area will be fixed during the lifetime of your job. Shared-memory allocations (for FIFOs and fragment free lists) will be made in this area. Here, we look at the size of that shared-memory area.

If you want just one hard number, then go with approximately 128 Mbytes per node per job, shared by all the job's processes on that node. That is, an OMPI job will need more than a few Mbytes per node, but typically less than a few Gbytes.

Better yet, read on.

Up through Open MPI v1.3.1, the shared-memory file was sized essentially as follows:

nbytes = n * mpool_sm_per_peer_size
if ( nbytes < mpool_sm_min_size ) nbytes = mpool_sm_min_size
if ( nbytes > mpool_sm_max_size ) nbytes = mpool_sm_max_size

where n is the number of processes in the job running on that particular node and the mpool_sm_* are MCA parameters. For small n, this size is typically excessive. For large n (e.g., 128 MPI processes on the same node), this size may not be sufficient for the job to start.
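The sizing rule above can be tried out with a small shell function. This is only a sketch: the 32 MB, 128 MB, and 512 MB settings are hypothetical values chosen for illustration (query ompi_info for the real ones on your installation), and sm_file_size is a made-up helper name.

```shell
# sm_file_size N PER_PEER MIN MAX
# Size in bytes of the shared-memory file, up through Open MPI v1.3.1.
sm_file_size() (
    nbytes=$(( $1 * $2 ))
    if [ "$nbytes" -lt "$3" ]; then nbytes=$3; fi   # clamp to the minimum
    if [ "$nbytes" -gt "$4" ]; then nbytes=$4; fi   # clamp to the maximum
    echo "$nbytes"
)

MB=$(( 1024 * 1024 ))
# Hypothetical settings: 32 MB per peer, 128 MB minimum, 512 MB maximum.
for n in 2 8 128; do
    echo "n=$n: $(sm_file_size "$n" $(( 32 * MB )) $(( 128 * MB )) $(( 512 * MB ))) bytes"
done
```

With these values, 2 processes are raised to the 128 MB minimum, 8 processes get 256 MB, and 128 processes are capped at 512 MB even though n * mpool_sm_per_peer_size asks for 4 GB.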

Starting in OMPI v1.3.2, a more sophisticated formula was introduced to model more closely how much memory is actually needed. That formula is somewhat complicated and subject to change, but it guarantees that there will be at least enough shared memory for the program to start up and run. See "How much shared memory do I need?" below for how much is needed. Alternatively, the motivated user can examine the OMPI source code to see the formula used, for example as of OMPI commit 463f11f.

OMPI v1.3.2 also uses the MCA parameter mpool_sm_min_size to set a minimum size — e.g., so that there is not only enough shared memory for the job to start, but additionally headroom for further shared-memory allocations (e.g., of more eager or max fragments).

Once the shared-memory area is established, it will not grow further during the course of the MPI job's run.

12. How much shared memory do I need?

In most cases, OMPI will start your job with sufficient shared memory.

Nevertheless, if OMPI doesn't get you enough shared memory (e.g., you're using OMPI v1.3.1 or earlier with roughly 128 processes or more on a single node) or you want to trim shared-memory consumption, you may want to know how much shared memory is really needed.

As we saw earlier, the shared memory area contains:

FIFOs
eager fragments
max fragments

In general, you need only enough shared memory for the FIFOs and fragments that are allocated during MPI_Init().

Beyond that, you might want additional shared memory for performance reasons, so that FIFOs and fragment lists can grow if your program's message traffic encounters resource congestion. Even if there is no room to grow, however, your correctly written MPI program should still run to completion in the face of congestion; performance simply degrades somewhat. Note that while shared-memory resources can grow after MPI_Init(), they cannot shrink.

So, how much shared memory is needed during MPI_Init()? You need approximately the total of:

FIFOs:
- (≤ Open MPI v1.3.1): 3 × n × n × pagesize
- (≥ Open MPI v1.3.2): n × btl_sm_num_fifos × btl_sm_fifo_size × sizeof(void *)
eager fragments: n × ( 2 × n + btl_sm_free_list_inc ) × btl_sm_eager_limit
max fragments: n × btl_sm_free_list_num × btl_sm_max_send_size

where:

n is the number of MPI processes in your job on the node
pagesize is the OS page size (typically 4 KB on Linux and 8 KB on Solaris)
btl_sm_* are MCA parameters
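The estimate can be worked through with a short script. This is a sketch only: every parameter value below is an assumption chosen for illustration (use ompi_info to read the values on your installation), an 8-byte pointer is assumed for sizeof(void *), and sm_init_bytes is a made-up helper name.

```shell
# Illustrative parameter values (assumptions; query ompi_info for real ones).
PAGESIZE=4096        # OS page size in bytes
PTR_SIZE=8           # sizeof(void *) on a 64-bit system
NUM_FIFOS=1          # btl_sm_num_fifos
FIFO_SIZE=4096       # btl_sm_fifo_size
FREE_LIST_NUM=8      # btl_sm_free_list_num
FREE_LIST_INC=64     # btl_sm_free_list_inc
EAGER_LIMIT=4096     # btl_sm_eager_limit
MAX_SEND_SIZE=32768  # btl_sm_max_send_size

# sm_init_bytes N [old|new]
# Approximate bytes needed during MPI_Init() for N processes on a node.
# "old" uses the FIFO term for v1.3.1 and earlier, "new" for v1.3.2 and later.
sm_init_bytes() (
    n=$1
    if [ "${2:-new}" = "old" ]; then
        fifos=$(( 3 * $n * $n * $PAGESIZE ))
    else
        fifos=$(( $n * $NUM_FIFOS * $FIFO_SIZE * $PTR_SIZE ))
    fi
    eager=$(( $n * (2 * $n + $FREE_LIST_INC) * $EAGER_LIMIT ))
    max=$(( $n * $FREE_LIST_NUM * $MAX_SEND_SIZE ))
    echo $(( $fifos + $eager + $max ))
)

echo "16 processes, v1.3.2 or later:   $(sm_init_bytes 16 new) bytes"
echo "16 processes, v1.3.1 or earlier: $(sm_init_bytes 16 old) bytes"
```

With these assumed values, 16 processes on a node need about 10.5 Mbytes under the newer FIFO formula and about 13 Mbytes under the older one.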

13. How can I decrease my shared-memory usage?

There are two parts to this question.

First, how does one reduce how big the mmap file is? The answer is:

Up to Open MPI v1.3.1: Reduce mpool_sm_per_peer_size, mpool_sm_min_size, and mpool_sm_max_size
Starting with Open MPI v1.3.2: Reduce mpool_sm_min_size

Second, how does one reduce how much shared memory is needed? (Just making the mmap file smaller doesn't help if then your job won't start up.) The answers are:

For small values of n — that is, for few processes per node — shared-memory usage during MPI_Init() is predominantly for max free lists. So, you can reduce the MCA parameter btl_sm_max_send_size. Alternatively, you could reduce btl_sm_free_list_num, but it is already pretty small by default.
For large values of n — that is, for many processes per node — there are two cases:
- Up to Open MPI v1.3.1: Shared-memory usage is dominated by the FIFOs, which consume a certain number of pages. Usage is high and cannot be reduced much via MCA parameter tuning.
- Starting with Open MPI v1.3.2: Shared-memory usage is dominated by the eager free lists. So, you can reduce the MCA parameter btl_sm_eager_limit.
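To see how much the eager free lists respond to btl_sm_eager_limit when there are many processes per node, the eager-fragment term n × ( 2 × n + btl_sm_free_list_inc ) × btl_sm_eager_limit can be evaluated directly. The sketch below assumes 128 processes on the node and a btl_sm_free_list_inc of 64; both numbers and the eager_bytes helper name are illustrative only.

```shell
# eager_bytes N FREE_LIST_INC EAGER_LIMIT
# Eager free-list term: n x (2 x n + btl_sm_free_list_inc) x btl_sm_eager_limit
eager_bytes() { echo $(( $1 * (2 * $1 + $2) * $3 )); }

# Assumed values: 128 processes on the node, btl_sm_free_list_inc = 64.
echo "eager limit 4096: $(eager_bytes 128 64 4096) bytes"
echo "eager limit 2048: $(eager_bytes 128 64 2048) bytes"
```

Under these assumptions, halving the eager limit halves this term, from 160 Mbytes to 80 Mbytes. The trade-off is that fewer messages qualify for eager sends, so more of them wait for the receiver before completing.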
