
FAQ:
Rollup of ALL FAQ categories and questions


Table of contents:

  1. What is MPI? What is Open MPI?
  2. Where can I learn about MPI? Are there tutorials available?
  3. What are the goals of the Open MPI Project?
  4. Will you allow external involvement?
  5. How is this software licensed?
  6. I want to redistribute Open MPI. Can I?
  7. Preventing forking is a goal; how will you enforce that?
  8. How are 3rd party contributions handled?
  9. Is this just YAMPI (yet another MPI implementation)?
  10. But I love [FT-MPI | LA-MPI | LAM/MPI | PACX-MPI]! Why should I use Open MPI?
  11. What will happen to the prior projects?
  12. What operating systems does Open MPI support?
  13. What hardware platforms does Open MPI support?
  14. What network interconnects does Open MPI support?
  15. What run-time environments does Open MPI support?
  16. Does Open MPI support LSF?
  17. How much MPI does Open MPI support?
  18. Is Open MPI thread safe?
  19. Does Open MPI support 64 bit environments?
  20. Does Open MPI support execution in heterogeneous environments?
  21. Does Open MPI support parallel debuggers?
  22. Can I contribute to Open MPI?
  23. I found a bug! How do I report it?
  24. What license is Open MPI distributed under?
  25. How do I contribute code to Open MPI?
  26. I can't submit an Open MPI Third Party Contribution Agreement; how can I contribute to Open MPI?
  27. What if I don't want my contribution to be free / open source?
  28. I want to fork the Open MPI code base. Can I?
  29. Rats! My contribution was not accepted into the main Open MPI code base. What now?
  30. Open MPI terminology
  31. How do I get a copy of the most recent source code?
  32. Ok, I got a Subversion checkout. Now how do I build it?
  33. What is the main tree layout of the Open MPI source tree? Are there directory name conventions?
  34. Is there more information available?
  35. More coming...
  36. I'm a sysadmin; what do I care about Open MPI?
  37. What hardware / software / run-time environments / networks does Open MPI support?
  38. Do I need multiple Open MPI installations?
  39. What are MCA Parameters? Why would I set them?
  40. Do my users need to have their own installation of Open MPI?
  41. I have power users who will want to override my global MCA parameters; is this possible?
  42. What MCA parameters should I, the system administrator, set?
  43. I just added a new plugin to my Open MPI installation; do I need to recompile all my MPI apps?
  44. I just upgraded my Myrinet|Infiniband network; do I need to recompile all my MPI apps?
  45. We just upgraded our version of Open MPI; do I need to recompile all my MPI apps?
  46. I have an MPI application compiled for another MPI; will it work with Open MPI?
  47. What is "fault tolerance"?
  48. What fault tolerance techniques does Open MPI plan on supporting?
  49. Does Open MPI support checkpoint and restart of parallel jobs (similar to LAM/MPI)?
  50. Where can I find the fault tolerance development work?
  51. Does Open MPI support end-to-end data reliability in MPI message passing?
  52. How do I build Open MPI?
  53. Wow -- I see a lot of errors during configure. Is that normal?
  54. What are the default build options for Open MPI?
  55. Open MPI was pre-installed on my machine; should I overwrite it with a new version?
  56. Where should I install Open MPI?
  57. Should I install a new version of Open MPI over an old version?
  58. Can I disable Open MPI's use of plugins?
  59. How do I build an optimized version of Open MPI?
  60. Are VPATH and/or parallel builds supported?
  61. Do I need any special tools to build Open MPI?
  62. How do I build Open MPI as a static library?
  63. When I run 'make', it looks very much like the build system is going into a loop.
  64. Configure issues warnings about sed and unterminated commands
  65. Open MPI configured ok, but I get "Makefile:602: *** missing separator" kinds of errors when building
  66. Open MPI seems to default to building with the GNU compiler set. Can I use other compilers?
  67. Can I pass specific flags to the compilers / linker used to build Open MPI?
  68. I'm trying to build with the Intel compilers, but Open MPI eventually fails to compile with really long error messages. What do I do?
  69. When I build with the Intel compiler suite, linking user MPI applications with the wrapper compilers results in warning messages. What do I do?
  70. I'm trying to build with the IBM compilers, but Open MPI eventually fails to compile. What do I do?
  71. I'm trying to build with the Oracle Solaris Studio (Sun) compilers on Linux, but Open MPI eventually fails to compile. What do I do?
  72. What configure options should I use when building with the Oracle Solaris Studio (Sun) compilers?
  73. When building with the Oracle Solaris Studio 12 Update 1 (Sun) compilers on x86 Linux, the compiler loops on btl_sm.c. Is there a workaround?
  74. How do I build Open MPI on IBM QS22 cell blade machines with GCC and XLC/XLF compilers?
  75. I'm trying to build with the PathScale 3.0 and 3.1 compilers on Linux, but all Open MPI commands seg fault. What do I do?
  76. All MPI C++ API functions return errors (or otherwise fail) when Open MPI is compiled with the PathScale compilers. What do I do?
  77. How do I build Open MPI with support for Open IB (Infiniband), mVAPI (Infiniband), GM (Myrinet), and/or MX (Myrinet)?
  78. How do I build Open MPI with support for SLURM / XGrid?
  79. How do I build Open MPI with support for SGE?
  80. How do I build Open MPI with support for PBS Pro / Open PBS / Torque?
  81. How do I build Open MPI with support for LoadLeveler?
  82. How do I build Open MPI with support for Platform LSF?
  83. How do I build Open MPI with processor affinity support?
  84. How do I build Open MPI with memory affinity / NUMA support (e.g., libnuma)?
  85. How do I build Open MPI with CUDA-aware support?
  86. How do I not build a specific plugin / component for Open MPI?
  87. What other options to [configure] exist?
  88. Why does compiling the Fortran 90 bindings take soooo long?
  89. Does Open MPI support MPI_REAL16 and MPI_COMPLEX32?
  90. Can I re-locate my Open MPI installation without re-configuring/re-compiling/re-installing from source?
  91. I'm still having problems / my problem is not listed here. What do I do?
  92. In general, how do I build MPI applications with Open MPI?
  93. Wait -- what is mpifort? Shouldn't I use mpif77 and mpif90?
  94. I can't / don't want to use Open MPI's wrapper compilers. What do I do?
  95. How do I override the flags specified by Open MPI's wrapper compilers? (v1.0 series)
  96. How do I override the flags specified by Open MPI's wrapper compilers? (v1.1 series and beyond)
  97. How can I tell what the wrapper compiler default flags are?
  98. Why does "mpicc --showme <some flags>" not show any MPI-relevant flags?
  99. Are there ways to just add flags to the wrapper compilers?
  100. Why don't the wrapper compilers add "-rpath" (or similar) flags by default?
  101. Can I build 100% static MPI applications?
  102. Can I build 100% static OpenFabrics / OpenIB / OFED MPI applications on Linux?
  103. Why does it take soooo long to compile F90 MPI applications?
  104. How do I build BLACS with Open MPI?
  105. How do I build ScaLAPACK with Open MPI?
  106. How do I build PETSc with Open MPI?
  107. How do I build VASP with Open MPI?
  108. Are other language / application bindings available for Open MPI?
  109. What pre-requisites are necessary for running an Open MPI job?
  110. What ABI guarantees does Open MPI provide?
  111. Do I need a common filesystem on all my nodes?
  112. How do I add Open MPI to my PATH and LD_LIBRARY_PATH?
  113. What if I can't modify my PATH and/or LD_LIBRARY_PATH?
  114. How do I launch Open MPI parallel jobs?
  115. How do I run a simple SPMD MPI job?
  116. How do I run an MPMD MPI job?
  117. How do I specify the hosts on which my MPI job runs?
  118. I can run ompi_info and launch MPI jobs on a single host, but not across multiple hosts. Why?
  119. How can I diagnose problems when running across multiple hosts?
  120. When I build Open MPI with the Intel compilers, I get warnings about "orted" or my MPI application not finding libimf.so. What do I do?
  121. When I build Open MPI with the PGI compilers, I get warnings about "orted" or my MPI application not finding libpgc.so. What do I do?
  122. When I build Open MPI with the Pathscale compilers, I get warnings about "orted" or my MPI application not finding libmv.so. What do I do?
  123. Can I run non-MPI programs with mpirun / mpiexec?
  124. Can I run GUI applications with Open MPI?
  125. Can I run ncurses-based / curses-based / applications with funky input schemes with Open MPI?
  126. What other options are available to mpirun?
  127. How do I use the --hostfile option to mpirun?
  128. How do I use the --host option to mpirun?
  129. How do I control how my processes are scheduled across nodes?
  130. I'm not using a hostfile. How are slots calculated?
  131. Can I run multiple parallel processes on a uniprocessor machine?
  132. Can I oversubscribe nodes (run more processes than processors)?
  133. Can I force Aggressive or Degraded performance modes?
  134. How do I run with the TotalView parallel debugger?
  135. How do I run with the DDT parallel debugger?
  136. What launchers are available?
  137. How do I specify to the rsh launcher to use rsh or ssh?
  138. How do I run with the SLURM and PBS/Torque launchers?
  139. Can I suspend and resume my job?
  140. How do I run with LoadLeveler?
  141. How do I load libmpi at runtime?
  142. What MPI environmental variables exist?
  143. How do I get my MPI job to wireup its MPI connections right away?
  144. What kind of CUDA support exists in Open MPI?
  145. Open MPI tells me that it fails to load components with a "file not found" error -- but the file is there! Why does it say this?
  146. I see strange messages about missing symbols in my application; what do these mean?
  147. What is mca_pml_teg.so? Why am I getting warnings about not finding the mca_ptl_base_modules_initialized symbol from it?
  148. Can I build shared libraries on AIX with the IBM XL compilers?
  149. Why am I getting a seg fault in libopal?
  150. Why am I getting seg faults / MPI parameter errors when compiling C++ applications with the Intel 9.1 C++ compiler?
  151. All my MPI applications segv! Why? (Intel Linux 12.1 compiler)
  152. Why can't I attach my parallel debugger (TotalView, DDT, fx2, etc.) to parallel jobs?
  153. When launching large MPI jobs, I see messages like: mca_oob_tcp_peer_complete_connect: connection failed: Connection timed out (110) - retrying
  154. How do I find out what MCA parameters are being seen/used by my job?
  155. How do I debug Open MPI processes in parallel?
  156. What tools are available for debugging in parallel?
  157. How do I run with parallel debuggers?
  158. What controls does Open MPI have that aid in debugging?
  159. Do I need to build Open MPI with compiler/linker debugging flags (such as -g) to be able to debug MPI applications?
  160. Can I use serial debuggers (such as gdb) to debug MPI applications?
  161. My process dies without any output. Why?
  162. What is Memchecker?
  163. What kind of errors can Memchecker find?
  164. How can I use Memchecker?
  165. How to run my MPI application with Memchecker?
  166. Does Memchecker cause performance degradation to my application?
  167. Is Open MPI 'Valgrind-clean' or how can I identify real errors?
  168. Can I make Open MPI use rsh instead of ssh?
  169. What pre-requisites are necessary for running an Open MPI job under rsh/ssh?
  170. How can I make ssh not ask me for a password?
  171. What is a .rhosts file? Do I need it?
  172. Should I use + in my .rhosts file?
  173. What versions of BProc does Open MPI work with?
  174. What pre-requisites are necessary for running an Open MPI job under BProc?
  175. How do I run jobs under Torque / PBS Pro?
  176. Does Open MPI support Open PBS?
  177. How does Open MPI get the list of hosts from Torque / PBS Pro?
  178. What happens if $PBS_NODEFILE is modified?
  179. Can I specify a hostfile or use the --host option to mpirun when running in a Torque / PBS environment?
  180. How do I run with the SGE launcher?
  181. Does the SGE tight integration support the -notify flag to qsub?
  182. Can I suspend and resume my job?
  183. How do I run jobs under SLURM?
  184. Does Open MPI support "srun -n X my_mpi_application"?
  185. I use SLURM on a cluster with the OpenFabrics network stack. Do I need to do anything special?
  186. Any issues with Slurm 2.6.3?
  187. How do I reduce startup time for jobs on large clusters?
  188. Where should I put my libraries: Network vs. local filesystems?
  189. Static vs shared libraries?
  190. How do I reduce the time to wireup OMPI's out-of-band communication system?
  191. Why is my job failing because of file descriptor limits?
  192. I know my cluster's configuration - how can I take advantage of that knowledge?
  193. What is the Modular Component Architecture (MCA)?
  194. What are MCA parameters?
  195. What frameworks are in Open MPI?
  196. What frameworks are in Open MPI v1.2 (and prior)?
  197. What frameworks are in Open MPI v1.3?
  198. How do I know what components are in my Open MPI installation?
  199. How do I install my own components into an Open MPI installation?
  200. How do I know what MCA parameters are available?
  201. How do I set the value of MCA parameters?
  202. What are Aggregate MCA (AMCA) parameter files?
  203. How do I select which components are used?
  204. What is processor affinity? Does Open MPI support it?
  205. What is memory affinity? Does Open MPI support it?
  206. How do I tell Open MPI to use processor and/or memory affinity?
  207. How do I tell Open MPI to use processor and/or memory affinity in Open MPI v1.2.x? (What is mpi_paffinity_alone?)
  208. How do I tell Open MPI to use processor and/or memory affinity in Open MPI v1.3.x? (What are rank files?)
  209. How do I tell Open MPI to use processor and/or memory affinity in Open MPI v1.4.x? (How do I use the --by* and --bind-to-* options?)
  210. How do I tell Open MPI to use processor and/or memory affinity in Open MPI v1.5.x?
  211. Does Open MPI support calling fork(), system(), or popen() in MPI processes?
  212. I want to run some performance benchmarks with Open MPI. How do I do that?
  213. I am getting a MPI_Win_free error from IMB-EXT -- what do I do?
  214. What is the sm BTL?
  215. How do I specify use of sm for MPI messages?
  216. How does the sm BTL work?
  217. Why does my MPI job no longer start when there are too many processes on one node?
  218. How do I know what MCA parameters are available for tuning MPI performance?
  219. How can I tune these parameters to improve performance?
  220. Where is the file that sm will mmap in?
  221. Why am I seeing incredibly poor performance with the sm BTL?
  222. Can I use SysV instead of mmap?
  223. How much shared memory will my job use?
  224. How much shared memory do I need?
  225. How can I decrease my shared-memory usage?
  226. How do I specify to use the TCP network for MPI messages?
  227. But wait -- I'm using a high-speed network. Do I have to disable the TCP BTL?
  228. How do I know what MCA parameters are available for tuning MPI performance?
  229. Does Open MPI use the TCP loopback interface?
  230. I have multiple TCP networks on some/all of my cluster nodes. Which ones will Open MPI use?
  231. I'm getting TCP-related errors. What do they mean?
  232. How do I tell Open MPI which TCP interfaces / networks to use?
  233. Does Open MPI open a bunch of sockets during MPI_INIT?
  234. Are there any Linux kernel TCP parameters that I should set?
  235. How does Open MPI know which TCP addresses are routable to each other in Open MPI 1.2?
  236. How does Open MPI know which TCP addresses are routable to each other in Open MPI 1.3 (and beyond)?
  237. Does Open MPI ever close TCP sockets?
  238. Does Open MPI support IP interfaces that have more than one IP address?
  239. Does Open MPI support virtual IP interfaces?
  240. What Myrinet-based components does Open MPI have?
  241. How do I specify to use the Myrinet GM network for MPI messages?
  242. How do I specify to use the Myrinet MX network for MPI messages?
  243. But wait -- I also have a TCP network. Do I need to explicitly disable the TCP BTL?
  244. How do I know what MCA parameters are available for tuning MPI performance?
  245. I'm experiencing a problem with Open MPI on my Myrinet-based network; how do I troubleshoot and get help?
  246. How do I adjust the MX first fragment size? Are there constraints?
  247. What versions of Open MPI contain support for uDAPL?
  248. What is different between Sun Microsystems ClusterTools 7 and Open MPI in regards to the uDAPL BTL?
  249. What values are expected to be used by the btl_udapl_if_include and btl_udapl_if_exclude mca parameter?
  250. Where is the static uDAPL Registry found?
  251. How come the value reported by "ifconfig" is not accepted by the btl_udapl_if_include/btl_udapl_if_exclude MCA parameter?
  252. I get a warning message about not being able to register memory and possibly out of privileged memory while running on Solaris, what can I do?
  253. What is special about MPI performance analysis?
  254. What are "profiling" and "tracing"?
  255. How do I sort out busy wait time from idle wait, user time from system time, and so on?
  256. What is PMPI?
  257. Should I use those switches --enable-mpi-profile and --enable-trace when I configure OMPI?
  258. What support does OMPI have for performance analysis?
  259. How do I view VampirTrace output?
  260. Are there MPI performance analysis tools for OMPI that I can download for free?
  261. Any other kinds of tools I should know about?
  262. How does Open MPI handle HFS+ / UFS filesystems?
  263. How do I use the Open MPI wrapper compilers in XCode?
  264. How do I run jobs under XGrid?
  265. Where do I get more information about running under XGrid?
  266. Is Open MPI included in OS X?
  267. How do I not use the OS X-bundled Open MPI?
  268. Is AIX a supported operating system for Open MPI?
  269. Does Open MPI work on AIX?
  270. What is VampirTrace?
  271. Where can I find the complete documentation of VampirTrace?
  272. How to instrument my MPI application with VampirTrace?
  273. Does VampirTrace cause overhead to my application?
  274. How can I change the underlying compiler of the mpi*-vt wrappers?
  275. How can I pass VampirTrace related configure options through the Open MPI configure?
  276. How to disable the integrated VampirTrace, completely?
  277. v1.7 Series


1. What is MPI? What is Open MPI?

MPI stands for the Message Passing Interface. Written by the MPI Forum (a large committee comprising a cross-section of industry and research representatives), MPI is a standardized API typically used for parallel and/or distributed computing. The MPI standard comprises two documents: MPI-1 (published in 1994) and MPI-2 (published in 1997). MPI-2 is, for the most part, a set of additions and extensions to the original MPI-1 specification.

The MPI-1 and MPI-2 documents can be downloaded from the official MPI Forum web site: http://www.mpi-forum.org/.

Open MPI is an open source, freely available implementation of both the MPI-1 and MPI-2 documents. The Open MPI software achieves high performance; the Open MPI project is quite receptive to community input.


2. Where can I learn about MPI? Are there tutorials available?

There are many resources available on the internet for learning MPI.

  • The definitive reference for MPI is the MPI Forum Web site. It has copies of the MPI standards documents and all of the errata. This is not recommended for beginners, but is an invaluable reference.
  • Several books on MPI are available (search your favorite book sellers for availability):
    • MPI: The Complete Reference, Marc Snir et al. (an annotated version of the MPI-1 and MPI-2 standard; a 2 volume set, also known as "The orange book" and "The yellow book")
    • Using MPI, William Gropp et al. (2nd edition, also known as "The purple book")
    • Parallel Programming With MPI, Peter Pacheco
    • ...and others. This is not a definitive list!
  • The "Introduction to MPI" and "Intermediate MPI" tutorials are excellent web-based MPI instruction offered by the NCSA. This is a great place for beginners.
  • The LAM/MPI web site has links to a few tutorials.
  • Last but not least, searching for "MPI tutorial" on Google turns up a wealth of information (some good, some bad)


3. What are the goals of the Open MPI Project?

We have several top-level goals:

  • Create a free, open source, peer-reviewed, production-quality complete MPI-2 implementation.
  • Provide extremely high, competitive performance (latency, bandwidth, ...pick your favorite metric).
  • Directly involve the HPC community with external development and feedback (vendors, 3rd party researchers, users, etc.).
  • Provide a stable platform for 3rd party research and commercial development.
  • Help prevent the "forking problem" common to other MPI projects.
  • Support a wide variety of HPC platforms and environments.

In short, we want to work with and for the HPC community to make a world-class MPI-2 implementation that can be used on a huge number and kind of systems.


4. Will you allow external involvement?

ABSOLUTELY.

Bringing together smart researchers and developers to work on a common product is not only a good idea, it's the open source model. Merging the multiple MPI implementation teams has worked extremely well for us over the past year -- extending this concept to the HPC open source community is the next logical step.

The component architecture that Open MPI is founded upon (see the "Publications" link for papers about this) is designed to foster 3rd party collaboration by enabling independent developers to use Open MPI as a production quality research platform. Although Open MPI is a relatively large code base, it is rarely necessary to learn much more than the interfaces for the component type which you are implementing. Specifically, the component architecture was designed to allow small, discrete implementations of major portions of MPI functionality (e.g., point-to-point messaging, collective communications, run-time environment support, etc.).

We envision at least the following forms of collaboration:

  • Peer review of the Open MPI code base
  • Discussion with Open MPI developers on public mailing lists
  • Direct involvement from HPC software and hardware vendors
  • 3rd parties writing and providing their own Open MPI components


5. How is this software licensed?

The Open MPI code base is licensed under the new BSD license.

That being said, although we are an open source project, we recognize that not everyone provides free, open source software. Our collaboration models allow (and encourage!) 3rd parties to write and distribute their own components -- perhaps with a different license, and perhaps even as closed source. This is all perfectly acceptable (and desirable!).


6. I want to redistribute Open MPI. Can I?

Absolutely.

NOTE: We are not lawyers and this is not legal advice.

Please read the Open MPI license (the BSD license). It contains extremely liberal provisions for redistribution.


7. Preventing forking is a goal; how will you enforce that?

By definition, we can't. If someone really wants to fork the Open MPI code base, they can. By virtue of our extremely liberal license, it is possible for anyone to fork at any time.

However, we hope that no one does.

We intend to distinguish ourselves from other projects by:

  • Working with the HPC community to accept best-in-breed improvements and functionality enhancements.
  • Providing a flexible framework and set of APIs that allow a wide variety of different goals to be pursued within the same code base through the combinatorial effect of mixing and matching different components.

Hence, we hope that no one ever has a reason to fork the main code base. We intend to work with the community to accept the best improvements back into the main code base. And if some developers want to do things to the main code base that are different than the goals of the main Open MPI Project, it is our hope that they can do what they need in components that can be distributed without forking the main Open MPI code base.

Only time will tell if this ambitious plan is feasible, but we're going to work hard to make it a reality!


8. How are 3rd party contributions handled?

Before accepting any code from 3rd parties, we require an original signed contribution agreement from the donator.

These agreements assert that the contributor has the right to donate the code and allow the Open MPI Project to perpetually distribute it under the project's licensing terms.

This prevents a situation where intellectual property gets into the Open MPI code base and then someone later claims that we owe them money for it. Open MPI is a free, open source code base. And we intend it to remain that way.

The Contributing to Open MPI FAQ topic contains more information on this issue.


9. Is this just YAMPI (yet another MPI implementation)?

No!

Open MPI initially represented the merger between three well-known MPI implementations (none of which are being developed any more):

  • FT-MPI from the University of Tennessee
  • LA-MPI from Los Alamos National Laboratory
  • LAM/MPI from Indiana University

with contributions from the PACX-MPI team at the University of Stuttgart.

Each of these MPI implementations excelled in one or more areas. The driving motivation behind Open MPI is to bring the best ideas and technologies from the individual projects and create one world-class open source MPI implementation that excels in all areas.

Open MPI was started with the best of the ideas from these four MPI implementations and ported them to an entirely new code base: Open MPI. This also had the simultaneous effect of enabling us to jettison old, crufty code that was only maintained for historical reasons from each project. We started with a clean slate and decided to "do it Right this time." As such, Open MPI also contains many new designs and methodologies based on (literally) years of MPI implementation experience.

After version 1.0 was released, the Open MPI Project grew to include many other members who have each brought their knowledge, expertise, and resources to Open MPI. Open MPI is now far more than just the best ideas of the four founding MPI implementation projects.


10. But I love [FT-MPI | LA-MPI | LAM/MPI | PACX-MPI]! Why should I use Open MPI?

Here's a few reasons:

  • Open MPI represents the next generation of each of these implementations.
  • Open MPI effectively contains the union of features from each of the previous MPI projects. If you find a feature in one of the prior projects that is not in Open MPI, chances are that it will be soon.
  • The vast majority of our future research and development work will be in Open MPI.
  • All the same developers from your favorite project are working on Open MPI.

Not to worry -- each of the respective teams has a vested interest in bringing over the "best" parts of their prior implementation to Open MPI. Indeed, we would love to migrate each of our current user bases to Open MPI as their time, resources, and constraints allow.

In short: we believe that Open MPI -- its code, methodology, and open source philosophy -- is the future.


11. What will happen to the prior projects?

Only time will tell (we cannot predict the future), but it is likely that each project will eventually either end when funding stops or be used exclusively as a research vehicle. Indeed, some of the projects must continue to exist at least until their existing funding expires.


12. What operating systems does Open MPI support?

We primarily develop Open MPI on Linux, OS X, Solaris (both 32 and 64 bit on all platforms), and Windows (Windows XP, Windows HPC Server 2003/2008, and Windows 7 RC).

Open MPI is fairly POSIX-neutral, so it will run without too many modifications on most POSIX-like systems. Hence, if we haven't listed your favorite operating system here, it should not be difficult to get Open MPI to compile and run properly. The biggest obstacle is typically the assembly language, but that's fairly modular and we're happy to provide information about how to port it to new platforms.

It should be noted that we are quite open to accepting patches for operating systems that we do not currently support. If we do not have systems to test these on, we probably will only claim to "unofficially" support those systems.

Microsoft Windows support was added in v1.3.3; please see the file README.WINDOWS.


13. What hardware platforms does Open MPI support?

Essentially all the common platforms that the operating systems listed in the previous question support.

For example, Linux runs on a wide variety of platforms, and we certainly can't claim to support all of them (e.g., Open MPI does not run in an embedded environment), but we include assembly support for Intel, AMD, and PowerPC chips, for example.


14. What network interconnects does Open MPI support?

Open MPI is based upon a component architecture; its MPI point-to-point functionality only utilizes a small number of components at run-time. Adding native support for a new network interconnect was specifically designed to be easy.

Here's the list of networks that we natively support for point-to-point communication:

  • TCP / ethernet
  • Shared memory
  • Loopback (send-to-self)
  • Myrinet / GM
  • Myrinet / MX
  • Infiniband / OpenIB
  • Infiniband / mVAPI
  • Portals

Is there a network that you'd like to see supported that is not shown above? Contributions are welcome!


15. What run-time environments does Open MPI support?

Open MPI is layered on top of the Open Run-Time Environment (ORTE), which originally started as a small portion of the Open MPI code base. However, ORTE has effectively spun off into its own sub-project.

ORTE is a modular system that was specifically architected to abstract away the back-end run-time environment (RTE) system, providing a neutral API to the upper-level Open MPI layer. Components can be written for ORTE that allow it to natively utilize a wide variety of back-end RTEs.

ORTE currently natively supports the following run-time environments:

  • Recent versions of BProc (e.g., Clustermatic)
  • Sun Grid Engine
  • PBS Pro, Torque, and Open PBS (the TM system)
  • LoadLeveler
  • LSF
  • POE
  • rsh / ssh
  • SLURM
  • XGrid
  • Yod (Red Storm)

Is there a run-time system that you'd like to use Open MPI with that is not listed above? Component contributions are welcome!


16. Does Open MPI support LSF?

Starting with Open MPI v1.3, yes!

Prior to Open MPI v1.3, Platform released a script-based integration in the LSF 6.1 and 6.2 maintenance packs around November of 2006. If you want this integration, please contact your normal Platform support channels.


17. How much MPI does Open MPI support?

Open MPI 1.2 supports all of MPI-2.0.

Open MPI 1.3 supports all of MPI-2.1.


18. Is Open MPI thread safe?

Support for MPI_THREAD_MULTIPLE (i.e., multiple threads executing within the MPI library) and asynchronous message passing progress (i.e., continuing message passing operations even while no user threads are in the MPI library) has been designed into Open MPI from its first planning meetings.

Support for MPI_THREAD_MULTIPLE is included in the first version of Open MPI, but it is only lightly tested and likely still has some bugs. Support for asynchronous progress is included in the TCP point-to-point device, but it, too, has only had light testing and likely still has bugs.

Completing the testing for full support of MPI_THREAD_MULTIPLE and asynchronous progress is planned in the near future.


19. Does Open MPI support 64 bit environments?

Yes, Open MPI is 64 bit clean. You should be able to use Open MPI on 64 bit architectures and operating systems with no difficulty.


20. Does Open MPI support execution in heterogeneous environments?

As of v1.1, Open MPI requires that the size of C, C++, and Fortran datatypes be the same on all platforms within a single parallel application with the exception of types represented by MPI_BOOL and MPI_LOGICAL -- size differences in these types between processes are properly handled. Endian differences between processes in a single MPI job are properly and automatically handled.

Prior to v1.1, Open MPI did not include any support for data size or endian heterogeneity.


21. Does Open MPI support parallel debuggers?

Yes. Open MPI supports the TotalView API for parallel process attaching, which several parallel debuggers support (e.g., DDT, fx2). As part of v1.2.4 (released in September 2007), Open MPI also supports the TotalView API for viewing message queues in running MPI processes.

See this FAQ entry for details on how to run Open MPI jobs under TotalView, and this FAQ entry for details on how to run Open MPI jobs under DDT.

NOTE: The integration of Open MPI message queue support is problematic with 64 bit versions of TotalView prior to v8.3:

  • The message queues views will be truncated
  • Both the communicators and requests list will be incomplete
  • Both the communicators and requests list may be filled with wrong values (such as an MPI_Send to the destination ANY_SOURCE)

There are two workarounds:

  • Use a 32 bit version of TotalView
  • Upgrade to TotalView v8.3


22. Can I contribute to Open MPI?

YES!

One of the main goals of the Open MPI project is to involve the greater HPC community.

There are many ways to contribute to Open MPI. Here are a few:

  • Subscribe to the mailing lists and become active in the discussions
  • Obtain a source code checkout of Open MPI's code base and start looking through the code (be sure to see the Developers category for technical details about the code base)
  • Write your own components and distribute them yourself (i.e., outside of the main Open MPI distribution)
  • Write your own components and contribute them back to the main code base
  • Contribute bug fixes and feature enhancements to the main code base


23. I found a bug! How do I report it?

First, check the FAQ and the mailing list archives to see whether your issue is already known. If you can't find your problem mentioned anywhere, it is most helpful if you can create a "recipe" to replicate the bug.

Please see the Getting Help page for more details on submitting bug reports.


24. What license is Open MPI distributed under?

Open MPI is distributed under the BSD license.


25. How do I contribute code to Open MPI?

Similar to the Apache projects, before you contribute any code to the Open MPI code base, you must first print out, sign, and submit an Open MPI Third Party Contribution Agreement.

NOTE: We are not lawyers and this is not legal advice.

We need to have an established intellectual property pedigree of the code in Open MPI. This means being able to ensure that all code included in Open MPI is free, open source, and able to be distributed under the BSD license. This prevents a situation where intellectual property gets into the Open MPI code base and then someone later claims that we owe them money for it. Open MPI is a free, open source code base. And we intend it to remain that way.

We enforce this policy by requiring all code contributors to submit a signed Open MPI Third Party Contribution Agreement before we can accept any code from them. These agreements assert that the contributor has the right to donate the code and allow the Open MPI Project to perpetually distribute it under the project's licensing terms.

There are two versions of this agreement: one for individuals, and one for organizations. Ensure that you use the correct form; for example, some companies own all the code produced by their employees, so even if you write code in your spare time, it may still be the intellectual property of your employer.

Send an original, signed copy to the address on the form.

We must have a copy of this agreement on file before we can accept code into the Open MPI code base.


26. I can't submit an Open MPI Third Party Contribution Agreement; how can I contribute to Open MPI?

Fear not.

Although we cannot accept code from you, there are still plenty of other ways to contribute to Open MPI. Here are some examples:

  • Become an active participant in the mailing lists
  • Write and distribute your own components (remember: Open MPI components can be distributed completely separately from the main Open MPI distribution -- they can be added to existing Open MPI installations, and don't even need to be open source)
  • Report bugs
  • Do a good deed daily


27. What if I don't want my contribution to be free / open source?

No problem.

While we are creating free / open-source software, and we would prefer if everyone's contributions to Open MPI were also free / open-source, we certainly recognize that other organizations have different goals than we do. Such is the reality of software development in today's global economy.

As such, it is perfectly acceptable to make non-free / non-open-source contributions to Open MPI.

We obviously cannot accept such contributions into the main code base, but you are free to distribute plugins, enhancements, etc. as you see fit. Indeed, the BSD license is extremely liberal in its redistribution provisions.

Please also see this FAQ entry about forking the Open MPI code base.


28. I want to fork the Open MPI code base. Can I?

Yes... but we'd prefer if you didn't.

Although Open MPI's license allows third parties to fork the code base, we would strongly prefer if you did not. Forking is not necessarily a Bad Thing, but history has shown that creating too many forks in MPI implementations leads to massive user and system administrator confusion. We have personally seen parallel environments loaded with tens of MPI implementations, each only slightly different from the others. The users then become responsible for figuring out which MPI they want / need to use, which can be a daunting and confusing task.

We do periodically have "short" forks. Specifically, sometimes an organization needs to release a version of Open MPI with a specific feature.

If you're thinking of forking the Open MPI code base, please let us know -- let's see if we can work something out so that it is not necessary.


29. Rats! My contribution was not accepted into the main Open MPI code base. What now?

If your contribution was not accepted into the main Open MPI code base, there are likely to be good reasons for it (perhaps technical, perhaps due to licensing restrictions, etc.).

If you wrote a standalone component, you can still distribute this component independent of the main Open MPI distribution. Open MPI components can be installed into existing Open MPI installations. As such, you can distribute your component -- even if it is closed source (e.g., distributed as binary-only) -- via any mechanism you choose, such as on a web site, FTP site, etc.


30. Open MPI terminology

Open MPI is a large project containing many different sub-systems and a relatively large code base. Let's first cover some fundamental terminology in order to make the rest of the discussion easier.

Open MPI has three sections of code:

  • OMPI: The MPI API and supporting logic
  • ORTE: The Open Run-Time Environment (support for different back-end run-time systems)
  • OPAL: The Open Portable Access Layer (utility and "glue" code used by OMPI and ORTE)

There are strict abstraction barriers in the code between these sections. That is, they are compiled into three separate libraries (libmpi, libopen-rte, and libopen-pal) with a strict dependency order: OMPI depends on ORTE and OPAL, and ORTE depends on OPAL. More specifically, OMPI executables are linked with:

shell$ mpicc myapp.c -o myapp
# This actually turns into:
shell$ cc myapp.c -o myapp -lmpi -lopen-rte -lopen-pal ...

More system-level libraries may be listed after -lopen-pal, but you get the idea.
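
If you want to see exactly what the wrapper compiler will invoke on your system, you can ask it directly with the --showme option (discussed further in the wrapper compiler questions later in this FAQ):

shell$ mpicc myapp.c -o myapp --showme

This prints the underlying compiler command line, including the library list shown above, without actually running it.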

Strictly speaking, these are not "layers" in the classic software engineering sense (even though it is convenient to refer to them as such). They are listed above in dependency order, but that does not mean that, for example, the OMPI code must go through the ORTE and OPAL code in order to reach the operating system or a network interface.

As such, this code organization reflects abstractions and software engineering more than a strict hierarchy of functions that must be traversed in order to reach a lower layer. For example, OMPI can call OPAL functions directly -- it does not have to go through ORTE. Indeed, OPAL has a different set of purposes than ORTE, so it wouldn't even make sense to channel all OPAL access through ORTE. OMPI can also directly call the operating system as necessary. For example, many top-level MPI API functions are quite performance sensitive; it would not make sense to force them to traverse an arbitrarily deep call stack just to move some bytes across a network.

Here are some terms that are frequently used in discussions about the Open MPI code base:

  • Framework: a construct created for a single, targeted purpose (e.g., MPI point-to-point transport). A framework defines the interfaces that its components must implement.
  • Component: an implementation of a framework's interface; this is what is colloquially called a "plugin."
  • Module: a run-time instance of a component (loosely analogous to a C++ object being an instance of a class).

Frameworks, components, and modules can be dynamic or static. That is, they can be available as plugins or they may be compiled statically into libraries (e.g., libmpi).


31. How do I get a copy of the most recent source code?

See the instructions here.
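
For reference, the development trunk has historically been available via anonymous Subversion read access; a checkout typically looks something like the following (verify the repository URL against the instructions linked above):

shell$ svn checkout http://svn.open-mpi.org/svn/ompi/trunk ompi-trunk
shell$ cd ompi-trunk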


32. Ok, I got a Subversion checkout. Now how do I build it?

See the instructions here.
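
As a rough sketch (the linked instructions are authoritative), a build from a Subversion checkout first generates the configure script, which requires recent versions of the GNU Autotools to be installed:

shell$ ./autogen.sh
shell$ ./configure --prefix=/path/to/install
shell$ make all install

The --prefix value shown here is just a placeholder; pick an installation location appropriate for your system.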


33. What is the main tree layout of the Open MPI source tree? Are there directory name conventions?

There are a few notable top-level directories in the source tree:

  • config/: M4 scripts supporting the top-level configure script
  • etc/: Some miscellaneous text files
  • include/: Top-level include files that will be installed (e.g., mpi.h)
  • ompi/: The Open MPI code base
  • orte/: The Open RTE code base
  • opal/: The OPAL code base

Each of the three main source directories (ompi/, orte/, and opal/) generates a top-level library, named libmpi, libopen-rte, and libopen-pal, respectively. They can be built as either static or shared libraries. Executables are also produced in subdirectories of some of the trees.

Each of the sub-project source directories has a similar (but not identical) directory structure under it:

  • class/: C++-like "classes" (using the OPAL class system) specific to this project
  • include/: Top-level include files specific to this project
  • mca/: MCA frameworks and components specific to this project
  • runtime/: Startup and shutdown of this project at runtime
  • tools/: Executables specific to this project (currently none in OPAL)
  • util/: Random utility code

There are other top-level directories in each of the three sub-projects, each having to do with specific logic and code for that project. For example, the MPI API implementations can be found under ompi/mpi/LANGUAGE, where LANGUAGE is c, cxx, f77, and f90.

The layout of the mca/ trees is strictly defined. They are of the form:

<project>/mca/<framework name>/<component name>/

To be explicit: it is forbidden to have a directory under the mca trees that does not meet this template (with the exception of base directories, explained below). Hence, only framework and component code can be in the mca/ trees.

Framework and component names must be valid directory names (and valid C identifiers; more on that later). For example, the TCP BTL component is located in the following directory:

ompi/mca/btl/tcp/

The name base is reserved; there cannot be a framework or component named "base." Directories named base are reserved for the implementation of the MCA and the frameworks. Here are a few examples:

# Main implementation of the MCA
opal/mca/base

# Implementation of the paffinity framework
opal/mca/paffinity/base

# Implementation of the pls framework
orte/mca/pls/base

# Implementation of the pml framework
ompi/mca/pml/base

Under these mandated directories, frameworks and/or components may have arbitrary directory structures, however.
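
For example, listing one framework's directory in a source checkout shows its base directory plus one subdirectory per component. The exact set of components varies by Open MPI version and by which support libraries were found at configure time, so the listing below is purely illustrative:

shell$ ls ompi/mca/btl
base/  gm/  mvapi/  mx/  openib/  self/  sm/  tcp/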


34. Is there more information available?

Yes. In early 2006, Cisco hosted an Open MPI workshop where the Open MPI Team provided several days of intensive dive-into-the-code tutorials. The slides from these tutorials are available here.

Additionally, Greenplum videoed several Open MPI developers discussing Open MPI internals in 2012. The videos are available here.


35. More coming...

There are more questions / answers coming... stay tuned...


36. I'm a sysadmin; what do I care about Open MPI?

Several members of the Open MPI team have strong system administrator backgrounds; we recognize the value of having software that is friendly to system administrators. Here are some of the reasons that Open MPI is attractive for system administrators:

  • Simple, standards-based installation
  • Help reduce the number of MPI installations
  • Ability to set system-level and user-level parameters
  • Scriptable information sources about the Open MPI installation

See the rest of the questions in the FAQ section for more details.


37. What hardware / software / run-time environments / networks does Open MPI support?

See this FAQ category for more information


38. Do I need multiple Open MPI installations?

Yes and no.

Open MPI can handle a variety of different run-time environments (e.g., rsh/ssh, SLURM, PBS, etc.) and a variety of different interconnection networks (e.g., ethernet, Myrinet, Infiniband, etc.) in a single installation. Specifically: because Open MPI is fundamentally powered by a component architecture, plug-ins for all these different run-time systems and interconnect networks can be installed in a single installation tree. The relevant plug-ins will only be used in the environments where they make sense.

Hence, there is no need to have one MPI installation for Myrinet, one MPI installation for Ethernet, one MPI installation for PBS, one MPI installation for rsh, etc. Open MPI can handle all of these in a single installation.

However, there are some issues that Open MPI cannot solve. Binary compatibility between different compilers is such an issue. Let's examine this on a per-language basis (be sure to see the big caveat at the end):

  • C: Most C compilers are fairly compatible, such that if you compile Open MPI with one C compiler and link it to an application that was compiled with a different C compiler, everything "should just work." As such, a single installation of Open MPI should work for most C MPI applications.

  • C++: The same is not necessarily true for C++. Most of Open MPI's C++ code is simply the MPI C++ bindings, and in the default build, they are inlined C++ code, meaning that they should compile on any C++ compiler. Hence, you should be able to have one Open MPI installation for multiple different C++ compilers (we'd like to hear feedback either way). That being said, some of the top-level Open MPI executables are written in C++ (e.g., mpicc, ompi_info, etc.). As such, these applications may require the C++ run-time support libraries of whatever compiler they were created with in order to run properly. Specifically, if you compile Open MPI with the XYZ C/C++ compiler, you may need to have the XYZ C++ run-time libraries installed everywhere you want to run mpicc or ompi_info.

  • Fortran 77: Fortran 77 compilers do something called "symbol mangling," meaning that they change the names of global variables, subroutines, and functions. There are 4 common name mangling schemes in use by Fortran 77 compilers. On many systems (e.g., Linux), Open MPI will automatically support all 4 schemes. As such, a single Open MPI installation should just work with multiple different Fortran compilers. However, on some systems, this is not possible (e.g., OS X), and Open MPI will only support the name mangling scheme of the Fortran 77 compiler that was identified during configure. (A quick way to check which scheme a given compiler uses is shown at the end of this answer.)

    Also, there are two notable exceptions that do not work across Fortran compilers that are "different enough":

    1. The C constants MPI_F_STATUS_IGNORE and MPI_F_STATUSES_IGNORE will only compare properly in Fortran applications that were created with Fortran compilers that use the same name-mangling scheme as the Fortran compiler that Open MPI was configured with.

    2. Fortran compilers may have different values for the logical .TRUE. constant. As such, any MPI function that uses the Fortran LOGICAL type may only get .TRUE. values back that correspond to the .TRUE. value of the Fortran compiler that Open MPI was configured with.

  • Fortran 90: Similar to C++, linking object files from different Fortran 90 compilers is not likely to work. The F90 MPI module that Open MPI creates will likely only work with the Fortran 90 compiler that was identified during configure.

The big caveat to all of this is that Open MPI will only work with different compilers if all the datatype sizes are the same. For example, even though Open MPI supports all 4 name mangling schemes, the size of the Fortran LOGICAL type may be 1 byte in some compilers and 4 bytes in others. This will likely cause Open MPI to perform unpredictably.

The bottom line is that Open MPI can support all manner of run-time systems and interconnects in a single installation, but supporting multiple compilers "sort of" works (i.e., is subject to trial and error) in some cases, and definitely does not work in other cases. There's unfortunately little that we can do about this -- it's a compiler compatibility issue, and one that compiler authors have little incentive to resolve.
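
As promised above, here is a quick (if low-level) way to check which name-mangling scheme a given Fortran compiler uses. The file and subroutine names are placeholders: my_sub.f is assumed to contain nothing but an empty subroutine named my_sub, and gfortran stands in for whatever Fortran compiler you are checking:

shell$ gfortran -c my_sub.f
shell$ nm my_sub.o | grep -i my_sub

Depending on the compiler, the symbol will show up as my_sub, my_sub_, my_sub__, or MY_SUB -- the four schemes referred to above.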


39. What are MCA Parameters? Why would I set them?

MCA parameters are a way to tweak Open MPI's behavior at run-time. For example, MCA parameters can specify:

  • Which interconnect networks to use
  • Which interconnect networks not to use
  • The size difference between eager sends and rendezvous protocol sends
  • How many registered buffers to pre-pin (e.g., for GM or mVAPI)
  • The size of the pre-pinned registered buffers
  • ...etc.

It can be quite valuable for a system administrator to play with such values a bit and find an "optimal" setting for a particular operating environment. These values can then be set in a global text file that all users will, by default, inherit when they run Open MPI jobs.

For example, say that you have a cluster with 2 ethernet networks -- one for NFS and other system-level operations, and one for MPI jobs. The system administrator can tell Open MPI to not use the NFS TCP network at a system level, such that when users invoke mpirun or mpiexec to launch their jobs, they will automatically only be using the network meant for MPI jobs.
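
As a concrete illustration of that scenario (assuming the NFS/administrative network is on the eth0 interface, and with my_mpi_app standing in for a real application), a user or administrator could keep MPI traffic off that network for a single job from the command line, or via an environment variable; the system-wide configuration-file approach is shown in the system administrator question below:

shell$ mpirun --mca btl_tcp_if_exclude lo,eth0 -np 4 ./my_mpi_app

shell$ export OMPI_MCA_btl_tcp_if_exclude=lo,eth0
shell$ mpirun -np 4 ./my_mpi_app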

See the run-time tuning FAQ category for information on how to set global MCA parameters.


40. Do my users need to have their own installation of Open MPI?

Usually not. It is typically sufficient for a single Open MPI installation (or perhaps a small number of Open MPI installations, depending on compiler interoperability) to serve an entire parallel operating environment.

Indeed, a system-wide Open MPI installation can be customized on a per-user basis in two important ways:

  • Per-user MCA parameters: Each user can set their own set of MCA parameters, potentially overriding system-wide defaults (see the example below).
  • Per-user plug-ins: Users can install their own Open MPI plug-ins under $HOME/.openmpi/components. Hence, developers can experiment with new components without de-stabilizing the rest of the users on the system. Or power users can download 3rd party components (perhaps even research-quality components) without affecting other users.
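
For instance, per-user MCA parameters are read from a plain text file named $HOME/.openmpi/mca-params.conf. A hypothetical user who wants verbose BTL selection output by default could put the following in that file (btl_base_verbose is used purely as an illustration; use ompi_info to see which parameters your installation actually supports):

shell$ cat $HOME/.openmpi/mca-params.conf
btl_base_verbose = 10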


41. I have power users who will want to override my global MCA parameters; is this possible?

Absolutely.

See the run-time tuning FAQ category for information on how to set MCA parameters, both at the system level and on a per-user (or per-MPI-job) basis.


42. What MCA parameters should I, the system administrator, set?

This is a difficult question and depends on both your specific parallel setup and the applications that typically run there.

The best thing to do is to use the ompi_info command to see what parameters are available and relevant to you. Specifically, ompi_info can be used to show all the parameters that are available for each plug-in. Two common places that system administrators like to tweak are:

  • Only allow specific networks: Say you have a cluster with a high-speed interconnect (such as Myrinet or Infiniband) and an ethernet network. The high-speed network is intended for MPI jobs; the ethernet network is intended for NFS and other administrative-level jobs. In this case, you can simply turn off Open MPI's TCP support. The "btl" framework contains Open MPI's network support; in this case, you want to disable the tcp plug-in. You can do this by adding the following line in the file $prefix/etc/openmpi-mca-params.conf:

    btl = ^tcp
    

    This tells Open MPI to load all BTL components except tcp.

    Consider another example: your cluster has two TCP networks, one for NFS and administration-level jobs, and another for MPI jobs. You can tell Open MPI to ignore the TCP network used by NFS by adding the following line in the file $prefix/etc/openmpi-mca-params.conf:

    btl_tcp_if_exclude = lo,eth0
    

    The value of this parameter is the device names to exclude. In this case, we're excluding lo (localhost, because Open MPI has its own internal loopback device) and eth0.

  • Tune the parameters for specific networks: Each network plug-in has a variety of different tunable parameters. Use the ompi_info command to see what is available. You can show all available parameters with:

    shell$ ompi_info --param all all
    

    Beware: there are many variables available. You can limit the output by showing all the parameters in a specific framework or in a specific plug-in with the command line parameters:

    shell$ ompi_info --param btl all
    

    This shows all the parameters of all BTL components, and:

    shell$ ompi_info --param btl mvapi
    

    This shows all the parameters of just the mvapi BTL component.


43. I just added a new plugin to my Open MPI installation; do I need to recompile all my MPI apps?

If your installation of Open MPI uses shared libraries and components are standalone plug-in files, then no. If you add a new component (such as support for a new network), Open MPI will simply open the new plugin at run-time -- your applications do not need to be recompiled or re-linked.


44. I just upgraded my Myrinet|Infiniband network; do I need to recompile all my MPI apps?

If your installation of Open MPI uses shared libraries and components are standalone plug-in files, then no. You simply need to recompile the Open MPI components that support that network and re-install them.

More specifically, Open MPI shifts the dependency on the underlying network away from the MPI applications and to the Open MPI plug-ins. This is a major advantage over many other MPI implementations.

MPI applications will simply open the new plugin when they run.


45. We just upgraded our version of Open MPI; do I need to recompile all my MPI apps?

It is unlikely. Most MPI applications solely interact with Open MPI through the standardized MPI API and the constant values it publishes in mpi.h. The MPI-2 API will not change until the MPI Forum changes it.

We will try hard to make Open MPI's mpi.h stable such that the values will not change from release-to-release. While we cannot guarantee that they will stay the same forever, we'll try hard to make it so.


46. I have an MPI application compiled for another MPI; will it work with Open MPI?

It is highly unlikely. Open MPI does not attempt to interface with other MPI implementations, nor with executables that were compiled for them. Sorry!

MPI applications need to be compiled and linked with Open MPI in order to run under Open MPI.


47. What is "fault tolerance"?

The phrase "fault tolerance" means many things to many people. Typical definitions range from user processes dumping vital state to disk periodically to checkpoint/restart of running processes to elaborate recreate-process-state-from-incremental-pieces schemes to ... (you get the idea).

In the scope of Open MPI, we typically define "fault tolerance" to mean the ability to recover from one or more component failures in a well defined manner with either a transparent or application-directed mechanism. Component failures may exhibit themselves as a corrupted transmission over a faulty network interface or the failure of one or more serial or parallel processes due to a processor or node failure. Open MPI strives to provide the application with a consistent system view while still providing a production quality, high performance implementation.

Yes, that's pretty much as all-inclusive as possible -- intentionally so! Remember that in addition to being a production-quality MPI implementation, Open MPI is also a vehicle for research. So while some forms of "fault tolerance" are more widely accepted and used, others are certainly of valid academic interest.


48. What fault tolerance techniques does Open MPI plan on supporting?

Open MPI plans on supporting the following fault tolerance techniques:

  • Coordinated and uncoordinated process checkpoint and restart. Similar to those implemented in LAM/MPI and MPICH-V, respectively.
  • Message logging techniques. Similar to those implemented in MPICH-V.
  • Data reliability and network fault tolerance. Similar to those implemented in LA-MPI.
  • User-directed and communicator-driven fault tolerance. Similar to those implemented in FT-MPI.

The Open MPI team does not intend to limit its fault tolerance techniques to those mentioned above, but rather to extend beyond them in the future.


49. Does Open MPI support checkpoint and restart of parallel jobs (similar to LAM/MPI)?

Yes. The v1.3 series was the first release series of Open MPI to include support for the transparent, coordinated checkpointing and restarting of MPI processes (similar to LAM/MPI).

Open MPI supports both the BLCR checkpoint/restart system and a "self" checkpointer that allows applications to perform their own checkpoint/restart functionality while taking advantage of the Open MPI checkpoint/restart infrastructure. For both of these, Open MPI provides a coordinated checkpoint/restart protocol and integration with a variety of network interconnects, including shared memory, Ethernet, InfiniBand, and Myrinet.

The implementation introduces a series of new frameworks and components designed to support a variety of checkpoint and restart techniques. This allows us to support the methods described above (application-directed, BLCR, etc.) as well as other kinds of checkpoint/restart systems (e.g., Condor, libckpt) and protocols (e.g., uncoordinated, message induced).

Note: The checkpoint/restart support was last released as part of the v1.6 series. The v1.7 series and the Open MPI trunk do not support this functionality (most of the code is present in the repository, but it is known to be non-functional in most cases). This feature is looking for a maintainer. Interested parties should inquire on the developers mailing list.


50. Where can I find the fault tolerance development work?

The end-to-end MPI message data reliability work is being actively developed on the subversion trunk (i.e., reliable message passing over unreliable networks). See this FAQ entry for more details.

The coordinated checkpoint and restart process fault tolerance work is currently available on the Open MPI development trunk and in the v1.3 release series. For more information about how to use this feature see the following websites:

For information on the Fault Tolerant MPI prototype in Open MPI see the links below:


51. Does Open MPI support end-to-end data reliability in MPI message passing?

The current release of Open MPI does not support end-to-end data reliability in message passing any more than the underlying network already guarantees. Future releases of Open MPI will include explicit data reliability support (i.e., more functionality than is provided by the underlying network).

Specifically, the data reliability ("dr") PML component (available on the trunk, but not yet in a stable release) assumes that the underlying network is unreliable. It can drop / restart connections, retransmit corrupted or lost data, etc. The end effect is that data sent through MPI API functions will be guaranteed to be reliable.
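
If you are running a build that includes the dr PML (e.g., a trunk build), you could request it explicitly at run time with an MCA parameter. This is only a sketch -- the application name and process count below are placeholders:

# Request the "dr" PML component for this run (hypothetical example)
shell$ mpirun --mca pml dr -np 4 ./my_mpi_app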

For example, if you're using TCP as a message transport, chances of data corruption are fairly low. However, other interconnects do not guarantee that data will be uncorrupted when traveling across the network. Additionally, there are nonzero possibilities that data can be corrupted while traversing PCI buses, etc. (some corruption errors at this level can be caught/fixed, others cannot). Such errors are not uncommon at high altitudes (!).

Note that such added reliability does incur a performance cost -- latency and bandwidth suffer when Open MPI performs the consistency checks that are necessary to provide such guarantees.

Many clusters/networks will not need data reliability. But some do (e.g., those operating at high altitudes). The dr PML is intended for environments where reliability is an issue; users are willing to tolerate slightly slower applications in order to guarantee that their job does not crash (or worse, produce wrong answers).


52. How do I build Open MPI?

If you have obtained a developer's checkout from Subversion, skip this FAQ question and consult these directions.

For everyone else, in general, all you need to do is expand the tarball, run the provided configure script, and then run "make all install". For example:

shell$ gunzip -c openmpi-1.8.1.tar.gz | tar xf -
shell$ cd openmpi-1.8.1
shell$ ./configure --prefix=/usr/local
<...lots of output...>
shell$ make all install

Note that the configure script supports a lot of different command line options. For example, the --prefix option in the above example tells Open MPI to install under the directory /usr/local/.

Other notable configure options are required to support specific network interconnects and back-end run-time environments. More generally, Open MPI supports a wide variety of hardware and environments, but it sometimes needs to be told where support libraries and header files are located.

Consult the README file in the Open MPI tarball and the output of "configure --help" for specific instructions regarding Open MPI's configure command line options.


53. Wow -- I see a lot of errors during configure. Is that normal?

If configure finishes successfully -- meaning that it generates a bunch of Makefiles at the end -- then yes, it is completely normal.

The Open MPI configure script tests for a lot of things, not all of which are expected to succeed. For example, if you do not have Myrinet's GM library installed, you'll see failures about trying to find the GM library. You'll also see errors and warnings about various operating-system-specific tests that are not aimed at the operating system you are running.

These are all normal, expected, and nothing to be concerned about. It just means, for example, that Open MPI will not build Myrinet GM support.


54. What are the default build options for Open MPI?

If you have obtained a developer's checkout from Subversion, you must consult these directions.

The default options for building an Open MPI tarball are:

  • Compile Open MPI with all optimizations enabled
  • Build shared libraries
  • Build components as standalone dynamic shared object (DSO) files (i.e., run-time plugins)
  • Try to find support for all hardware and environments by looking for support libraries and header files in standard locations; skip them if not found

Open MPI's configure script has a large number of options, several of which are of the form --with-<FOO>(=DIR), usually with a corresponding --with-<FOO>-libdir=DIR option. The (=DIR) part means that specifying the directory is optional. Here are some examples (explained in more detail below):

  • --with-openib(=DIR) and --with-openib-libdir=DIR
  • --with-mx(=DIR) and --with-mx-libdir=DIR
  • --with-psm(=DIR) and --with-psm-libdir=DIR
  • ...etc.

As mentioned above, by default, Open MPI will try to build support for every feature that it can find on your system. If support for a given feature is not found, Open MPI will simply skip building support for it (this usually means not building a specific plugin).

"Support" for a given feature usually means finding both the relevant header and library files for that feature. As such, the command-line switches listed above are used to override default behavior and allow specifying whether you want support for a given feature or not, and if you do want support, where the header files and/or library files are located (which is useful if they are not located in compiler/linker default search paths). Specifically:

  • If --without-<FOO> is specified, Open MPI will not even look for support for feature FOO. It will be treated as if support for that feature was not found (i.e., it will be skipped).
  • If --with-<FOO> is specified with no optional directory, Open MPI's configure script will abort if it cannot find support for the FOO feature. More specifically, only compiler/linker default search paths will be searched while looking for the relevant header and library files. This option essentially tells Open MPI, "Yes, I want support for FOO -- it is an error if you don't find support for it."
  • If --with-<FOO>=/some/path is specified, it is essentially the same as specifying --with-<FOO> but also tells Open MPI to add -I/some/path/include to compiler search paths, and try (in order) adding -L/some/path/lib and -L/some/path/lib64 to linker search paths when searching for FOO support. If found, the relevant compiler/linker paths are added to Open MPI's general build flags. This option is helpful when support for feature FOO is not found in default search paths.
  • If --with-<FOO>-libdir=/some/path/lib is specified, it only specifies that if Open MPI searches for FOO support, it should use /some/path/lib for the linker search path.
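
For example, a hypothetical invocation that insists on OpenFabrics support installed under a non-default prefix (the /opt/ofed paths are just placeholders) might look like:

shell$ ./configure --with-openib=/opt/ofed \
  --with-openib-libdir=/opt/ofed/lib64 ...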

In general, it is usually sufficient to run Open MPI's configure script with no --with-<FOO> options if all the features you need supported are in default compiler/linker search paths. If the features you need are not in default compiler/linker search paths, you'll likely need to specify --with-<FOO> kinds of flags. However, note that it is safest to add --with-<FOO> types of flags if you want to guarantee that Open MPI builds support for feature FOO, regardless of whether support for FOO can be found in default compiler/linker paths or not -- configure will abort if it cannot find the appropriate support for FOO. This may be preferable to unexpectedly discovering at run-time that Open MPI is missing support for a critical feature.

Be sure to note the difference in the directory specification between --with-<FOO> and --with-<FOO>-libdir. The former takes a top-level directory (such that "/include", "/lib", and "/lib64" are appended to it) while the latter takes a single directory where the library is assumed to exist (i.e., nothing is suffixed to it).

Finally, note that starting with Open MPI v1.3, configure will sanity check to ensure that any directory given to --with-<FOO> or --with-<FOO>-libdir actually exists and will error if it does not. This prevents typos and mistakes in directory names, and prevents Open MPI from accidentally using a compiler/linker-default path to satisfy FOO's header and library files.


55. Open MPI was pre-installed on my machine; should I overwrite it with a new version?

Probably not.

Many systems come with some version of Open MPI pre-installed (e.g., many Linux distributions, BSD variants, and OS X). If you download a newer version of Open MPI from this web site (or one of the Open MPI mirrors), you probably do not want to overwrite the system-installed Open MPI. This is because the system-installed Open MPI is typically under the control of some software package management system (rpm, yum, etc.).

Instead, you probably want to install your new version of Open MPI to another path, such as /opt/openmpi- (or whatever is appropriate for your system).
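
For example, reusing the version number from the build example earlier in this FAQ (the prefix is just an illustration):

shell$ ./configure --prefix=/opt/openmpi-1.8.1
shell$ make all install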

This FAQ entry also has much more information about strategies for where to install Open MPI.


56. Where should I install Open MPI?

A common environment to run Open MPI is in a "Beowulf"-class or similar cluster (e.g., a bunch of 1U servers in a bunch of racks). Simply stated, Open MPI can run on a group of servers or workstations connected by a network. As mentioned above, there are several prerequisites, however (for example, you typically must have an account on all the machines, and you must be able to ssh or rsh between the nodes without using a password, etc.).

This raises the question for Open MPI system administrators: where to install the Open MPI binaries, header files, etc.? This discussion mainly addresses this question for homogeneous clusters (i.e., where all nodes and operating systems are the same), although elements of this discussion apply to heterogeneous clusters as well. Heterogeneous admins are encouraged to read this discussion and then see the heterogeneous section of this FAQ.

There are two common approaches:

  1. Have a common filesystem, such as NFS, between all the machines to be used. Install Open MPI such that the installation directory is the same value on each node. This will greatly simplify users' shell startup scripts (e.g., .bashrc, .cshrc, .profile, etc.) -- the PATH can be set without checking which machine the user is on (see the example after this list). It also simplifies the system administrator's job; when the time comes to patch or otherwise upgrade OMPI, only one copy needs to be modified.

    For example, consider a cluster of four machines: inky, blinky, pinky, and clyde.

    • Install Open MPI on inky's local hard drive in the directory /opt/openmpi-1.8.1. The system administrator then mounts inky:/opt/openmpi-1.8.1 on the remaining three machines, such that /opt/openmpi-1.8.1 on all machines is effectively "the same". That is, the following directories all contain the Open MPI installation:

      inky:/opt/openmpi-1.8.1
      blinky:/opt/openmpi-1.8.1
      pinky:/opt/openmpi-1.8.1
      clyde:/opt/openmpi-1.8.1
      

    • Install Open MPI on inky's local hard drive in the directory /usr/local/openmpi-1.8.1. The system administrator then mounts inky:/usr/local/openmpi-1.8.1 on all four machines in some other common location, such as /opt/openmpi-1.8.1 (a symbolic link can be installed on inky instead of a mount point for efficiency). This strategy is typically used in environments where one tree is NFS-exported, but a different tree is used as the actual installation location. For example, the following directories all contain the Open MPI installation:

      inky:/opt/openmpi-1.8.1
      blinky:/opt/openmpi-1.8.1
      pinky:/opt/openmpi-1.8.1
      clyde:/opt/openmpi-1.8.1
      

      Notice that there are the same four directories as the previous example, but on inky, the directory is actually located in /usr/local/openmpi-1.8.1.

    There is a bit of a disadvantage in this approach; each of the remote nodes has to incur NFS (or whatever networked filesystem is used) delays to access the Open MPI directory tree. However, the ease of administration and the relatively low cost of using a networked filesystem usually greatly outweigh this penalty. Indeed, once an MPI application is past MPI_INIT, it doesn't use the Open MPI binaries very much.

    NOTE: Open MPI, by default, uses a plugin system for loading functionality at run-time. Most of Open MPI's plugins are opened during the call to MPI_INIT. This can cause a lot of filesystem traffic, which, if Open MPI is installed on a networked filesystem, may be noticeable. Two common options to avoid this extra filesystem traffic are to build Open MPI to not use plugins (see this FAQ entry for details) or to install Open MPI locally (see below).

  2. If you are concerned with networked filesystem costs of accessing the Open MPI binaries, you can install Open MPI on the local hard drive of each node in your system. Again, it is highly advisable to install Open MPI in the same directory on each node so that each user's PATH can be set to the same value, regardless of the node that a user has logged on to.

    This approach will save some network latency of accessing the Open MPI binaries, but is typically only used where users are very concerned about squeezing every spare cycle out of their machines, or are running at extreme scale where a networked filesystem may get overwhelmed by filesystem requests for Open MPI binaries when running very large parallel jobs.
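
Whichever approach you choose, each user's shell startup file can then contain something like the following (a sketch; bash syntax is assumed, and the prefix matches the example installation above):

# Example ~/.bashrc additions, assuming the common prefix /opt/openmpi-1.8.1
export PATH=/opt/openmpi-1.8.1/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi-1.8.1/lib:$LD_LIBRARY_PATH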


57. Should I install a new version of Open MPI over an old version?

We do not recommend this.

Before discussing specifics, here are some definitions that are necessary to understand:

  • Source tree: The tree where the Open MPI source code is located. It is typically the result of expanding an Open MPI distribution source code bundle, such as a tarball.
  • Build tree: The tree where Open MPI was built. It is always related to a specific source tree, but may actually be a different tree (since Open MPI supports VPATH builds). Specifically, this is the tree where you invoked configure, make, etc. to build and install Open MPI.
  • Installation tree: The tree where Open MPI was installed. It is typically the "prefix" argument given to Open MPI's configure script; it is the directory from which you run installed Open MPI executables.

In its default configuration, an Open MPI installation consists of several shared libraries, header files, executables, and plugins (dynamic shared objects -- DSOs). These installation files act together as a single entity. The specific filenames and contents of these files are subject to change between different versions of Open MPI.

KEY POINT: Installing one version of Open MPI does not uninstall another version.

If you install a new version of Open MPI over an older version, this may not remove or overwrite all the files from the older version. Hence, you may end up with an incompatible muddle of files from two different installations -- which can cause problems.

The Open MPI team recommends one of the following methods for upgrading your Open MPI installation:

  • Install newer versions of Open MPI into a different directory. For example, install into /opt/openmpi-a.b.c and /opt/openmpi-x.y.z for versions a.b.c and x.y.z, respectively.
  • Completely uninstall the old version of Open MPI before installing the new version. The make uninstall process from the Open MPI a.b.c build tree should completely uninstall that version from the installation tree, making it safe to install a new version (e.g., version x.y.z) into the same installation tree (a sketch of this flow is shown after this list).
  • Remove the old installation directory entirely and then install the new version. For example, "rm -rf /opt/openmpi" (assuming that there is nothing else of value in this tree!). The installation of Open MPI x.y.z will safely re-create the /opt/openmpi tree. This method is preferable if you no longer have the source and build trees to Open MPI a.b.c available from which to "make uninstall".
  • Go into the Open MPI a.b.c installation directory and manually remove all old Open MPI files. Then install Open MPI x.y.z into the same installation directory. This can be a somewhat painful, annoying, and error-prone process. We do not recommend it. Indeed, if you no longer have access to the original Open MPI a.b.c source and build trees, it may be far simpler to download Open MPI version a.b.c again from the Open MPI web site, configure it with the same installation prefix, and then run "make uninstall". Or use one of the other methods, above.
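
For example, a sketch of the "uninstall, then install" flow described above (all paths and version numbers below are placeholders):

# From the old version's build tree:
shell$ cd /path/to/openmpi-a.b.c/build
shell$ make uninstall
# Then build and install the new version into the same prefix:
shell$ cd /path/to/openmpi-x.y.z
shell$ ./configure --prefix=/opt/openmpi
shell$ make all install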


58. Can I disable Open MPI's use of plugins?

Yes.

Open MPI uses plugins for much of its functionality. Specifically, Open MPI looks for and loads plugins as dynamic shared objects (DSOs) during the call to MPI_INIT. However, these plugins can be compiled and installed in several different ways:

  1. As DSOs: In this mode (the default), each of Open MPI's plugins is compiled as a separate DSO that is dynamically loaded at run time.
    • Advantage: this approach is highly flexible -- it gives system developers and administrators a fine-grained way to add new plugins to an existing Open MPI installation, and also allows the removal of old plugins (i.e., forcibly disallowing the use of specific plugins) simply by removing the corresponding DSO(s).
    • Disadvantage: this approach causes additional filesystem traffic (mostly during MPI_INIT). If Open MPI is installed on a networked filesystem, this can cause noticeable network traffic when a large parallel job starts, for example.
  2. As part of a larger library: In this mode, Open MPI "slurps up" the plugins and includes them in libmpi (and other libraries). Hence, all plugins are included in the main Open MPI libraries that are loaded by the system linker before an MPI process even starts.
    • Advantage: Significantly less filesystem traffic than the DSO approach. This model can be much more performant on network installations of Open MPI.
    • Disadvantage: Much less flexible than the DSO approach; system administrators and developers have significantly less ability to add/remove plugins from the Open MPI installation at run-time. Note that you still have some ability to add/remove plugins (see below), but there are limitations to what can be done.

To be clear: Open MPI's plugins can be built either as standalone DSOs or included in Open MPI's main libraries (e.g., libmpi). Additionally, Open MPI's main libraries can be built either as static or shared libraries.

You can therefore choose to build Open MPI in one of several different ways:

  1. --disable-mca-dso: Using the --disable-mca-dso switch to Open MPI's configure script will cause all plugins to be built as part of Open MPI's main libraries -- they will not be built as standalone DSOs. However, Open MPI will still look for DSOs in the filesystem at run-time. Specifically: this option significantly decreases (but does not eliminate) filesystem traffic during MPI_INIT, but does allow the flexibility of adding new plugins to an existing Open MPI installation.

    Note that the --disable-mca-dso option does not affect whether Open MPI's main libraries are built as static or shared.

  2. --enable-static: Using this option to Open MPI's configure script will cause the building of static libraries (e.g., libmpi.a). This option automatically implies --disable-mca-dso.

    Note that --enable-shared is also the default; so if you use --enable-static, Open MPI will build both static and shared libraries that contain all of Open MPI's plugins (i.e., libmpi.so and libmpi.a). If you want only static libraries (that contain all of Open MPI's plugins), be sure to also use --disable-shared.

  3. --disable-dlopen: Using this option to Open MPI's configure script will do two things:
    1. Imply --disable-mca-dso, meaning that all plugins will be slurped into Open MPI's libraries.
    2. Cause Open MPI to not look for / open any DSOs at run time.

    Specifically: this option makes Open MPI not incur any additional filesystem traffic during MPI_INIT. Note that the --disable-dlopen option does not affect whether Open MPI's main libraries are built as static or shared.
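
To summarize, hypothetical configure invocations for the three approaches above might look like:

# Plugins rolled into the libraries, but DSOs still opened at run-time:
shell$ ./configure --disable-mca-dso ...
# Static-only libraries (implies --disable-mca-dso):
shell$ ./configure --enable-static --disable-shared ...
# Plugins rolled in AND no DSOs opened at run-time:
shell$ ./configure --disable-dlopen ...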


59. How do I build an optimized version of Open MPI?

If you have obtained a developer's checkout from Subversion (or Mercurial), you must consult these directions.

Building Open MPI from a tarball defaults to building an optimized version. There is no need to do anything special.


60. Are VPATH and/or parallel builds supported?

Yes, both VPATH and parallel builds are supported. This allows Open MPI to be built in a different directory than where its source code resides (helpful for multi-architecture builds). Open MPI uses Automake for its build system, which supports both.

For example:

shell$ gtar zxf openmpi-1.2.3.tar.gz
shell$ cd openmpi-1.2.3
shell$ mkdir build
shell$ cd build
shell$ ../configure ...
<... lots of output ...>
shell$ make -j 4

Running configure from a different directory from where it actually resides triggers the VPATH build (i.e., Open MPI will configure and build itself in the directory where configure was invoked, not in the directory where the configure script resides).

Some versions of make support parallel builds. The example above shows GNU make's "-j" option, which specifies how many compile processes may be executing at any given time. We, the Open MPI Team, have found that setting this to two to four times the number of processors in a machine can significantly speed up an Open MPI compile (since compiles tend to be much more IO-bound than CPU-bound).


61. Do I need any special tools to build Open MPI?

If you are building Open MPI from a tarball, you need a C compiler, a C++ compiler, and make. If you are building the Fortran 77 and/or Fortran 90 MPI bindings, you will need compilers for these languages as well. You do not need any special version of the GNU "Auto" tools (Autoconf, Automake, Libtool).

If you are building Open MPI from a Subversion checkout, you need some additional tools. See the Subversion access pages for more information.


62. How do I build Open MPI as a static library?

As noted above, Open MPI defaults to building shared libraries and building components as dynamic shared objects (DSOs, i.e., run-time plugins). Changing this build behavior is controlled via command line options to Open MPI's configure script.

Building static libraries: You can disable building shared libraries and enable building static libraries with the following options:

shell$ ./configure --enable-static --disable-shared ...

Similarly, you can build both static and shared libraries by simply specifying --enable-static (and not specifying --disable-shared), if desired.

Including components in libraries: Instead of building components as DSOs, they can also be "rolled up" and included in their respective libraries (e.g., libmpi). This is controlled with the --enable-mca-static option. Some examples:

shell$ ./configure --enable-mca-static=pml ...
shell$ ./configure --enable-mca-static=pml,btl-openib,btl-self ...

Specifically, entire frameworks and/or individual components can be specified to be rolled up into the library in a comma-separated list as an argument to --enable-mca-static.


63. When I run 'make', it looks very much like the build system is going into a loop.

Open MPI uses the GNU Automake software to build itself. Automake uses a tightly-woven set of file timestamp-based dependencies to compile and link software. This apparent looping behavior, frequently paired with messages similar to:

Warning: File `Makefile.am' has modification time 3.6e+04 s in the future

typically means that you are building on a networked filesystem where the local time of the client machine that you are building on does not match the time on the network filesystem server. This will result in files with incorrect timestamps, and Automake degenerates into undefined behavior.

Two solutions are possible:

  1. Ensure that the time between your network filesystem server and client(s) is the same. This can be accomplished in a variety of ways and is dependent upon your local setup; one method is to use an NTP daemon to synchronize all machines to a common time server.
  2. Build on a local disk filesystem where network timestamps are not a factor.

After implementing one of the two options, you will likely need to re-run configure. Then Open MPI should build successfully.


64. Configure issues warnings about sed and unterminated commands

Some users have reported seeing warnings like this in the final output from configure:

*** Final output 
configure: creating ./config.status 
config.status: creating ompi/include/ompi/version.h 
sed: file ./confstatA1BhUF/subs-3.sed line 33: unterminated `s' command 
sed: file ./confstatA1BhUF/subs-4.sed line 4: unterminated `s' command 
config.status: creating orte/include/orte/version.h 

These messages usually indicate a problem in the user's local shell configuration. Ensure that when you run a new shell, no output is sent to stdout. For example, if the output of this simple shell script is more than just the hostname of your computer, you need to go check your shell startup files to see where the extraneous output is coming from (and eliminate it):


#!/bin/sh
`hostname`
exit 0


65. Open MPI configured ok, but I get "Makefile:602: *** missing separator" kinds of errors when building

This is usually an indication that configure succeeded but really shouldn't have. See this FAQ entry for one possible cause.


66. Open MPI seems to default to building with the GNU compiler set. Can I use other compilers?

Yes.

Open MPI uses a standard Autoconf "configure" script to probe the current system and figure out how to build itself. One of the choices it makes is which compiler set to use. Since Autoconf is a GNU product, it defaults to the GNU compiler set. However, this is easily overridden on the configure command line. For example, to build Open MPI with the Intel compiler suite:

shell$ ./configure CC=icc CXX=icpc F77=ifort FC=ifort ...

Note that you can include additional parameters to configure, implied by the "..." clause in the example above.

In particular, 4 switches on the configure command line are used to specify the compiler suite:

  • CC: Specifies the C compiler
  • CXX: Specifies the C++ compiler
  • F77: Specifies the Fortran 77 compiler
  • FC: Specifies the Fortran 90 compiler

NOTE: The Open MPI team recommends using a single compiler suite whenever possible. Unexpected or undefined behavior can occur when you mix compiler suites in unsupported ways (e.g., mixing Fortran 77 and Fortran 90 compilers between different compiler suites is almost guaranteed not to work).

Here are some more examples for common compilers:

# Portland compilers
shell$ ./configure CC=pgcc CXX=pgCC F77=pgf77 FC=pgf90
# Pathscale compilers
shell$ ./configure CC=pathcc CXX=pathCC F77=pathf90 FC=pathf90
# Oracle Solaris Studio (Sun) compilers
shell$ ./configure CC=cc CXX=CC F77=f77 FC=f90

In all cases, the compilers must be found in your PATH and be able to successfully compile and link non-MPI applications before Open MPI will be able to be built properly.


67. Can I pass specific flags to the compilers / linker used to build Open MPI?

Yes.

Open MPI uses a standard Autoconf configure script to set itself up for building. As such, there are a number of command line options that can be passed to configure to customize flags that are passed to the underlying compiler to build Open MPI:

  • CFLAGS: Flags passed to the C compiler.
  • CXXFLAGS: Flags passed to the C++ compiler.
  • FFLAGS: Flags passed to the Fortran 77 compiler.
  • FCFLAGS: Flags passed to the Fortran 90 compiler.
  • LDFLAGS: Flags passed to the linker (not language-specific). This flag is rarely required; Open MPI will usually pick up all LDFLAGS that it needs by itself.
  • LIBS: Extra libraries to link to Open MPI (not language-specific). This flag is rarely required; Open MPI will usually pick up all LIBS that it needs by itself.
  • LD_LIBRARY_PATH: Note that we do not recommend setting LD_LIBRARY_PATH via configure, but it is worth noting that you should ensure that your LD_LIBRARY_PATH value is appropriate for your build. Some users have been tripped up, for example, by specifying a non-default Fortran compiler to FC and F77, but then having Open MPI's configure script fail because the LD_LIBRARY_PATH wasn't set properly to point to that Fortran compiler's support libraries.

Note that the flags you specify must be compatible across all the compilers. In particular, flags specified to one language compiler must generate code that can be compiled and linked against code that is generated by the other language compilers. For example, on a 64 bit system where the compiler default is to build 32 bit executables:

# Assuming the GNU compiler suite
shell$ ./configure CFLAGS=-m64 ...

will produce 64 bit C objects, but 32 bit objects for C++, Fortran 77, and Fortran 90. These objects will be incompatible with each other, and Open MPI will not build successfully. Instead, you must specify building 64 bit objects for all languages:

# Assuming the GNU compiler suite
shell$ ./configure CFLAGS=-m64 CXXFLAGS=-m64 FFLAGS=-m64 FCFLAGS=-m64 ...

The above command line will pass "-m64" to all four compilers, and therefore will produce 64 bit objects for all languages.


68. I'm trying to build with the Intel compilers, but Open MPI eventually fails to compile with really long error messages. What do I do?

A common mistake when building Open MPI with the Intel compiler suite is to accidentally specify the Intel C compiler as the C++ compiler. Specifically, recent versions of the Intel compiler renamed the C++ compiler "icpc" (it used to be "icc", the same as the C compiler). Users accustomed to the old name tend to specify "icc" as the C++ compiler, which will then cause a failure late in the Open MPI build process because a C++ code will be compiled with the C compiler. Bad Things then happen.

The solution is to be sure to specify that the C++ compiler is "icpc", not "icc". For example:

shell$ ./configure CC=icc CXX=icpc F77=ifort FC=ifort ...

For Googling purposes, here are some of the error messages that may be issued when Open MPI's C++ code is compiled with the Intel C compiler (icc), in no particular order:

IPO Error: unresolved : _ZNSsD1Ev
IPO Error: unresolved : _ZdlPv
IPO Error: unresolved : _ZNKSs4sizeEv
components.o(.text+0x17): In function `ompi_info::open_components()':
: undefined reference to `std::basic_string, std::allocator >::basic_string()'
components.o(.text+0x64): In function `ompi_info::open_components()':
: undefined reference to `std::basic_string, std::allocator >::basic_string()'
components.o(.text+0x70): In function `ompi_info::open_components()':
: undefined reference to `std::string::size() const'
components.o(.text+0x7d): In function `ompi_info::open_components()':
: undefined reference to `std::string::reserve(unsigned int)'
components.o(.text+0x8d): In function `ompi_info::open_components()':
: undefined reference to `std::string::append(char const*, unsigned int)'
components.o(.text+0x9a): In function `ompi_info::open_components()':
: undefined reference to `std::string::append(std::string const&)'
components.o(.text+0xaa): In function `ompi_info::open_components()':
: undefined reference to `std::string::operator=(std::string const&)'
components.o(.text+0xb3): In function `ompi_info::open_components()':
: undefined reference to `std::basic_string, std::allocator >::~basic_string()'

There are many more error messages, but the above should be sufficient for someone trying to find this FAQ entry via a web crawler.


69. When I build with the Intel compiler suite, linking user MPI applications with the wrapper compilers results in warning messages. What do I do?

When Open MPI was built with some versions of the Intel compilers on some platforms, you may see warnings similar to the following when compiling MPI applications with Open MPI's wrapper compilers:

shell$ mpicc hello.c -o hello
libimf.so: warning: warning: feupdateenv is not implemented and will always fail
shell$ 

This warning is generally harmless, but it can be alarming to some users. To remove this warning, pass either the -shared-intel or -i-dynamic options when linking your MPI application (the specific option depends on your version of the Intel compilers; consult your local documentation):

shell$ mpicc hello.c -o hello -shared-intel
shell$ 

You can also change the default behavior of Open MPI's wrapper compilers to automatically include this -shared-intel flag so that it is unnecessary to specify it on the command line when linking MPI applications.
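
One way to do this (a sketch; the exact flag depends on your Intel compiler version) is to bake the flag into the wrapper compilers when Open MPI itself is configured:

shell$ ./configure CC=icc CXX=icpc F77=ifort FC=ifort \
  --with-wrapper-ldflags=-shared-intel ...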


70. I'm trying to build with the IBM compilers, but Open MPI eventually fails to compile. What do I do?

Unfortunately there are some problems between Libtool (which Open MPI uses for library support) and the IBM compilers when creating shared libraries. Currently the only workaround is to disable shared libraries and build Open MPI statically. For example:

shell$ ./configure CC=xlc CXX=xlc++ F77=xlf FC=xlf90 --disable-shared --enable-static ...

For Googling purposes, here are some error messages that may be issued when the build fails:

xlc: 1501-216 command option --whole-archive is not recognized - passed to ld
xlc: 1501-216 command option --no-whole-archive is not recognized - passed to ld
xlc: 1501-218 file libopen-pal.so.0 contains an incorrect file suffix
xlc: 1501-228 input file libopen-pal.so.0 not found


71. I'm trying to build with the Oracle Solaris Studio (Sun) compilers on Linux, but Open MPI eventually fails to compile. What do I do?

Below are some known issues that impact Oracle Solaris Studio 12 Open MPI builds. The easiest way to work around them is simply to use the latest version of the Oracle Solaris Studio 12 compilers.


72. What configure options should I use when building with the Oracle Solaris Studio (Sun) compilers?

The below configure options are suggested for use with the Oracle Solaris Studio (Sun) compilers:

--enable-heterogeneous
--enable-cxx-exceptions
--enable-shared
--enable-orterun-prefix-by-default
--enable-mpi-f90
--with-mpi-f90-size=small
--disable-mpi-threads
--disable-progress-threads
--disable-debug

Linux only:

--with-openib
--without-udapl
--disable-openib-ibcm (only in v1.5.4 and earlier)

Solaris x86 only:

CFLAGS="-xtarget=generic -xarch=sse2 -xprefetch -xprefetch_level=2 -xvector=simd -xdepend=yes -xbuiltin=%all -xO5"
FFLAGS="-xtarget=generic -xarch=sse2 -xprefetch -xprefetch_level=2 -xvector=simd -stackvar -xO5"

Solaris SPARC only:

CFLAGS="-xtarget=ultra3 -m32 -xarch=sparcvis2 -xprefetch -xprefetch_level=2 -xvector=lib -xdepend=yes -xbuiltin=%all -xO5"
FFLAGS="-xtarget=ultra3 -m32 -xarch=sparcvis2 -xprefetch -xprefetch_level=2 -xvector=lib -stackvar -xO5"


73. When building with the Oracle Solaris Studio 12 Update 1 (Sun) compilers on x86 Linux, the compiler loops on btl_sm.c. Is there a workaround?

Apply Sun patch 141859-04.

You may also consider updating your Oracle Solaris Studio compilers to the latest Oracle Solaris Studio Express.


74. How do I build OpenMPI on IBM QS22 cell blade machines with GCC and XLC/XLF compilers?

You can use the following two scripts (contributed by IBM) to build Open MPI on QS22.

Script to build OpenMPI using the GCC compiler

#!/bin/bash
export PREFIX=/usr/local/openmpi-1.2.7_gcc

./configure \
        CC=ppu-gcc CPP=ppu-cpp CXX=ppu-c++ CFLAGS=-m64 \
        CXXFLAGS=-m64 FC=ppu-gfortran  FCFLAGS=-m64 \
        FFLAGS=-m64 CCASFLAGS=-m64 LDFLAGS=-m64 \
        --prefix=$PREFIX \
        --with-platform=optimized \
        --disable-mpi-profile \
        --with-openib=/usr \
        --enable-ltdl-convenience \
        --with-wrapper-cflags=-m64 \
        --with-wrapper-ldflags=-m64 \
        --with-wrapper-fflags=-m64 \
        --with-wrapper-fcflags=-m64

make
make install

cat <<EOF >> $PREFIX/etc/openmpi-mca-params.conf
mpi_paffinity_alone = 1
mpi_leave_pinned = 1
btl_openib_want_fork_support = 0
EOF

cp config.status $PREFIX/config.status


Script to build OpenMPI using XLC and XLF compilers

#!/bin/bash
#
export PREFIX=/usr/local/openmpi-1.2.7_xl

./configure --prefix=$PREFIX \
            --with-platform=optimized \
            --disable-shared --enable-static \
            CC=ppuxlc CXX=ppuxlc++ F77=ppuxlf FC=ppuxlf90 LD=ppuld \
            --disable-mpi-profile \
            --disable-heterogeneous \
            --with-openib=/usr \
            CFLAGS="-q64 -O3 -qarch=cellppu -qtune=cellppu" \
            CXXFLAGS="-q64 -O3 -qarch=cellppu -qtune=cellppu" \
            FFLAGS="-q64 -O3 -qarch=cellppu -qtune=cellppu" \
            FCFLAGS="-q64 -O3 -qarch=cellppu -qtune=cellppu" \
            CCASFLAGS="-q64 -O3 -qarch=cellppu -qtune=cellppu" \
            LDFLAGS="-q64 -O3 -qarch=cellppu -qtune=cellppu" \
            --enable-ltdl-convenience \
            --with-wrapper-cflags="-q64 -O3 -qarch=cellppu -qtune=cellppu" \
            --with-wrapper-ldflags="-q64 -O3 -qarch=cellppu -qtune=cellppu" \
            --with-wrapper-fflags="-q64 -O3 -qarch=cellppu -qtune=cellppu" \
            --with-wrapper-fcflags="-q64 -O3 -qarch=cellppu -qtune=cellppu" \
            --enable-contrib-no-build=libnbc,vt

make
make install

cat <<EOF >> $PREFIX/etc/openmpi-mca-params.conf
mpi_paffinity_alone = 1
mpi_leave_pinned = 1
btl_openib_want_fork_support = 0
EOF

cp config.status $PREFIX/config.status


75. I'm trying to build with the PathScale 3.0 and 3.1 compilers on Linux, but all Open MPI commands seg fault. What do I do?

The PathScale compiler authors have identified a bug in the v3.0 and v3.1 versions of their compiler; you must disable certain "builtin" functions when building Open MPI:

  1. With PathScale 3.0 and 3.1 compilers use the workaround options -O2 and -fno-builtin in CFLAGS across the Open MPI build. For example:

    shell$ ./configure CFLAGS="-O2 -fno-builtin" ...
    

  2. With PathScale 3.2 beta and later, no workaround options are required.


76. All MPI C++ API functions return errors (or otherwise fail) when Open MPI is compiled with the PathScale compilers. What do I do?

This is an old issue that seems to be a problem when Pathscale uses a back-end GCC 3.x compiler. Here's a proposed solution from the Pathscale support team (from July 2010):

The proposed work-around is to install gcc-4.x on the system and use the pathCC -gnu4 option. Newer versions of the compiler (4.x and beyond) should have this fixed, but we'll have to test to confirm it's actually fixed and working correctly.

We don't anticipate that this will be much of a problem for Open MPI users these days (our informal testing shows that not many users are still using GCC 3.x), but this information is provided so that it is Google-able for those still using older compilers.


77. How do I build Open MPI with support for Open IB (Infiniband), mVAPI (Infiniband), GM (Myrinet), and/or MX (Myrinet)?

To build support for high-speed interconnect networks, you generally only have to specify the directory where its support header files and libraries were installed to Open MPI's configure script. You can specify where multiple packages were installed if you have support for more than one kind of interconnect -- Open MPI will build support for as many as it can.

You tell configure where support libraries are with the appropriate --with command line switch. Here is the list of available switches:

  • --with-openib=<dir>: Build support for OpenFabrics networks (previously known as "Open IB"), i.e., InfiniBand and iWARP -- note that iWARP support was added in the v1.3 series.
  • --with-mvapi=<dir>: Build support for mVAPI (Infiniband -- note that mVAPI support has been removed in the v1.3 series).
  • --with-gm=<dir>: Build support for GM (Myrinet).
  • --with-mx=<dir>: Build support for MX (Myrinet).

For example:

shell$ ./configure --with-mvapi=/path/to/mvapi/installation \
  --with-gm=/path/to/gm/installation

These switches enable Open MPI's configure script to automatically find all the right header files and libraries to support the various networks that you specified.

You can verify that configure found everything properly by examining its output -- it will test for each network's header files and libraries and report whether it will build support (or not) for each of them. Examining configure's output is the first place you should look if you have a problem with Open MPI not correctly supporting a specific network type.

If configure indicates that support for your networks will be included, after you build and install Open MPI, you can run the "ompi_info" command and look for components for your networks. The v1.2 (and earlier) series has two openib components (your exact version numbers may be different):

shell$ ompi_info | grep openib
               MCA mpool: openib (MCA v1.0, API v1.0, Component v1.0)
                 MCA btl: openib (MCA v1.0, API v1.0, Component v1.0)

mVAPI components will be named "mvapi", GM components will be named "gm", and MX components will be named "mx".

Note that the v1.3 series removed the "openib" mpool component and also removed all support for mVAPI.


78. How do I build Open MPI with support for SLURM / XGrid?

SLURM support is built automatically; there is nothing that you need to do.

XGrid support is built automatically if the XGrid tools are installed.
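
As with the other run-time environments described below, you can sanity-check an installation by looking for the corresponding components in the ompi_info output (component names and versions vary across Open MPI releases):

shell$ ompi_info | grep slurm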


79. How do I build Open MPI with support for SGE?

Support for SGE first appeared in the Open MPI v1.2 series. The method for configuring it is slightly different between Open MPI v1.2 and v1.3.

For Open MPI v1.2, no extra configure arguments are needed as SGE support is built in automatically. After Open MPI is installed, you should see two components named gridengine.

shell$ ompi_info | grep gridengine
                 MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.5)
                 MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.5)

For Open MPI v1.3, you need to explicitly request the SGE support with the "--with-sge" command line switch to the Open MPI configure script. For example:

shell$ ./configure --with-sge

After Open MPI is installed, you should see one component named gridengine.

shell$ ompi_info | grep gridengine
                 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)

Open MPI v1.3 only has the one specific gridengine component as the other functionality was rolled into other components.

Component versions may vary depending on the version of Open MPI 1.2 or 1.3 you are using.


80. How do I build Open MPI with support for PBS Pro / Open PBS / Torque?

Support for PBS Pro, Open PBS, and Torque must be explicitly requested with the "--with-tm" command line switch to Open MPI's configure script. In general, the procedure is the same as building support for high-speed interconnect networks, except that you use --with-tm. For example:

shell$ ./configure --with-tm=/path/to/pbs_or_torque/installation

After Open MPI is installed, you should see two components named "tm":

shell$ ompi_info | grep tm
                 MCA pls: tm (MCA v1.0, API v1.0, Component v1.0)
                 MCA ras: tm (MCA v1.0, API v1.0, Component v1.0)

Specific frameworks and version numbers may vary, depending on your version of Open MPI.

NOTE: Update to the note below (May 2006): Torque 2.1.0p0 now includes support for shared libraries, and the workarounds listed below are no longer necessary. However, this version of Torque changed other things that require upgrading Open MPI to 1.0.3 or higher (as of this writing, v1.0.3 has not yet been released -- nightly snapshot tarballs of what will become 1.0.3 are available at http://www.open-mpi.org/nightly/v1.0/).

NOTE: As of this writing (October 2006), Open PBS and PBS Pro do not ship shared libraries (i.e., they only include static libraries). Because of this, you may run into linking errors when Open MPI tries to create dynamic plugin components for TM support on some platforms. Notably, on at least some 64 bit Linux platforms (e.g., AMD64), trying to create a dynamic plugin that links against a static library will result in error messages such as:

relocation R_X86_64_32S against `a local symbol' can not be used when
making a shared object; recompile with -fPIC

Note that recent versions of Torque (as of October 2006) have started shipping shared libraries and this issue does not occur.

There are two possible solutions in Open MPI 1.0.x:

  1. Recompile your PBS implementation with "-fPIC" (or whatever the relevant flag is for your compiler to generate position-independent code) and re-install. This will allow Open MPI to generate dynamic plugins with the PBS/Torque libraries properly.

    PRO: Open MPI enjoys the benefits of shared libraries and dynamic plugins.

    CON: Dynamic plugins can use more memory at run-time (e.g., operating systems tend to align each plugin on a page, rather than densely packing them all into a single library).

    CON: This is not possible for binary-only vendor distributions (such as PBS Pro).

  2. Configure Open MPI to build a static library that includes all of its components. Specifically, all of Open MPI's components will be included in its libraries -- none will be discovered and opened at run-time. This does not affect user MPI code at all (i.e., the location of Open MPI's plugins is transparent to MPI applications). Use the following options to Open MPI's configure script:

    shell$ ./configure --disable-shared --enable-static ...
    

    Note that this option only changes the location of Open MPI's default set of plugins (i.e., they are included in libmpi and friends rather than being standalone dynamic shared objects that are found/opened at run-time). This option does not change the fact that Open MPI will still try to open other dynamic plugins at run-time.

    PRO: This works with binary-only vendor distributions (e.g., PBS Pro).

    CON: User applications are statically linked to Open MPI; if Open MPI -- or any of its default set of components -- is updated, users will need to re-link their MPI applications.

Both methods work equally well, but there are tradeoffs; each site will likely need to make its own determination of which to use.


81. How do I build Open MPI with support for LoadLeveler?

Support for LoadLeveler will be automatically built if the LoadLeveler libraries and headers are in the default path. If not, support must be explicitly requested with the "--with-loadleveler" command line switch to Open MPI's configure script. In general, the procedure is the same as building support for high-speed interconnect networks, except that you use --with-loadleveler. For example:

shell$ ./configure --with-loadleveler=/path/to/LoadLeveler/installation

After Open MPI is installed, you should see one or more components named "loadleveler":

shell$ ompi_info | grep loadleveler
                 MCA ras: loadleveler (MCA v1.0, API v1.3, Component v1.3)

Specific frameworks and version numbers may vary, depending on your version of Open MPI.


82. How do I build Open MPI with support for Platform LSF?

Note that only Platform LSF 7.0.2 and later is supported.

Support for LSF will be automatically built if the LSF libraries and headers are in the default path. If not, support must be explicitly requested with the "--with-lsf" command line switch to Open MPI's configure script. In general, the procedure is the same as building support for high-speed interconnect networks, except that you use --with-lsf. For example:

shell$ ./configure --with-lsf=/path/to/lsf/installation

After Open MPI is installed, you should see a component named "lsf":

shell$ ompi_info | grep lsf
                 MCA ess: lsf (MCA v2.0, API v1.3, Component v1.3)
                 MCA ras: lsf (MCA v2.0, API v1.3, Component v1.3)
                 MCA plm: lsf (MCA v2.0, API v1.3, Component v1.3)

Specific frameworks and version numbers may vary, depending on your version of Open MPI.


83. How do I build Open MPI with processor affinity support?

Open MPI currently only supports processor affinity for some platforms. In general, processor affinity will automatically be built if it is supported -- no additional command line flags to configure should be necessary.
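
One way to confirm that affinity support was built (a hypothetical check; the relevant framework is named paffinity in older Open MPI versions and may differ in newer ones) is to look for the corresponding components:

shell$ ompi_info | grep paffinity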

See this FAQ entry for more details.


84. How do I build Open MPI with memory affinity / NUMA support (e.g., libnuma)?

Open MPI currently only supports libnuma memory affinity for Linux-based systems (please let us know if there are other NUMA libraries that you need supported!).

Support for libnuma must be explicitly requested with the "--with-libnuma" command line switch to Open MPI's configure script. In general, the procedure is the same as building support for high-speed interconnect networks, except that you use --with-libnuma. For example:

shell$ ./configure --with-libnuma=/path/to/libnuma/installation

After Open MPI is installed, you should see an maffinity component named "libnuma":

shell$ ompi_info | grep libnuma
           MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.0)

Specific frameworks and version numbers may vary, depending on your version of Open MPI.

See this FAQ entry for more details.


85. How do I build Open MPI with CUDA-aware support?

CUDA-aware support means that the MPI library can send and receive GPU buffers directly. This feature exists in the Open MPI 1.7 series and later. The support is being continuously updated so different levels of support exist in different versions.

Configuring Open MPI 1.7, MPI 1.7.1 and 1.7.2

  --with-cuda(=DIR)       Build cuda support, optionally adding DIR/include,
                          DIR/lib, and DIR/lib64
  --with-cuda-libdir=DIR  Search for cuda libraries in DIR

Here are some examples of configure commands that enable CUDA support.

1. Searches in default locations. Looks for cuda.h in /usr/local/cuda/include and libcuda.so in /usr/lib64.

 ./configure --with-cuda

2. Searches for cuda.h in /usr/local/cuda-v4.0/cuda/include and libcuda.so in default location of /usr/lib64.

 ./configure --with-cuda=/usr/local/cuda-v4.0/cuda

3. Searches for cuda.h in /usr/local/cuda-v4.0/cuda/include and libcuda.so in /usr/lib64. (same as previous one)

 ./configure --with-cuda=/usr/local/cuda-v4.0/cuda --with-cuda-libdir=/usr/lib64

If the cuda.h or libcuda.so files cannot be found, then the configure will abort.

Note: There is a bug in Open MPI 1.7.2 such that you will get an error if you configure the library with --enable-static. To get around this error, add the following to your configure line and reconfigure. This disables the build of the PML BFO, which is largely unused anyway. This bug is fixed in Open MPI 1.7.3.

 --enable-mca-no-build=pml-bfo

Configuring Open MPI 1.7.3 and later

With Open MPI 1.7.3 and later the libcuda.so library is loaded dynamically so there is no need to specify a path to it at configure time. Therefore, all you need is the path to the cuda.h header file.

1. Searches in default locations. Looks for cuda.h in /usr/local/cuda/include.

 ./configure --with-cuda

2. Searches for cuda.h in /usr/local/cuda-v5.0/cuda/include.

 ./configure --with-cuda=/usr/local/cuda-v5.0/cuda

Note that you cannot configure with --disable-dlopen as that will break the ability of the Open MPI library to dynamically load libcuda.so.

See this FAQ entry for details on how to use the CUDA support.


86. How do I not build a specific plugin / component for Open MPI?

The --enable-mca-no-build option to Open MPI's configure script enables you to specify a list of components that you want to skip building. This allows you to omit support for specific features from Open MPI if you do not want them.

It takes a single argument: a comma-delimited list of framework/component pairs indicating which specific components you do not want to build. For example:

shell$ ./configure --enable-mca-no-build=paffinity-linux,timer-solaris

Note that this option is really only useful for components that would otherwise be built. For example, if you are on a machine without Myrinet support, it is not necessary to specify:

shell$ ./configure --enable-mca-no-build=btl-gm

because the configure script will naturally see that you do not have support for GM and will automatically skip the gm BTL component.


87. What other options to configure exist?

There are many options to Open MPI's configure script. Please run the following to get a full list (including a short description of each option):

shell$ ./configure --help


88. Why does compiling the Fortran 90 bindings take soooo long?

NOTE: Starting with Open MPI v1.7, if you are not using gfortran, building the Fortran 90 and '08 bindings does not suffer the same performance penalty that previous versions incurred. The Open MPI developers encourage all users to upgrade to the new Fortran bindings implementation -- including the new MPI-3 Fortran '08 bindings -- when possible.

This is actually a design problem with the MPI F90 bindings themselves. The issue is that since F90 is a strongly typed language, we have to overload each function that takes a choice buffer with a typed buffer. For example, MPI_SEND has many different overloaded versions -- one for each type of the user buffer. Specifically, there is an MPI_SEND that has the following types for the first argument:

  • logical*1, logical*2, logical*4, logical*8, logical*16 (if supported)
  • integer*1, integer*2, integer*4, integer*8, integer*16 (if supported)
  • real*4, real*8, real*16 (if supported)
  • complex*8, complex*16, complex*32 (if supported)
  • character

On the surface, this is 17 bindings for MPI_SEND. Multiply this by every MPI function that takes a choice buffer (50) and you get 850 overloaded functions. However, the problem gets worse -- for each type, we also have to overload for each array dimension that needs to be supported. Fortran allows up to 7-dimensional arrays, so this becomes (17x7) = 119 versions of every MPI function that has a choice buffer argument. This makes (17x7x50) = 5,950 MPI interface functions.

To make matters even worse, consider the ~25 MPI functions that take 2 choice buffers. Functions have to be provided for all possible combinations of types. This then becomes exponential -- the total number of interface functions balloons up to 6.8M.

Additionally, F90 modules must all have their functions in a single source file. Hence, all 6.8M functions must be in one .f90 file and compiled as a single unit (currently, no F90 compiler that we are aware of can handle 6.8M interface functions in a single module).

To limit this problem, Open MPI, by default, does not generate interface functions for any of the 2-buffer MPI functions. Additionally, we limit the maximum number of supported dimensions to 4 (instead of 7). This means that we're generating (17x4x50) = 3,400 interface functions in a single F90 module. So it's far smaller than 6.8M functions, but it's still quite a lot.

This is what makes compiling the F90 module take so long.

Note, however, you can limit the maximum number of dimensions that Open MPI will generate for the F90 bindings with the configure switch --with-f90-max-array-dim=DIM, where DIM is an integer <= 7. The default value is 4. Decreasing this value makes the compilation go faster, but obviously supports fewer dimensions.
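
For example, a minimal sketch that trades away support for higher-dimension arrays in exchange for a faster build (the value 2 is just an illustration):

shell$ ./configure --with-f90-max-array-dim=2 [...your other configure arguments...]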

Other than this limit on dimension size, there is little else that we can do -- the MPI-2 F90 bindings were unfortunately not well thought out in this regard.

Note, however, that the Open MPI team has proposed Fortran '03 bindings for MPI in a paper that was presented at the Euro PVM/MPI'05 conference. These bindings avoid all the scalability problems that are described above and have some other nice properties.

This is something that is being worked on in Open MPI, but there is currently no estimated timeframe for when it will be available.


89. Does Open MPI support MPI_REAL16 and MPI_COMPLEX32?

It depends. Note that these datatypes are optional in the MPI standard.

Prior to v1.3, Open MPI supported MPI_REAL16 and MPI_COMPLEX32 if a portable C integer type could be found that was the same size (measured in bytes) as Fortran's REAL*16 type. It was later discovered that even though the sizes may be the same, the bit representations between C and Fortran may be different. Since Open MPI's reduction routines are implemented in C, calling MPI_REDUCE (and related functions) with MPI_REAL16 or MPI_COMPLEX32 would generate undefined results (although message passing with these types in homogeneous environments generally worked fine).

As such, Open MPI v1.3 made the test for supporting MPI_REAL16 and MPI_COMPLEX32 more stringent: Open MPI will support these types only if:

  • An integer C type can be found that has the same size (measured in bytes) as the Fortran REAL*16 type.
  • The bit representation is the same between the C type and the Fortran type.

Version 1.3.0 only checks for portable C types (e.g., long double). A future version of Open MPI may include support for compiler-specific / non-portable C types. For example, the Intel compiler has specific options for creating a C type that is the same as REAL*16, but we did not have time to include this support in Open MPI v1.3.0.


90. Can I re-locate my Open MPI installation without re-configuring/re-compiling/re-installing from source?

Starting with Open MPI v1.2.1, yes.

Background: Open MPI hard-codes some directory paths in its executables based on installation paths specified by the configure script. For example, if you configure with an installation prefix of /opt/openmpi/, Open MPI encodes in its executables that it should be able to find its help files in /opt/openmpi/share/openmpi.

The "installdirs" functionality in Open MPI lets you change any of these hard-coded directory paths at run time (assuming that you have already adjusted your PATH and/or LD_LIBRARY_PATH environment variables to the new location where Open MPI now resides). There are three methods:

  1. Move an existing Open MPI installation to a new prefix: Set the OPAL_PREFIX environment variable before launching Open MPI. For example, if Open MPI had initially been installed to /opt/openmpi and the entire openmpi tree was later moved to /home/openmpi, setting OPAL_PREFIX to /home/openmpi will enable Open MPI to function properly (see the sketch after this list).
  2. "Stage" an Open MPI installation in a temporary location: When creating self-contained installation packages, systems such as RPM install Open MPI into temporary locations. The package system then bundles up everything under the temporary location into a package that can be installed into its real location later. For example, when creating an RPM that will be installed to /opt/openmpi, the RPM system will transparently prepend a "destination directory" (or "destdir") to the installation directory. As such, Open MPI will think that it is installed in /opt/openmpi, but it is actually temporarily installed in (for example) /var/rpm/build.1234/opt/openmpi. If it is necessary to use Open MPI while it is installed in this staging area, the OPAL_DESTDIR environment variable can be used; setting OPAL_DESTDIR to /var/rpm/build.1234 will automatically prefix every directory such that Open MPI can function properly.
  3. Overriding individual directories: Open MPI uses the GNU-specified directories (per Autoconf/Automake), and these can be overridden by setting environment variables directly related to their common names. The list of environment variables that can be used is:

    • OPAL_PREFIX
    • OPAL_EXEC_PREFIX
    • OPAL_BINDIR
    • OPAL_SBINDIR
    • OPAL_LIBEXECDIR
    • OPAL_DATAROOTDIR
    • OPAL_DATADIR
    • OPAL_SYSCONFDIR
    • OPAL_SHAREDSTATEDIR
    • OPAL_LOCALSTATEDIR
    • OPAL_LIBDIR
    • OPAL_INCLUDEDIR
    • OPAL_INFODIR
    • OPAL_MANDIR
    • OPAL_PKGDATADIR
    • OPAL_PKGLIBDIR
    • OPAL_PKGINCLUDEDIR

    Note that not all of the directories listed above are used by Open MPI; they are listed here in their entirety for completeness.

    Also note that several directories listed above are defined in terms of other directories. For example, the $bindir is defined by default as $prefix/bin. Hence, overriding the $prefix (via OPAL_PREFIX) will automatically change the first part of the $bindir (which is how method 1 described above works). Alternatively, OPAL_BINDIR can be set to an absolute value that ignores $prefix altogether.
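
As a minimal sketch of method 1 above (all paths are hypothetical):

# The tree was originally installed with a prefix of /opt/openmpi,
# but has since been moved to /home/me/openmpi
shell$ export OPAL_PREFIX=/home/me/openmpi
shell$ export PATH=/home/me/openmpi/bin:$PATH
shell$ export LD_LIBRARY_PATH=/home/me/openmpi/lib:$LD_LIBRARY_PATH
shell$ mpirun -np 4 my_parallel_application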


91. I'm still having problems / my problem is not listed here. What do I do?

Please see this FAQ category for troubleshooting tips and the Getting Help page -- it details how to send a request to the Open MPI mailing lists.


92. In general, how do I build MPI applications with Open MPI?

The Open MPI team strongly recommends that you simply use Open MPI's "wrapper" compilers to compile your MPI applications. That is, instead of using (for example) gcc to compile your program, use mpicc. Open MPI provides wrapper compilers for C, C++, and Fortran:

Language  Wrapper compiler name
C         mpicc
C++       mpiCC, mpicxx, or mpic++ (note that mpiCC will not exist on case-insensitive filesystems)
Fortran   mpifort (for v1.7 and above); mpif77 and mpif90 (for older versions)

Hence, if you expect to compile your program as:

shell$ gcc my_mpi_application.c -o my_mpi_application

Simply use the following instead:

shell$ mpicc my_mpi_application.c -o my_mpi_application

Note that Open MPI's wrapper compilers do not do any actual compiling or linking; all they do is manipulate the command line and add in all the relevant compiler / linker flags and then invoke the underlying compiler / linker (hence, the name "wrapper" compiler). More specifically, if you run into a compiler or linker error, check your source code and/or back-end compiler -- it is usually not the fault of the Open MPI wrapper compiler.


93. Wait -- what is mpifort? Shouldn't I use mpif77 and mpif90?

mpifort is a new name for the Fortran wrapper compiler that debuted in Open MPI v1.7.

It supports compiling all versions of Fortran, and utilizing all MPI Fortran interfaces (mpif.h, use mpi, and use mpi_f08). There is no need to distinguish between "Fortran 77" (which hasn't existed for 30+ years) and "Fortran 90" -- just use mpifort to compile all your Fortran MPI applications and don't worry about which dialect it is, nor which MPI Fortran interface it uses.
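
For example, compiling a Fortran MPI application with mpifort (a minimal sketch; the file name is just an illustration):

shell$ mpifort my_mpi_application.f90 -o my_mpi_application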

Other MPI implementations will also soon support a wrapper compiler named mpifort, so hopefully we can move the whole world to this simpler wrapper compiler name, and eliminate the use of mpif77 and mpif90.

Specifically: mpif77 and mpif90 are deprecated as of Open MPI v1.7. Although mpif77 and mpif90 still exist in Open MPI v1.7 for legacy reasons, they will likely be removed in some (undetermined) future release. It is in your interest to convert to mpifort now.

Also note that these names are literally just symbolic links to mpifort under the covers. So you're using mpifort whether you realize it or not. :-)

Basically, the 1980's called; they want their mpif77 wrapper compiler back. Let's let them have it.


94. I can't / don't want to use Open MPI's wrapper compilers. What do I do?

We repeat the above statement: the Open MPI Team strongly recommends that you use the wrapper compilers to compile and link MPI applications.

If you find yourself saying, "But I don't want to use wrapper compilers!", please humor us and try them. See if they work for you. Be sure to let us know if they do not work for you.

Many people base their "wrapper compilers suck!" mentality on bad behavior from poorly-implemented wrapper compilers in the mid-1990's. Things are much better these days; wrapper compilers can handle almost any situation, and are far more reliable than attempting to hard-code the Open MPI-specific compiler and linker flags yourself.

That being said, there are some -- very, very few -- situations where using wrapper compilers can be problematic -- such as nesting multiple wrapper compilers of multiple projects. Hence, Open MPI provides a workaround to find out what command line flags you need to compile MPI applications. There are generally two sets of flags that you need: compile flags and link flags.

# Show the flags necessary to compile MPI C applications
shell$ mpicc --showme:compile

# Show the flags necessary to link MPI C applications
shell$ mpicc --showme:link

The --showme:* flags work with all Open MPI wrapper compilers (specifically: mpicc, mpiCC / mpicxx / mpic++, mpifort, and if you really must use them, mpif77, mpif90).

Hence, if you need to use a compiler other than Open MPI's wrapper compilers, we advise you to run the appropriate Open MPI wrapper compiler with the --showme flags to see what Open MPI needs to compile / link, and then use those flags with your compiler.
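
For example, a hedged sketch of feeding the wrapper's flags to a different compiler (here, othercc is a placeholder for whatever compiler you need to use):

shell$ othercc $(mpicc --showme:compile) my_mpi_application.c \
    $(mpicc --showme:link) -o my_mpi_application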

NOTE: It is absolutely not sufficient to simply add "-lmpi" to your link line and assume that you will obtain a valid Open MPI executable.

NOTE: It is almost never a good idea to hard-code these results in a Makefile (or other build system). It is almost always best to run (for example) "mpicc --showme:compile" in a dynamic fashion to find out what you need. For example, GNU Make allows running commands and assigning their results to variables:

MPI_COMPILE_FLAGS = $(shell mpicc --showme:compile)
MPI_LINK_FLAGS = $(shell mpicc --showme:link)

my_app: my_app.c
        $(CC) $(MPI_COMPILE_FLAGS) my_app.c $(MPI_LINK_FLAGS) -o my_app


95. How do I override the flags specified by Open MPI's wrapper compilers? (v1.0 series)

NOTE: This answer applies to the v1.0 series of Open MPI only. If you are using a later series, please see this FAQ entry.

The wrapper compilers each construct command lines in the following form:

<compiler> <xCPPFLAGS> <xFLAGS> user_arguments <xLDFLAGS> <xLIBS>

Where <compiler> is replaced by the default back-end compiler for each language, and "x" is customized for each language (i.e., C, C++, F77, and F90).

By setting appropriate environment variables, a user can override default values used by the wrapper compilers. The list below shows the variables for each of the wrapper compilers; the Generic set applies to any wrapper compiler if the corresponding wrapper-specific variable is not set. For example, the value of $OMPI_LDFLAGS will be used with mpicc only if $OMPI_MPICC_LDFLAGS is not set.

  • Generic (any wrapper): preprocessor flags OMPI_CPPFLAGS, OMPI_CXXPPFLAGS, OMPI_F77PPFLAGS, OMPI_F90PPFLAGS; compiler flags OMPI_CFLAGS, OMPI_CXXFLAGS, OMPI_F77FLAGS, OMPI_F90FLAGS; linker flags OMPI_LDFLAGS; linker library flags OMPI_LIBS
  • mpicc: compiler OMPI_MPICC; preprocessor flags OMPI_MPICC_CPPFLAGS; compiler flags OMPI_MPICC_CFLAGS; linker flags OMPI_MPICC_LDFLAGS; linker library flags OMPI_MPICC_LIBS
  • mpicxx: compiler OMPI_MPICXX; preprocessor flags OMPI_MPICXX_CXXPPFLAGS; compiler flags OMPI_MPICXX_CXXFLAGS; linker flags OMPI_MPICXX_LDFLAGS; linker library flags OMPI_MPICXX_LIBS
  • mpif77: compiler OMPI_MPIF77; preprocessor flags OMPI_MPIF77_F77PPFLAGS; compiler flags OMPI_MPIF77_F77FLAGS; linker flags OMPI_MPIF77_LDFLAGS; linker library flags OMPI_MPIF77_LIBS
  • mpif90: compiler OMPI_MPIF90; preprocessor flags OMPI_MPIF90_F90PPFLAGS; compiler flags OMPI_MPIF90_F90FLAGS; linker flags OMPI_MPIF90_LDFLAGS; linker library flags OMPI_MPIF90_LIBS

NOTE: If you set a variable listed above, Open MPI will entirely replace the default value that was originally there. Hence, it is advisable to only replace these values when absolutely necessary.


96. How do I override the flags specified by Open MPI's wrapper compilers? (v1.1 series and beyond)

NOTE: This answer applies to the v1.1 and later series of Open MPI only. If you are using the v1.0 series, please see this FAQ entry.

The Open MPI wrapper compilers are driven by text files that contain, among other things, the flags that are passed to the underlying compiler. These text files are generated automatically for Open MPI and are customized for the compiler set that was selected when Open MPI was configured; it is not recommended that users edit these files.

Note that changing the underlying compiler may not work at all. For example, C++ and Fortran compilers are notoriously binary incompatible with each other (sometimes even within multiple releases of the same compiler). If you compile/install Open MPI with C++ compiler XYZ and then use the OMPI_CXX environment variable to change the mpicxx wrapper compiler to use the ABC C++ compiler, your application code may not compile and/or link. The traditional method of using multiple different compilers with Open MPI is to install Open MPI multiple times; each installation should be built/installed with a different compiler. This is annoying, but it is beyond the scope of Open MPI to be able to fix.

However, there are cases where it may be necessary or desirable to edit these files and add to or subtract from the flags that Open MPI selected. These files are installed in $pkgdatadir (which defaults to $prefix/share/openmpi) and are named <wrapper_name>-wrapper-data.txt. A few environment variables are available for run-time replacement of the wrapper's default values (from the text files):

Wrapper Compiler Compiler Preprocessor Flags Compiler Flags Linker Flags Linker Library Flags Data File
Open MPI wrapper compilers
mpicc OMPI_CC OMPI_CPPFLAGS OMPI_CFLAGS OMPI_LDFLAGS OMPI_LIBS mpicc-wrapper-data.txt
mpic++ OMPI_CXX OMPI_CPPFLAGS OMPI_CXXFLAGS OMPI_LDFLAGS OMPI_LIBS mpic++-wrapper-data.txt
mpiCC OMPI_CXX OMPI_CPPFLAGS OMPI_CXXFLAGS OMPI_LDFLAGS OMPI_LIBS mpiCC-wrapper-data.txt
mpifort OMPI_FC OMPI_CPPFLAGS OMPI_FCFLAGS OMPI_LDFLAGS OMPI_LIBS mpifort-wrapper-data.txt
mpif77 (deprecated as of v1.7) OMPI_F77 OMPI_CPPFLAGS OMPI_FFLAGS OMPI_LDFLAGS OMPI_LIBS mpif77-wrapper-data.txt
mpif90 (deprecated as of v1.7) OMPI_FC OMPI_CPPFLAGS OMPI_FCFLAGS OMPI_LDFLAGS OMPI_LIBS mpif90-wrapper-data.txt
OpenRTE wrapper compilers
ortecc ORTE_CC ORTE_CPPFLAGS ORTE_CFLAGS ORTE_LDFLAGS ORTE_LIBS ortecc-wrapper-data.txt
ortec++ ORTE_CXX ORTE_CPPFLAGS ORTE_CXXFLAGS ORTE_LDFLAGS ORTE_LIBS ortec++-wrapper-data.txt
OPAL wrapper compilers
opalcc OPAL_CC OPAL_CPPFLAGS OPAL_CFLAGS OPAL_LDFLAGS OPAL_LIBS opalcc-wrapper-data.txt
opalc++ OPAL_CXX OPAL_CPPFLAGS OPAL_CXXFLAGS OPAL_LDFLAGS OPAL_LIBS opalc++-wrapper-data.txt
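
For example, a minimal sketch of overriding the underlying compiler at run time using the environment variables from the table above (the compiler name clang is just an illustration, and may not be binary compatible with your Open MPI build -- see the caveat above):

shell$ OMPI_CC=clang mpicc my_mpi_application.c -o my_mpi_application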

Note that the values of these fields can be directly influenced by passing flags to Open MPI's configure script. The following options are available to configure:

  • --with-wrapper-cflags: Extra flags to add to CFLAGS when using mpicc.

  • --with-wrapper-cxxflags: Extra flags to add to CXXFLAGS when using mpiCC.

  • --with-wrapper-fflags: Extra flags to add to FFLAGS when using mpif77 (this option has disappeared in Open MPI 1.7 and will not return; see this FAQ entry for more details).

  • --with-wrapper-fcflags: Extra flags to add to FCFLAGS when using mpif90 and mpifort.

  • --with-wrapper-ldflags: Extra flags to add to LDFLAGS when using any of the wrapper compilers.

  • --with-wrapper-libs: Extra flags to add to LIBS when using any of the wrapper compilers.

The files cited in the above table are fairly simplistic "key=value" data formats. The following are several fields that are likely to be interesting for end-users:

  • project_short: Prefix for all environment variables. See below.
  • compiler_env: Specifies the base name of the environment variable that can be used to override the wrapper's underlying compiler at run-time. The full name of the environment variable is of the form <project_short>_<compiler_env>; see table above.
  • compiler_flags_env: Specifies the base name of the environment variable that can be used to override the wrapper's compiler flags at run-time. The full name of the environment variable is of the form <project_short>_<compiler_flags_env>; see table above.
  • compiler: The executable name of the underlying compiler.
  • extra_includes: Relative to $installdir, a list of directories to also list in the preprocessor flags to find header files.
  • preprocessor_flags: A list of flags passed to the preprocessor.
  • compiler_flags: A list of flags passed to the compiler.
  • linker_flags: A list of flags passed to the linker.
  • libs: A list of libraries passed to the linker.
  • required_file: If non-empty, check for the presence of this file before continuing. If the file is not there, the wrapper will abort saying that the language is not supported.
  • includedir: Directory containing Open MPI's header files. The proper compiler "include" flag is prepended to this directory and added into the preprocessor flags.
  • libdir: Directory containing Open MPI's library files. The proper compiler library-path flag (e.g., -L) is prepended to this directory and added into the linker flags.
  • module_option: This field only appears in mpif90. It is the flag that the Fortran 90 compiler requires to declare where module files are located.
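
As a purely illustrative sketch, a wrapper data file might contain entries along these lines (every value below is hypothetical and will differ on your system; consult the actual installed file):

# hypothetical excerpt of a <wrapper_name>-wrapper-data.txt file
project_short=OMPI
compiler_env=CC
compiler_flags_env=CFLAGS
compiler=gcc
extra_includes=openmpi
preprocessor_flags=
compiler_flags=-pthread
linker_flags=
libs=-lmpi -lopen-rte -lopen-pal -lm -ldl
required_file=
includedir=/opt/openmpi/include
libdir=/opt/openmpi/lib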


97. How can I tell what the wrapper compiler default flags are?

If the corresponding environment variables are not set, the wrappers will add -I$includedir and -I$includedir/openmpi (which usually map to $prefix/include and $prefix/include/openmpi, respectively) to the xFLAGS area, and add -L$libdir (which usually maps to $prefix/lib) to the xLDFLAGS area.

To obtain the values of the other flags, there are two main methods:

  1. Use the --showme option to any wrapper compiler. For example (lines broken here for readability):

    shell$ mpicc prog.c -o prog --showme
    gcc -I/path/to/openmpi/include -I/path/to/openmpi/include/openmpi/ompi \
    prog.c -o prog -L/path/to/openmpi/lib -lmpi \
    -lopen-rte -lopen-pal -lutil -lnsl -ldl -Wl,--export-dynamic -lm
    

    This shows a coarse-grained method for getting the entire command line, but does not tell you what each set of flags are (xFLAGS, xCPPFLAGS, xLDFLAGS, and xLIBS).

  2. Use the ompi_info command. For example:

    shell$ ompi_info --all | grep wrapper
       Wrapper extra CFLAGS:
     Wrapper extra CXXFLAGS:
       Wrapper extra FFLAGS:
      Wrapper extra FCFLAGS:
      Wrapper extra LDFLAGS: 
         Wrapper extra LIBS: -lutil -lnsl -ldl -Wl,--export-dynamic -lm
    

    This installation is only adding options in the xLIBS areas of the wrapper compilers; all other values are blank (remember: the -I's and -L's are implicit).

    Note that the --parsable option can be used to obtain machine-parsable versions of this output. For example:

    shell$ ompi_info --all --parsable | grep wrapper:extra
    option:wrapper:extra_cflags:
    option:wrapper:extra_cxxflags:
    option:wrapper:extra_fflags:
    option:wrapper:extra_fcflags:
    option:wrapper:extra_ldflags:
    option:wrapper:extra_libs:-lutil -lnsl  -ldl  -Wl,--export-dynamic -lm
    


98. Why does "mpicc --showme <some flags>" not show any MPI-relevant flags?

The output of commands similar to the following may be somewhat surprising:

shell$ mpicc -g --showme
gcc -g
shell$

Where are all the MPI-related flags, such as the necessary -I, -L, and -l flags?

The short answer is that these flags are not included in the wrapper compiler's underlying command line unless the wrapper compiler sees a filename argument. Specifically (output artificially wrapped below for readability):

shell$ mpicc -g --showme
gcc -g
shell$ mpicc -g foo.c --showme
gcc -I/opt/openmpi/include/openmpi -I/opt/openmpi/include -g foo.c
-Wl,-u,_munmap -Wl,-multiply_defined,suppress -L/opt/openmpi/lib -lmpi
-lopen-rte -lopen-pal -ldl

The second command had the filename "foo.c" in it, so the wrapper added all the relevant flags. This is specifically designed to allow usage such as the following:

shell$ mpicc --version --showme
gcc --version
shell$ mpicc --version
i686-apple-darwin8-gcc-4.0.1 (GCC) 4.0.1 (Apple Computer, Inc. build 5363)
Copyright (C) 2005 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

shell$

That is, the wrapper compiler does not behave differently when constructing the underlying command line if "--showme" is used or not. The only difference is whether the resulting command line is displayed or executed.

Hence, this behavior allows users to pass arguments to the underlying compiler without intending to actually compile or link (such as passing --version to query the underlying compiler's version). If the wrapper compilers added more flags in these cases, some underlying compilers would emit warnings.


99. Are there ways to just add flags to the wrapper compilers?

Yes!

Open MPI's configure script allows you to add command line flags to the wrappers on a permanent basis. The following configure options are available (they are described in more detail in this FAQ entry):

  • --with-wrapper-cflags
  • --with-wrapper-cxxflags
  • --with-wrapper-fflags
  • --with-wrapper-fcflags
  • --with-wrapper-ldflags
  • --with-wrapper-libs

These configure options can be handy if you have some optional compiler/linker flags that you need both Open MPI and all MPI applications to be compiled with. Rather than trying to get all your users to remember to pass the extra flags to the compiler when compiling their applications, you can specify them with the configure options shown above, thereby silently including them in the Open MPI wrapper compilers -- your users will therefore be using the correct flags without ever knowing it.
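
For example, a hedged sketch that bakes a run-time library search path into every wrapper compiler (the path is purely illustrative; see the next FAQ entry for why this is not done by default):

shell$ ./configure --with-wrapper-ldflags="-Wl,-rpath,/opt/openmpi/lib" [...your other configure arguments...]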


100. Why don't the wrapper compilers add "-rpath" (or similar) flags by default?

The default installation of Open MPI tries very hard to not include any non-essential flags in the wrapper compilers. This is the most conservative setting and allows the greatest flexibility for end-users. If the wrapper compilers started adding flags to support specific features (such as run-time locations for finding the Open MPI libraries), such flags -- no matter how useful to some portion of users -- would almost certainly break assumptions and functionality for other users.

As a workaround, Open MPI provides several mechanisms for users to manually override the flags in the wrapper compilers:

  1. First and simplest, you can add your own flags to the wrapper compiler command line by simply listing them on the command line. For example:

    shell$ mpicc my_mpi_application.c -o my_mpi_application -rpath /path/to/openmpi/install/lib
    

  2. Use the --showme options to the wrapper compilers to dynamically see what flags the wrappers are adding, and modify them as appropriate. See this FAQ entry for more details.
  3. Use environment variables to override the arguments that the wrappers insert. If you are using Open MPI 1.0.x, see this FAQ entry; otherwise, see this FAQ entry.
  4. If you are using Open MPI 1.1 or later, you can modify the text files that provide the system-wide default flags for the wrapper compilers. See this FAQ entry for more details.
  5. If you are using Open MPI 1.1 or later, you can pass additional flags into the system-wide wrapper compiler default flags through Open MPI's configure script. See this FAQ entry for more details.

You can use one or more of these methods to insert your own flags (such as "-rpath" or similar).


101. Can I build 100% static MPI applications?

Fully static linking is not for the weak, and it is not recommended. But it is possible, with some caveats.

  1. You must have static libraries available for everything that your program links to. This includes Open MPI; you must have used the --enable-static option to Open MPI's configure or otherwise have available the static versions of the Open MPI libraries (note that Open MPI static builds default to including all of its plugins in its libraries -- as opposed to having each plugin in its own dynamic shared object file. So all of Open MPI's code will be contained in the static libraries -- even what are normally contained in Open MPI's plugins). Note that some popular Linux libraries do not have static versions by default (e.g., libnuma), or require additional RPMs to be installed to get the equivalent libraries.
  2. Open MPI must have been built without a memory manager. This means that Open MPI must have been configured with the --without-memory-manager flag. This is irrelevant on some platforms for which Open MPI does not have a memory manager, but on some platforms it is necessary (Linux). It is harmless to use this flag on platforms where Open MPI does not have a memory manager. Not having a memory manager means that Open MPI's mpi_leave_pinned behavior for OS-bypass networks such as InfiniBand will not work.
  3. On some systems (Linux), you may see linker warnings about some files requiring dynamic libraries for functions such as gethostname and dlopen. These are ok, but do mean that you need to have the shared libraries installed. You can disable all of Open MPI's dlopen behavior (i.e., prevent it from trying to open any plugins) by specifying the --disable-dlopen flag to Open MPI's configure script. This will eliminate the linker warnings about dlopen.

For example, this is how to configure Open MPI to build static libraries on Linux:

shell$ ./configure --without-memory-manager --without-libnuma \
  --enable-static [...your other configure arguments...]

Some systems may have additional constraints about their support libraries that require additional steps to produce working 100% static MPI applications. For example, the libibverbs support library from OpenIB / OFED has its own plugin system (which, by default, won't work with an otherwise-static application); MPI applications need additional compiler/linker flags to be specified to create a working 100% static MPI application. See this FAQ entry for the details.


102. Can I build 100% static OpenFabrics / OpenIB / OFED MPI applications on Linux?

Fully static linking is not for the weak, and it is not recommended. But it is possible. First, you must read this FAQ entry.

For an OpenFabrics / OpenIB / OFED application to be built statically, you must have libibverbs v1.0.4 or later (v1.0.4 was released after OFED 1.1, so if you have OFED 1.1, you will manually need to upgrade your libibverbs). Both libibverbs and your verbs hardware plugin must be available in static form.

Once all of that has been setup, run the following (artificially wrapped sample output shown below -- your output may be slightly different):

shell$ mpicc your_app.c -o your_app --showme
gcc -I/opt/openmpi/include/openmpi \
-I/opt/openmpi/include -pthread ring.c -o ring \
-L/usr/local/ofed/lib -L/usr/local/ofed/lib64/infiniband \
-L/usr/local/ofed/lib64 -L/opt/openmpi/lib -lmpi -lopen-rte \
-lopen-pal -libverbs -lrt -Wl,--export-dynamic -lnsl -lutil -lm -ldl

(or use whatever wrapper compiler is relevant -- the --showme flag is the important part here)

This example shows the steps for the GNU compiler suite, but other compilers will be similar. This example also assumes that the OpenFabrics / OpenIB / OFED install was rooted at /usr/local/ofed; some distributions install under /usr/ofed (or elsewhere). Finally, some installations use the library directory "lib64" while others use "lib". Adjust your directory names as appropriate.

Take the output from the above command and run it manually to compile and link your application, adding the following highlighted arguments:

shell$ gcc -static -I/opt/openmpi/include/openmpi \
  -I/opt/openmpi/include -pthread ring.c -o ring \
  -L/usr/local/ofed/lib -L/usr/local/ofed/lib64/infiniband \
  -L/usr/local/ofed/lib64 -L/opt/openmpi/lib -lmpi -lopen-rte \
  -lopen-pal -Wl,--whole-archive -libverbs /usr/local/ofed/lib64/infiniband/mthca.a \
  -Wl,--no-whole-archive -lrt -Wl,--export-dynamic -lnsl -lutil \
  -lm -ldl

Note that the mthca.a file is the verbs plugin for Mellanox HCAs. If you have an HCA from a different vendor (such as IBM or QLogic), use the appropriate filename (look in $ofed_libdir/infiniband for verbs plugin files for your hardware).

Specifically, these added arguments do the following:

  • -static: Tell the linker to generate a static executable.
  • -Wl,--whole-archive: Tell the linker to include the entire ibverbs library in the executable.
  • $ofed_root/lib64/infiniband/mthca.a: Include the Mellanox verbs plugin in the executable.
  • -Wl,--no-whole-archive: Tell the linker to return to the default of not including entire libraries in the executable.

You can either add these arguments in manually, or you can see this FAQ entry to modify the default behavior of the wrapper compilers to hide this complexity from end users (but be aware that if you modify the wrapper compilers' default behavior, all users will be creating static applications!).


103. Why does it take soooo long to compile F90 MPI applications?

NOTE: Starting with Open MPI v1.7, if you are not using gfortran, building the Fortran 90 and '08 bindings does not suffer the same performance penalty that previous versions incurred. The Open MPI developers encourage all users to upgrade to the new Fortran bindings implementation -- including the new MPI-3 Fortran'08 bindings -- when possible.

This is unfortunately due to a design flaw in the MPI F90 bindings themselves.

The answer to this question is exactly the same as it is for why it takes so long to compile the MPI F90 bindings in the Open MPI implementation; please see this FAQ entry for the details.


104. How do I build BLACS with Open MPI?

The blacs_install.ps file (available from that web site) describes how to build BLACS, so we won't repeat much of it here (especially since it might change in future versions). These instructions only pertain to making Open MPI work correctly with BLACS.

After selecting the appropriate starting Bmake.inc, make the following changes to Sections 1, 2, and 3. The example below is from the Bmake.MPI-SUN4SOL2; your Bmake.inc file may be different.

# Section 1:
# Ensure to use MPI for the communication layer

   COMMLIB = MPI

# The MPIINCdir macro is used to link in mpif.h and
# must contain the location of Open MPI's mpif.h.  
# The MPILIBdir and MPILIB macros are irrelevant 
# and should be left empty.

   MPIdir = /path/to/openmpi-1.8.1
   MPILIBdir =
   MPIINCdir = $(MPIdir)/include
   MPILIB =

# Section 2:
# Set these values:

   SYSINC =
   INTFACE = -Df77IsF2C
   SENDIS =
   BUFF =
   TRANSCOMM = -DUseMpi2
   WHATMPI =
   SYSERRORS =

# Section 3:
# You may need to specify the full path to
# mpif77 / mpicc if they aren't already in
# your path.
 
   F77            = mpif77
   F77LOADFLAGS   = 

   CC             = mpicc
   CCLOADFLAGS    = 

The remainder of the values are fairly obvious and irrelevant to Open MPI; you can set whatever optimization level you want, etc.

If you follow the rest of the instructions for building, BLACS will build correctly and use Open MPI as its MPI communication layer.


105. How do I build ScaLAPACK with Open MPI?

The scalapack_install.ps file (available from that web site) describes how to build ScaLAPACK, so we won't repeat much of it here (especially since it might change in future versions). These instructions only pertain to making Open MPI work correctly with ScaLAPACK. These instructions assume that you have built and installed BLACS with Open MPI.

# Make sure you follow the instructions to build BLACS with Open MPI,
# and put its location in the following.

   BLACSdir      = <path where you installed BLACS>

# The MPI section is commented out.  Uncomment it. The wrapper
# compiler will handle SMPLIB, so make it blank. The rest are correct
# as is.

   USEMPI        = -DUsingMpiBlacs
   SMPLIB        = 
   BLACSFINIT    = $(BLACSdir)/blacsF77init_MPI-$(PLAT)-$(BLACSDBGLVL).a
   BLACSCINIT    = $(BLACSdir)/blacsCinit_MPI-$(PLAT)-$(BLACSDBGLVL).a
   BLACSLIB      = $(BLACSdir)/blacs_MPI-$(PLAT)-$(BLACSDBGLVL).a
   TESTINGdir    = $(home)/TESTING

# The PVMBLACS setup needs to be commented out.

   #USEMPI        =
   #SMPLIB        = $(PVM_ROOT)/lib/$(PLAT)/libpvm3.a -lnsl -lsocket
   #BLACSFINIT    =
   #BLACSCINIT    =
   #BLACSLIB      = $(BLACSdir)/blacs_PVM-$(PLAT)-$(BLACSDBGLVL).a
   #TESTINGdir    = $(HOME)/pvm3/bin/$(PLAT)

# Make sure that the BLASLIB points to the right place.  We built this
# example on Solaris, hence the name below.  The Linux version of the
# library (as of this writing) is blas_LINUX.a.

   BLASLIB       = $(LAPACKdir)/blas_solaris.a

# You may need to specify the full path to mpif77 / mpicc if they
# aren't already in your path.
 
   F77            = mpif77
   F77LOADFLAGS   = 

   CC             = mpicc
   CCLOADFLAGS    = 

The remainder of the values are fairly obvious and irrelevant to Open MPI; you can set whatever optimization level you want, etc.

If you follow the rest of the instructions for building, ScaLAPACK will build correctly and use Open MPI as its MPI communication layer.


106. How do I build PETSc with Open MPI?

The only special configuration that you need to build PETSc is to ensure that Open MPI's wrapper compilers (i.e., mpicc and mpif77) are in your $PATH before running the PETSc configure.py script.

PETSc should then automatically find Open MPI's wrapper compilers and correctly build itself using Open MPI.


107. How do I build VASP with Open MPI?

The following was reported by an Open MPI user who was able to successfully build and run VASP with Open MPI:

I just compiled the latest VASP v4.6 using Open MPI v1.2.1, ifort v9.1, ACML v3.6.0, BLACS with patch-03 and Scalapack v1.7.5 built with ACML.

I configured Open MPI with --enable-static flag.

I used the VASP supplied makefile.linux_ifc_opt and only corrected the paths to the ACML, scalapack, and BLACS dirs (I didn't lower the optimization to -O0 for mpi.f like I suggested before). The -D's are standard except I get a little better performance with -DscaLAPACK (I tested it with out this option too):

CPP    = $(CPP_) -DMPI  -DHOST="LinuxIFC" -DIFC \
     -Dkind8 -DNGZhalf -DCACHE_SIZE=4000 -DPGF90 -Davoidalloc \
     -DMPI_BLOCK=2000  \
     -Duse_cray_ptr -DscaLAPACK

Also, Blacs and Scalapack used the -D's suggested in the Open MPI FAQ.


108. Are other language / application bindings available for Open MPI?

Other MPI language bindings and application-level programming interfaces have been written by third parties. Here are links to some of the available packages:

...we used to maintain a list of links here. But the list changes over time; projects come, and projects go. Your best bet these days is simply to use Google to find MPI bindings and application-level programming interfaces.


109. What pre-requisites are necessary for running an Open MPI job?

In general, Open MPI requires that its executables are in your PATH on every node on which you will run. Additionally, if Open MPI was compiled as dynamic libraries (which is the default), the directory where its libraries are located must be in your LD_LIBRARY_PATH on every node.

Specifically, if Open MPI was installed with a prefix of /opt/openmpi, then the following should be in your PATH and LD_LIBRARY_PATH:

PATH:            /opt/openmpi/bin
LD_LIBRARY_PATH: /opt/openmpi/lib

Depending on your environment, you may need to set these values in your shell startup files (e.g., .profile, .cshrc, etc.).

NOTE: there are exceptions to this rule -- notably the --prefix option to mpirun.

See this FAQ entry for more details on how to add Open MPI to your PATH and LD_LIBRARY_PATH.

Additionally, Open MPI requires that jobs can be started on remote nodes without any input from the keyboard. For example, if using rsh or ssh as the remote agent, you must have your environment setup to allow execution on remote nodes without entering a password or passphrase.


110. What ABI guarantees does Open MPI provide?

Open MPI's versioning and ABI scheme is described here; it is summarized in this FAQ entry for convenience.

Open MPI provided forward application binary interface (ABI) compatibility for MPI applications starting with v1.3.2. Prior to that version, no ABI guarantees were provided.

NOTE: Prior to v1.3.2, subtle and strange failures are almost guaranteed to occur if applications were compiled and linked against shared libraries from one version of Open MPI and then run with another. The Open MPI team strongly discourages making any ABI assumptions before v1.3.2.

NOTE: ABI for the "use mpi" Fortran interface was inadvertantly broken in the v1.6.3 release, and was restored in the v1.6.4 release. Any Fortran applications that utilize the "use mpi" MPI interface that were compiled and linked against the v1.6.3 release will not be link-time compatible with other releases in the 1.5.x / 1.6.x series. Such applications remain source compatible, however, and can be recompiled/re-linked with other Open MPI releases.

Starting with v1.3.2, Open MPI provides forward ABI compatibility -- with respect to the MPI API only -- in all versions of a given feature release series and its corresponding super stable series. For example, on a single platform, an MPI application linked against Open MPI v1.3.2 shared libraries can be updated to point to the shared libraries in any successive v1.3.x or v1.4 release and still work properly (e.g., via the LD_LIBRARY_PATH environment variable or other operating system mechanism).

For the v1.5 series, this means that all releases of v1.5.x and v1.6.x will be ABI compatible, per the above definition.

Open MPI reserves the right to break ABI compatibility at new feature release series. For example, the same MPI application from above (linked against Open MPI v1.3.2 shared libraries) will not work with Open MPI v1.5 shared libraries. Similarly, MPI applications compiled/linked against Open MPI 1.6.x will not be ABI compatible with Open MPI 1.7.x.


111. Do I need a common filesystem on all my nodes?

No, but it certainly makes life easier if you do.

A common environment in which to run Open MPI is a "Beowulf"-class or similar cluster (e.g., a bunch of 1U servers in a bunch of racks). Simply stated, Open MPI can run on a group of servers or workstations connected by a network. As mentioned above, there are several prerequisites, however (for example, you typically must have an account on all the machines, you must be able to rsh or ssh between the nodes without using a password, etc.).

Regardless of whether Open MPI is installed on a shared / networked filesystem or independently on each node, it is usually easiest if Open MPI is available in the same filesystem location on every node. For example, if you install Open MPI to /opt/openmpi-1.8.1 on one node, ensure that it is available in /opt/openmpi-1.8.1 on all nodes.

This FAQ entry has a bunch more information about installation locations for Open MPI.


112. How do I add Open MPI to my PATH and LD_LIBRARY_PATH?

Open MPI must be able to find its executables in your PATH on every node (if Open MPI was compiled as dynamic libraries, then its library path must appear in LD_LIBRARY_PATH as well). As such, your configuration/initialization files need to add Open MPI to your PATH / LD_LIBRARY_PATH properly.

How to do this may be highly dependent upon your local configuration, so you may need to consult with your local system administrator. Some system administrators take care of these details for you, some don't. YMMV. Some common examples are included below, however.

You must have at least a minimum understanding of how your shell works to get Open MPI in your PATH / LD_LIBRARY_PATH properly. Note that Open MPI must be added to your PATH and LD_LIBRARY_PATH in two situations: (1) when you log in to an interactive shell, and (2) when you log in to non-interactive shells on remote nodes.

  • If (1) is not configured properly, executables like mpicc will not be found, and it is typically obvious what is wrong. The Open MPI executable directory can manually be added to the PATH, or the user's startup files can be modified such that the Open MPI executables are added to the PATH at every login. This latter approach is preferred.

    All shells have some kind of script file that is executed at login time to set things like PATH and LD_LIBRARY_PATH and perform other environmental setup tasks. This startup file is the one that needs to be edited to add Open MPI to the PATH and LD_LIBRARY_PATH. Consult the manual page for your shell for specific details (some shells are picky about the permissions of the startup file, for example). The table below lists some common shells and the startup files that they read/execute upon login:

    Shell Interactive login startup file
    sh (Bourne shell, or bash named "sh") .profile
    csh .cshrc followed by .login
    tcsh .tcshrc if it exists, .cshrc if it does not, followed by .login
    bash .bash_profile if it exists, or .bash_login if it exists, or .profile if it exists (in that order). Note that some Linux distributions automatically come with .bash_profile scripts for users that automatically execute .bashrc as well. Consult the bash man page for more information.

  • If (2) is not configured properly, executables like mpirun will not function properly, and it can be somewhat confusing to figure out (particularly for bash users).

    The startup files in question here are the ones that are automatically executed for a non-interactive login on a remote node (e.g., "rsh othernode ps"). Note that not all shells support this, and that some shells use different files for this than listed in (1). Some shells will supersede (2) with (1). That is, fulfilling (2) may automatically fulfill (1). The following table lists some common shells and the startup file that is automatically executed, either by Open MPI or by the shell itself:

    Shell Non-interactive login startup file
    sh (Bourne or bash named "sh") This shell does not execute any file automatically, so Open MPI will execute the .profile script before invoking Open MPI executables on remote nodes
    csh .cshrc
    tcsh .tcshrc if it exists, or .cshrc if it does not
    bash .bashrc if it exists
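
For example, here is a minimal sketch of the lines that a bash user might add to their startup file (the installation prefix /opt/openmpi is just an illustration; use your actual installation location):

# Add Open MPI's executables and libraries to the search paths
export PATH=/opt/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH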


113. What if I can't modify my PATH and/or LD_LIBRARY_PATH?

There are some situations where you cannot modify the PATH or LD_LIBRARY_PATH -- e.g., some ISV applications prefer to hide all parallelism from the user, and therefore do not want to make the user modify their shell startup files. Another case is where you want a single user to be able to launch multiple MPI jobs simultaneously, each with a different MPI implementation. Hence, setting shell startup files to point to one MPI implementation would be problematic.

In such cases, you have two options:

  1. Use mpirun's --prefix command line option (described below).
  2. Modify the wrapper compilers to include directives to include run-time search locations for the Open MPI libraries (see this FAQ entry)

mpirun's --prefix command line option takes as an argument the top-level directory where Open MPI was installed. While relative directory names are possible, they can become ambiguous depending on the job launcher used; using absolute directory names is strongly recommended.

For example, say that Open MPI was installed into /opt/openmpi-1.8.1. You would use the --prefix option like this:

shell$ mpirun --prefix /opt/openmpi-1.8.1 -np 4 a.out

This will prefix the PATH and LD_LIBRARY_PATH on both the local and remote hosts with /opt/openmpi-1.8.1/bin and /opt/openmpi-1.8.1/lib, respectively. This is usually unnecessary when using resource managers to launch jobs (e.g., SLURM, Torque, etc.) because they tend to copy the entire local environment -- to include the PATH and LD_LIBRARY_PATH -- to remote nodes before execution. As such, if PATH and LD_LIBRARY_PATH are set properly on the local node, the resource manager will automatically propagate those values out to remote nodes. The --prefix option is therefore usually most useful in rsh or ssh-based environments (or similar).

Beginning with the 1.2 series, it is possible to make this the default behavior by passing to configure the flag --enable-mpirun-prefix-by-default. This will make mpirun behave exactly the same as "mpirun --prefix $prefix ...", where $prefix is the value given to --prefix in configure.
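
For example, a minimal sketch of configuring with this behavior enabled (the prefix is just an illustration):

shell$ ./configure --prefix=/opt/openmpi-1.8.1 --enable-mpirun-prefix-by-default [...your other configure arguments...]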

Finally, note that specifying the absolute pathname to mpirun is equivalent to using the --prefix argument. For example, the following is equivalent to the above command line that uses --prefix:

shell$ /opt/openmpi-1.8.1/bin/mpirun -np 4 a.out


114. How do I launch Open MPI parallel jobs?

Similar to many MPI implementations, Open MPI provides the commands mpirun and mpiexec to launch MPI jobs. Several of the questions in this FAQ category deal with using these commands.

Note, however, that these commands are exactly identical. Specifically, they are symbolic links to a common back-end launcher command named orterun (Open MPI's run-time environment interaction layer is named the Open Run-Time Environment, or ORTE -- hence orterun).

As such, the rest of this FAQ usually refers only to mpirun, even though the same discussions also apply to mpiexec and orterun (because they are all, in fact, the same command).


115. How do I run a simple SPMD MPI job?

Open MPI provides both mpirun and mpiexec commands. A simple way to start a single program, multiple data (SPMD) application in parallel is:

shell$ mpirun -np 4 my_parallel_application

This starts a four-process parallel application, running four copies of the executable named my_parallel_application.

The rsh starter component accepts the --hostfile (also known as --machinefile) option to indicate which hosts to start the processes on:

shell$ cat my_hostfile
host01.example.com
host02.example.com
shell$ mpirun --hostfile my_hostfile -np 4 my_parallel_application

This command will launch one copy of my_parallel_application on each of host01.example.com and host02.example.com.

More information about the --hostfile option, and hostfiles in general, is available in this FAQ entry.

Note, however, that not all environments require a hostfile. For example, Open MPI will automatically detect when it is running in batch / scheduled environments (such as SGE, PBS/Torque, SLURM, and LoadLeveler), and will use host information provided by those systems.

Also note that if using a launcher that requires a hostfile and no hostfile is specified, all processes are launched on the local host.


116. How do I run an MPMD MPI job?

Both the mpirun and mpiexec commands support multiple program, multiple data (MPMD) style launches, either from the command line or from a file. For example:

shell$ mpirun -np 2 a.out : -np 2 b.out

This will launch a single parallel application, but the first two processes will be instances of the a.out executable, and the second two processes will be instances of the b.out executable. In MPI terms, this will be a single MPI_COMM_WORLD, but the a.out processes will be ranks 0 and 1 in MPI_COMM_WORLD, while the b.out processes will be ranks 2 and 3 in MPI_COMM_WORLD.

mpirun (and mpiexec) can also accept a parallel application specified in a file instead of on the command line. For example:

shell$ mpirun --app my_appfile

where the file my_appfile contains the following:

# Comments are supported; comments begin with #
# Application context files specify each sub-application in the
# parallel job, one per line.  The first sub-application is the 2
# a.out processes:
-np 2 a.out
# The second sub-application is the 2 b.out processes:
-np 2 b.out

This will result in the same behavior as running a.out and b.out from the command line.

Note that mpirun and mpiexec are identical in command-line options and behavior; using the above command lines with mpiexec instead of mpirun will result in the same behavior.


117. How do I specify the hosts on which my MPI job runs?

There are three general mechanisms:

  1. The --hostfile option to mpirun. Use this option to specify a list of hosts on which to run. Note that for compatibility with other MPI implementations, --machinefile is a synonym for --hostfile. See this FAQ entry for more information about the --hostfile option.
  2. The --host option to mpirun can be used on the command line to specify a list of hosts on which to run (a brief example appears below). See this FAQ entry for more information about the --host option.
  3. If you are running in a scheduled environment (e.g., in a SLURM, Torque, or LSF job), Open MPI will automatically get the lists of hosts from the scheduler.

NOTE: The specification of hosts using any of the above methods has nothing to do with the network interfaces that are used for MPI traffic. The list of hosts is only used to specify which hosts to launch MPI processes on.
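
For example, a minimal sketch using the --host option (the host names are placeholders):

shell$ mpirun --host host01.example.com,host02.example.com -np 4 my_parallel_application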


118. I can run ompi_info and launch MPI jobs on a single host, but not across multiple hosts. Why?

(you should probably see this FAQ entry, too)

If you can run ompi_info and possibly even launch MPI processes locally, but fail to launch MPI processes on remote hosts, it is likely that you do not have your PATH and/or LD_LIBRARY_PATH setup properly on the remote nodes.

Specifically, the Open MPI commands usually run properly even if LD_LIBRARY_PATH is not set properly because they encode the Open MPI library location in their executables and search there by default. Hence, running ompi_info (and friends) usually works, even in some improperly setup environments.

However, Open MPI's wrapper compilers do not encode the Open MPI library locations in MPI executables by default (the wrappers only specify a bare minimum of flags necessary to create MPI executables; we consider any flags beyond this bare minimum set a local policy decision). Hence, attempting to launch MPI executables in environments where LD_LIBRARY_PATH is either not set or was set improperly may result in messages about libmpi.so not being found.

You can change Open MPI's wrapper compiler behavior to specify the run-time location of Open MPI's libraries, if you wish.

Depending on how Open MPI was configured and/or invoked, it may even be possible to run MPI applications in environments where PATH and/or LD_LIBRARY_PATH is not set, or is set improperly. This can be desirable for environments where multiple MPI implementations are installed, such as multiple versions of Open MPI.


119. How can I diagnose problems when running across multiple hosts?

In addition to what is mentioned in this FAQ entry, when you are able to run MPI jobs on a single host, but fail to run them across multiple hosts, try the following:

  1. Ensure that your launcher is able to launch across multiple hosts. For example, if you are using ssh, try to ssh to each remote host and ensure that you are not prompted for a password. For example:

    shell$ ssh remotehost hostname
    remotehost
    

    If you are unable to launch across multiple hosts, check that your SSH keys are setup properly. Or, if you are running in a managed environment, such as in a SLURM, Torque, or other job launcher, check that you have reserved enough hosts, are running in an allocated job, etc.

  2. Ensure that your PATH and LD_LIBRARY_PATH are set correctly on each remote host on which you are trying to run. For example, with ssh:

    shell$ ssh remotehost env | grep -i path
    PATH=...path on the remote host...
    LD_LIBRARY_PATH=...LD library path on the remote host...
    

    If your PATH or LD_LIBRARY_PATH are not set properly, see this FAQ entry for the correct values. Keep in mind that it is fine to have multiple Open MPI installations installed on a machine; the first Open MPI installation found by PATH and LD_LIBRARY_PATH is the one that matters.

  3. Run a simple, non-MPI job across multiple hosts. This verifies that the Open MPI run-time system is functioning properly across multiple hosts. For example, try running the hostname command:

    shell$ mpirun --host remotehost hostname
    remotehost
    shell$ mpirun --host remotehost,otherhost hostname
    remotehost
    otherhost
    

    If you are unable to run non-MPI jobs across multiple hosts, check for common problems such as:

    1. Check your non-interactive shell setup on each remote host to ensure that it is setting the PATH and LD_LIBRARY_PATH properly.
    2. Check that Open MPI is finding and launching the correct version of Open MPI on the remote hosts.
    3. Ensure that you have firewalling disabled between hosts (Open MPI opens random TCP and sometimes random UDP ports between hosts in a single MPI job).
    4. Try running with the plm_base_verbose MCA parameter at level 10, which will enable extra debugging output to see how Open MPI launches on remote hosts. For example: mpirun --mca plm_base_verbose 10 --host remotehost hostname

  4. Now run a simple MPI job across multiple hosts that does not involve MPI communications. The "hello_c" program in the examples directory in the Open MPI distribution is a good choice. This verifies that the MPI subsystem is able to initialize and terminate properly. For example:

    shell$ mpirun --host remotehost,otherhost hello_c
    Hello, world, I am 0 of 1, (Open MPI v1.7.5, package: Open MPI jsquyres@builder.cisco.com Distribution, ident: 1.7.5, Mar 20, 2014, 99)
    Hello, world, I am 1 of 1, (Open MPI v1.7.5, package: Open MPI jsquyres@builder.cisco.com Distribution, ident: 1.7.5, Mar 20, 2014, 99)
    

    If you are unable to run simple, non-communication MPI jobs, this can indicate that your Open MPI installation is unable to initialize properly on remote hosts. Double check your non-interactive login setup on remote hosts.

  5. Now run a simple MPI job across multiple hosts that does some simple MPI communications. The "ring_c" program in the examples directory in the Open MPI distribution is a good choice. This verifies that the MPI subsystem is able to pass MPI traffic across your network. For example:

    shell$ mpirun --host remotehost,otherhost ring_c
    Process 0 sending 10 to 0, tag 201 (1 processes in ring)
    Process 0 sent to 0
    Process 0 decremented value: 9
    Process 0 decremented value: 8
    Process 0 decremented value: 7
    Process 0 decremented value: 6
    Process 0 decremented value: 5
    Process 0 decremented value: 4
    Process 0 decremented value: 3
    Process 0 decremented value: 2
    Process 0 decremented value: 1
    Process 0 decremented value: 0
    Process 0 exiting
    

    If you are unable to run simple MPI jobs across multiple hosts, this may indicate a problem with the network(s) that Open MPI is trying to use for MPI communications. Try limiting the networks that it uses, and/or exploring levels 1 through 3 MCA parameters for the communications module that you are using. For example, if you're using the TCP BTL, see the output of ompi_info --level 3 --param btl tcp .


120. When I build Open MPI with the Intel compilers, I get warnings about "orted" or my MPI application not finding libimf.so. What do I do?

The problem is usually because the Intel libraries cannot be found on the node where Open MPI is attempting to launch an MPI executable. For example:

shell$ mpirun -np 1 --host node1.example.com mpi_hello
orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
A daemon (pid 11893) died unexpectedly with status 127 while
attempting to launch so we are aborting.
[...more error messages...]

Open MPI first attempts to launch a "helper" daemon (orted) on node1.example.com, but it failed because one of orted's dependent libraries could not be found. This particular library, libimf.so, is an Intel compiler library. As such, it is likely that the user did not set up the Intel compiler library in their environment properly on this node.

Double check that you have setup the Intel compiler environment on the target node, for both interactive and non-interactive logins. It is a common error to ensure that the Intel compiler environment is setup properly for interactive logins, but not for non-interactive logins. For example:

shell$ cd $HOME
shell$ mpicc mpi_hello.c -o mpi_hello
shell$ ./mpi_hello
Hello world, I am 0 of 1.
shell$ ssh node1.example.com
Welcome to node1.
node1 shell$ ./mpi_hello
Hello world, I am 0 of 1.
node1 shell$ exit
shell$ ssh node1.example.com $HOME/mpi_hello
mpi_hello: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory

The above example shows that running a trivial C program compiled by the Intel compilers works fine on both the head node and node1 when logging in interactively, but fails when run on node1 non-interactively. Check your shell script startup files and verify that the Intel compiler environment is set up properly for non-interactive logins.
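
If the interactive case works but the non-interactive case does not, one quick way to narrow it down is to ask the dynamic linker what it can resolve over a non-interactive ssh session, and then add the Intel environment setup to a startup file that is read by non-interactive shells. The sketch below assumes a Bourne-style shell and a hypothetical Intel installation path; adjust both for your site:

shell$ ssh node1.example.com ldd $HOME/mpi_hello | grep "not found"
        libimf.so => not found
shell$ echo 'source /opt/intel/bin/compilervars.sh intel64' >> $HOME/.bashrc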


121. When I build Open MPI with the PGI compilers, I get warnings about "orted" or my MPI application not finding libpgc.so. What do I do?

The problem is usually because the PGI libraries cannot be found on the node where Open MPI is attempting to launch an MPI executable. For example:

shell$ mpirun -np 1 --host node1.example.com mpi_hello
orted: error while loading shared libraries: libpgc.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
A daemon (pid 11893) died unexpectedly with status 127 while
attempting to launch so we are aborting.
[...more error messages...]

Open MPI first attempts to launch a "helper" daemon (orted) on node1.example.com, but the launch failed because one of orted's dependent libraries could not be found. This particular library, libpgc.so, is a PGI compiler library. As such, it is likely that the PGI compiler environment was not set up properly on this node.

Double check that you have set up the PGI compiler environment on the target node, for both interactive and non-interactive logins. A common mistake is to have the PGI compiler environment set up properly for interactive logins, but not for non-interactive logins. For example:

shell$ cd $HOME
shell$ mpicc mpi_hello.c -o mpi_hello
shell$ ./mpi_hello
Hello world, I am 0 of 1.
shell$ ssh node1.example.com
Welcome to node1.
node1 shell$ ./mpi_hello
Hello world, I am 0 of 1.
node1 shell$ exit
shell$ ssh node1.example.com $HOME/mpi_hello
mpi_hello: error while loading shared libraries: libpgc.so: cannot open shared object file: No such file or directory

The above example shows that running a trivial C program compiled by the PGI compilers works fine on both the head node and node1 when logging in interactively, but fails when run on node1 non-interactively. Check your shell script startup files and verify that the PGI compiler environment is set up properly for non-interactive logins.


122. When I build Open MPI with the Pathscale compilers, I get warnings about "orted" or my MPI application not finding libmv.so. What do I do?

The problem is usually because the Pathscale libraries cannot be found on the node where Open MPI is attempting to launch an MPI executable. For example:

shell$ mpirun -np 1 --host node1.example.com mpi_hello
orted: error while loading shared libraries: libmv.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
A daemon (pid 11893) died unexpectedly with status 127 while
attempting to launch so we are aborting.
[...more error messages...]

Open MPI first attempts to launch a "helper" daemon (orted) on node1.example.com, but the launch failed because one of orted's dependent libraries could not be found. This particular library, libmv.so, is a Pathscale compiler library. As such, it is likely that the Pathscale compiler environment was not set up properly on this node.

Double check that you have set up the Pathscale compiler environment on the target node, for both interactive and non-interactive logins. A common mistake is to have the Pathscale compiler environment set up properly for interactive logins, but not for non-interactive logins. For example:

shell$ cd $HOME
shell$ mpicc mpi_hello.c -o mpi_hello
shell$ ./mpi_hello
Hello world, I am 0 of 1.
shell$ ssh node1.example.com
Welcome to node1.
node1 shell$ ./mpi_hello
Hello world, I am 0 of 1.
node1 shell$ exit
shell$ ssh node1.example.com $HOME/mpi_hello
mpi_hello: error while loading shared libraries: libmv.so: cannot open shared object file: No such file or directory

The above example shows that running a trivial C program compiled by the Pathscale compilers works fine on both the head node and node1 when logging in interactively, but fails when run on node1 non-interactively. Check your shell script startup files and verify that the Pathscale compiler environment is set up properly for non-interactive logins.


123. Can I run non-MPI programs with mpirun / mpiexec?

Yes.

Indeed, Open MPI's mpirun and mpiexec are actually synonyms for our underlying launcher named orterun (i.e., the Open Run-Time Environment layer in Open MPI, or ORTE). So you can use mpirun and mpiexec to launch any application. For example:

shell$ mpirun -np 2 --host a,b uptime

This will launch a copy of the unix command uptime on the hosts a and b.

Other questions in the FAQ section deal with the specifics of the mpirun command line interface; suffice it to say that it works equally well for MPI and non-MPI applications.


124. Can I run GUI applications with Open MPI?

Yes, but it will depend on your local setup and may require additional setup.

In short: you will need to have X forwarding enabled from the remote processes to the display where you want output to appear. In a secure environment, you can simply allow all X requests to be shown on the target display and set the DISPLAY environment variable in all MPI processes' environments to the target display, perhaps something like this:

shell$ hostname
my_desktop.secure-cluster.example.com
shell$ xhost +
shell$ mpirun -np 4 -x DISPLAY=my_desktop.secure-cluster.example.com a.out

However, this technique is not generally suitable for insecure environments (because it allows anyone to read and write to your display). A slightly more secure way is to only allow X connections from the nodes where your application will be running:

shell$ hostname
my_desktop.secure-cluster.example.com
shell$ xhost +compute1 +compute2 +compute3 +compute4
compute1 being added to access control list
compute2 being added to access control list
compute3 being added to access control list
compute4 being added to access control list
shell$ mpirun -np 4 -x DISPLAY=my_desktop.secure-cluster.example.com a.out

(assuming that the four nodes you are running on are compute1 through compute4).

Other methods are available, but they involve sophisticated X forwarding through mpirun and are generally more complicated than desirable.


125. Can I run ncurses-based / curses-based / applications with funky input schemes with Open MPI?

Maybe. But probably not.

Open MPI provides fairly sophisticated stdin / stdout / stderr forwarding. However, it does not work well with curses, ncurses, readline, or other sophisticated I/O packages that generally require direct control of the terminal.

Every application and I/O library is different -- you should try to see if yours is supported. But chances are that it won't work.

Sorry. :-(


126. What other options are available to mpirun?

mpirun supports the "--help" option which provides a usage message and a summary of the options that it supports. It should be considered the definitive list of what options are provided.

Several notable options (e.g., --hostfile, --host, and the scheduling controls) are discussed in the FAQ entries that follow.


127. How do I use the --hostfile option to mpirun?

The --hostfile option to mpirun takes a filename that lists hosts on which to launch MPI processes.

NOTE: The hosts listed in a hostfile have nothing to do with which network interfaces are used for MPI communication. They are only used to specify on which hosts to launch MPI processes.

Hostfiles are simple text files with hosts specified, one per line. Each host can also specify a default and a maximum number of slots to be used on that host (i.e., the number of available processors on that host). Comments are also supported, and blank lines are ignored. For example:

# This is an example hostfile.  Comments begin with #
#
# The following node is a single processor machine:
foo.example.com

# The following node is a dual-processor machine:
bar.example.com slots=2

# The following node is a quad-processor machine, and we absolutely
# want to disallow over-subscribing it:
yow.example.com slots=4 max-slots=4

slots and max-slots are discussed more in this FAQ entry.

Hostfiles work in two different ways:

  • Exclusionary: If a list of hosts to run on has been provided by another source (e.g., by a batch scheduler such as SLURM, PBS/Torque, SGE, etc.), the hosts provided by the hostfile must be in the already-provided host list. If the hostfile-specified nodes are not in the already-provided host list, mpirun will abort without launching anything.

    In this case, hostfiles act like an exclusionary filter -- they limit the scope of where processes will be scheduled from the original list of hosts to produce a final list of hosts.

    For example, say that a scheduler job contains hosts node01 through node04. If you run:

    shell$ cat my_hosts
    node03
    shell$ mpirun -np 1 --hostfile my_hosts hostname
    

    This will run a single copy of hostname on the host node03. However, if you run:

    shell$ cat my_hosts
    node17
    shell$ mpirun -np 1 --hostfile my_hosts hostname
    

    This is an error (because node17 is not listed in my_hosts); mpirun will abort.

    Finally, note that in exclusionary mode, processes will only be executed on the hostfile-specified hosts, even if it causes oversubscription. For example:

    shell$ cat my_hosts
    node03
    shell$ mpirun -np 4 --hostfile my_hosts hostname
    

    This will launch 4 copies of hostname on host node03.

  • Inclusionary: If a list of hosts has not been provided by another source, then the hosts provided by the --hostfile option will be used as the original and final host list.

    In this case, --hostfile acts as an inclusionary agent; all --hostfile-supplied hosts become available for scheduling processes. For example (assume that you are not in a scheduling environment where a list of nodes is being transparently supplied):

    shell$ cat my_hosts
    node01.example.com
    node02.example.com
    node03.example.com
    shell$ mpirun -np 3 --hostfile my_hosts hostname
    

    This will launch a single copy of hostname on the hosts node01.example.com, node02.example.com, and node03.example.com.

Note, too, that --hostfile is essentially a per-application switch. Hence, if you specify multiple applications (as in an MPMD job), --hostfile can be specified multiple times:

shell$ cat hostfile_1
node01.example.com
shell$ cat hostfile_2
node02.example.com
shell$ mpirun -np 1 --hostfile hostfile_1 hostname : -np 1 --hostfile hostfile_2 uptime
node01.example.com
 06:11:45 up 1 day,  2:32,  0 users,  load average: 21.65, 20.85, 19.84

Notice that hostname was launched on node01.example.com and uptime was launched on node02.example.com.


128. How do I use the --host option to mpirun?

The --host option to mpirun takes a comma-delimited list of hosts on which to run. For example:

shell$ mpirun -np 3 --host a,b,c hostname

Will launch one copy of hostname on hosts a, b, and c.

NOTE: The hosts specified by the --host option have nothing to do with which network interfaces are used for MPI communication. They are only used to specify on which hosts to launch MPI processes.

--host works in two different ways:

  • Exclusionary: If a list of hosts to run on has been provided by another source (e.g., by a hostfile or a batch scheduler such as SLURM, PBS/Torque, SGE, etc.), the hosts provided by the --host option must be in the already-provided host list. If the --host-specified nodes are not in the already-provided host list, mpirun will abort without launching anything.

    In this case, the --host option acts like an exclusionary filter -- it limits the scope of where processes will be scheduled from the original list of hosts to produce a final list of hosts.

    For example, say that the hostfile my_hosts contains the hosts node1 through node4. If you run:

    shell$ mpirun -np 1 --hostfile my_hosts --host node3 hostname
    

    This will run a single copy of hostname on the host node3. However, if you run:

    shell$ mpirun -np 1 --hostfile my_hosts --host node17 hostname
    

    This is an error (because node17 is not listed in my_hosts); mpirun will abort.

    Finally, note that in exclusionary mode, processes will only be executed on the --host-specified hosts, even if it causes oversubscription. For example:

    shell$ mpirun -np 4 --host a uptime
    

    This will launch 4 copies of uptime on host a.

  • Inclusionary: If a list of hosts has not been provided by another source, then the hosts provided by the --host option will be used as the original and final host list.

    In this case, --host acts as an inclusionary agent; all --host-supplied hosts become available for scheduling processes. For example (assume that you are not in a scheduling environment where a list of nodes is being transparently supplied):

    shell$ mpirun -np 3 --host a,b,c hostname
    

    This will launch a single copy of hostname on the hosts a, b, and c.

Note, too, that --host is essentially a per-application switch. Hence, if you specify multiple applications (as in an MPMD job), --host can be specified multiple times:

shell$ mpirun -np 1 --host a hostname : -np 1 --host b uptime

This will launch hostname on host a and uptime on host b.


129. How do I control how my processes are scheduled across nodes?

The short version is that if you are not oversubscribing your nodes (i.e., trying to run more processes than you have told Open MPI are available on that node), scheduling is pretty simple and occurs either on a by-slot or by-node round robin schedule. If you're oversubscribing, the issue gets much more complicated -- keep reading.

The more complete answer is: Open MPI schedules processes to nodes by asking two questions from each application on the mpirun command line:

  • How many processes should be launched?
  • Where should those processes be launched?

The "how many" question is directly answered with the -np switch to mpirun. The "where" question is a little more complicated, and depends on three factors:

  • The final node list (e.g., after --host exclusionary or inclusionary processing)
  • The scheduling policy (which applies to all applications in a single job)
  • The default and maximum number of slots on each host

As briefly mentioned in this FAQ entry, slots are Open MPI's representation of how many processors are available on a given host.

The default number of slots on any machine, if not explicitly specified, is 1 (e.g., if a host is listed in a hostfile but has no corresponding "slots" keyword). Schedulers (such as SLURM, PBS/Torque, SGE, etc.) automatically provide an accurate default slot count.

Max slot counts, however, are rarely specified by schedulers. The max slot count for each node will default to "infinite" if it is not provided (meaning that Open MPI will oversubscribe the node if you ask it to -- see more on oversubscribing in this FAQ entry).

Open MPI currently supports two scheduling policies: by slot and by node:

  • By slot: This is the default scheduling policy, but can also be explicitly requested by using either the --byslot option to mpirun or by setting the MCA parameter rmaps_base_schedule_policy to the string "slot".

    In this mode, Open MPI will schedule processes on a node until all of its default slots are exhausted before proceeding to the next node. In MPI terms, this means that Open MPI tries to maximize the number of adjacent ranks in MPI_COMM_WORLD on the same host without oversubscribing that host.

    For example:

    shell$ cat my-hosts
    node0 slots=2 max_slots=20
    node1 slots=2 max_slots=20
    shell$ mpirun --hostfile my-hosts -np 8 --byslot hello | sort
    Hello World I am rank 0 of 8 running on node0
    Hello World I am rank 1 of 8 running on node0
    Hello World I am rank 2 of 8 running on node1
    Hello World I am rank 3 of 8 running on node1
    Hello World I am rank 4 of 8 running on node0
    Hello World I am rank 5 of 8 running on node0
    Hello World I am rank 6 of 8 running on node1
    Hello World I am rank 7 of 8 running on node1
    

  • By node: This policy can be requested either by using the --bynode option to mpirun or by setting the MCA parameter rmaps_base_schedule_policy to the string "node".

    In this mode, Open MPI will schedule a single process on each node in a round-robin fashion (looping back to the beginning of the node list as necessary) until all processes have been scheduled. Nodes are skipped once their default slot counts are exhausted.

    For example:

    shell$ cat my-hosts
    node0 slots=2 max_slots=20
    node1 slots=2 max_slots=20
    shell$ mpirun --hostfile my-hosts -np 8 --bynode hello | sort
    Hello World I am rank 0 of 8 running on node0
    Hello World I am rank 1 of 8 running on node1
    Hello World I am rank 2 of 8 running on node0
    Hello World I am rank 3 of 8 running on node1
    Hello World I am rank 4 of 8 running on node0
    Hello World I am rank 5 of 8 running on node1
    Hello World I am rank 6 of 8 running on node0
    Hello World I am rank 7 of 8 running on node1
    

In both policies, if the default slot count is exhausted on all nodes while there are still processes to be scheduled, Open MPI will loop through the list of nodes again and try to schedule one more process to each node until all processes are scheduled. Nodes are skipped in this process if their maximum slot count is exhausted. If the maximum slot count is exhausted on all nodes while there are still processes to be scheduled, Open MPI will abort without launching any processes.
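
For reference, either policy can also be selected with the rmaps_base_schedule_policy MCA parameter mentioned above instead of the --byslot / --bynode command-line flags. A minimal sketch, reusing the my-hosts file and hello program from the examples above:

shell$ mpirun --mca rmaps_base_schedule_policy node --hostfile my-hosts -np 8 hello | sort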

NOTE: This is the scheduling policy in Open MPI because of a long historical precedent in LAM/MPI. However, the scheduling of processes to processors is a component in the RMAPS framework in Open MPI; it can be changed. If you don't like how this scheduling occurs, please let us know.


130. I'm not using a hostfile. How are slots calculated?

If you are using a supported resource manager, Open MPI will get the slot information directly from that entity. If you are using the --host parameter to mpirun, be aware that each instance of a hostname bumps up the internal slot count by one. For example:

shell$ mpirun --host node0,node0,node0,node0 ....

This tells Open MPI that host "node0" has a slot count of 4. This is very different than, for example:

shell$ mpirun -np 4 --host node0 a.out

This tells Open MPI that host "node0" has a slot count of 1 but you are running 4 processes on it. Specifically, Open MPI assumes that you are oversubscribing the node.


131. Can I run multiple parallel processes on a uniprocessor machine?

Yes.

But be very careful to ensure that Open MPI knows that you are oversubscribing your node! If Open MPI is unaware that you are oversubscribing a node, severe performance degradation can result.

See this FAQ entry for more details on oversubscription.


132. Can I oversubscribe nodes (run more processes than processors)?

Yes.

However, it is critical that Open MPI knows that you are oversubscribing the node, or severe performance degradation can result.

The short explanation is: never specify a number of slots that is greater than the number of available processors. For example, if you want to run 4 processes on a uniprocessor, then indicate that you have only 1 slot but want to run 4 processes:

shell$ cat my-hostfile
localhost
shell$ mpirun -np 4 --hostfile my-hostfile a.out

Specifically: do NOT have a hostfile that contains "slots = 4" (because there is only one available processor).

Here's the full explanation:

Open MPI basically runs its message passing progression engine in two modes: aggressive and degraded. In degraded mode (used when Open MPI thinks a node is oversubscribed), each MPI process frequently yields the processor to its peers while waiting for message passing progress, allowing all processes on the node to make progress. In aggressive mode (used when Open MPI thinks there are enough processors for all of its processes), MPI processes never voluntarily give up the processor while waiting, which gives the best latency but performs very badly if the node is actually oversubscribed.

For example, on a uniprocessor node:

shell$ cat my-hostfile
localhost slots=4
shell$ mpirun -np 4 --hostfile my-hostfile a.out

This would cause all 4 MPI processes to run in aggressive mode because Open MPI thinks that there are 4 available processors to use. This is actually a lie (there is only 1 processor -- not 4), and can cause extremely bad performance.


133. Can I force Aggressive or Degraded performance modes?

Yes.

The MCA parameter mpi_yield_when_idle controls whether an MPI process runs in Aggressive or Degraded performance mode. Setting it to zero forces Aggressive mode; any other value forces Degraded mode (see this FAQ entry to see how to set MCA parameters).

Note that this value only affects the behavior of MPI processes when they are blocking in MPI library calls. It does not affect behavior of non-MPI processes, nor does it affect the behavior of a process that is not inside an MPI library call.

Open MPI normally sets this parameter automatically (see this FAQ entry for details). Users are cautioned against setting this parameter unless they are really, absolutely, positively sure of what they are doing.
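
For example, to explicitly force Degraded mode for a single run (bearing in mind the caution above), a minimal sketch is:

shell$ mpirun --mca mpi_yield_when_idle 1 -np 4 a.out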


134. How do I run with the TotalView parallel debugger?

Generally, you can run Open MPI processes with TotalView as follows:

shell$ mpirun --debug ...mpirun arguments...

Assuming that TotalView is the first supported parallel debugger in your path, Open MPI will automatically invoke the correct underlying command to run your MPI process in the TotalView debugger. Be sure to see this FAQ entry for details about what versions of Open MPI and TotalView are compatible.

For reference, this underlying command form is the following:

shell$ totalview mpirun -a ...mpirun arguments...

So if you wanted to run a 4-process MPI job of your a.out executable, it would look like this:

shell$ totalview mpirun -a -np 4 a.out

Alternatively, Open MPI's mpirun offers the "-tv" convenience option which does the same thing as TotalView's "-a" syntax. For example:

shell$ mpirun -tv -np 4 a.out

Note that by default, TotalView will stop deep in the machine code of mpirun itself, which is not what most users want. It is possible to get TotalView to recognize that mpirun is simply a "starter" program and should be (effectively) ignored. Specifically, TotalView can be configured to skip mpirun (and mpiexec and orterun) and jump right into your MPI application. This can be accomplished by placing some startup instructions in a TotalView-specific file named $HOME/.tvdrc.

Open MPI includes a sample TotalView startup file that performs this function (see etc/openmpi-totalview.tcl in Open MPI distribution tarballs; it is also installed, by default, to $prefix/etc/openmpi-totalview.tcl in the Open MPI installation). This file can be either copied to $HOME/.tvdrc or sourced from the $HOME/.tvdrc file. For example, placing the following line in your $HOME/.tvdrc (replacing /path/to/openmpi/installation with the proper directory name, of course) will use the Open MPI-provided startup file:

shell$ source /path/to/openmpi/installation/etc/openmpi-totalview.tcl


135. How do I run with the DDT parallel debugger?

If you've used DDT at least once before (to use the configuration wizard to setup support for Open MPI), you can start it on the command line with:

shell$ mpirun --debug ...mpirun arguments...

Assuming that you are using Open MPI v1.2.4 or later, and assuming that DDT is the first supported parallel debugger in your path, Open MPI will automatically invoke the correct underlying command to run your MPI process in the DDT debugger. For reference (or if you are using an earlier version of Open MPI), this underlying command form is the following:

shell$ ddt -n {nprocs} -start {exe-name}

Note that passing arbitrary arguments to Open MPI's mpirun is not supported with the DDT debugger.

You can also attach to already-running processes with either of the following two syntaxes:

shell$ ddt -attach {hostname1:pid} [{hostname2:pid} ...] {exec-name}
# Or
shell$ ddt -attach-file {filename of newline separated hostname:pid pairs} {exec-name}

DDT can even be configured to operate with cluster/resource schedulers such that it can run on a local workstation, submit your MPI job via the scheduler, and then attach to the MPI job when it starts.

See the official DDT documentation for more details.


136. What launchers are available?

The documentation contained in the Open MPI tarball will have the most up-to-date information, but as of v1.0, Open MPI supports:

  • BProc versions 3 and 4 (discontinued starting with OMPI v1.3)
  • Sun Grid Engine (SGE), and the open source Grid Engine (support first introduced in Open MPI v1.2)
  • PBS Pro, Torque, and Open PBS
  • LoadLeveler scheduler (full support since 1.1.1)
  • rsh / ssh
  • SLURM
  • LSF
  • XGrid (discontinued starting with OMPI 1.4)
  • Yod (Cray XT-3 and XT-4)


137. How do I specify to the rsh launcher to use rsh or ssh?

See this FAQ entry.


138. How do I run with the SLURM and PBS/Torque launchers?

If support for these systems is included in your Open MPI installation (which you can check with the ompi_info command -- look for components named "slurm" and/or "tm"), Open MPI will automatically detect when it is running inside such jobs and will just "do the Right Thing."
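
For example, one quick way to check is to grep ompi_info's component listing for the relevant component names (a sketch only; the exact lines shown depend on how your copy of Open MPI was built):

shell$ ompi_info | grep slurm
shell$ ompi_info | grep " tm "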

See this FAQ entry for a description of how to run jobs in SLURM; see this FAQ entry for a description of how to run jobs in PBS/Torque.


139. Can I suspend and resume my job?

A new feature was added into Open MPI 1.3.1 that supports suspend/resume of an MPI job. To suspend the job, you send a SIGTSTP (not SIGSTOP) signal to mpirun. mpirun will catch this signal and forward it to the a.outs as a SIGSTOP signal. To resume the job, you send a SIGCONT signal to mpirun which will be caught and forwarded to the a.outs.

By default, this feature is not enabled. This means that both the SIGTSTP and SIGCONT signals will simply be consumed by the mpirun process. To have them forwarded, you have to run the job with --mca orte_forward_job_control 1. Here is an example on Solaris.

shell$ mpirun -mca orte_forward_job_control 1 -np 2 a.out

In another window, we suspend and continue the job.

shell$ prstat -p 15301,15303,15305
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 15305 rolfv     158M   22M cpu1     0    0   0:00:21 5.9% a.out/1
 15303 rolfv     158M   22M cpu2     0    0   0:00:21 5.9% a.out/1
 15301 rolfv    8128K 5144K sleep   59    0   0:00:00 0.0% orterun/1
shell$ kill -TSTP 15301
shell$ prstat -p 15301,15303,15305
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 15303 rolfv     158M   22M stop    30    0   0:01:44  21% a.out/1
 15305 rolfv     158M   22M stop    20    0   0:01:44  21% a.out/1
 15301 rolfv    8128K 5144K sleep   59    0   0:00:00 0.0% orterun/1
shell$ prstat -p 15301,15303,15305
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 15305 rolfv     158M   22M cpu1     0    0   0:02:06  17% a.out/1
 15303 rolfv     158M   22M cpu3     0    0   0:02:06  17% a.out/1
 15301 rolfv    8128K 5144K sleep   59    0   0:00:00 0.0% orterun/1
shell$ kill -CONT 15301
shell$ prstat -p 15301,15303,15305

Note that all this does is stop the a.outs. It does not, for example, free any pinned memory when the job is in the suspended state.

To get this to work under the SGE environment, you have to change the suspend_method entry in the queue. It has to be set to SIGTSTP. Here is an example of what a queue should look like.

shell$ qconf -sq all.q
qname                 all.q
[...snip...]
starter_method        NONE
suspend_method        SIGTSTP
resume_method         NONE 

Note that if you need to suspend other types of jobs with SIGSTOP (instead of SIGTSTP) in this queue then you need to provide a script that can implement the correct signals for each job type.


140. How do I run with LoadLeveler?

If support for LoadLeveler is included in your Open MPI installation (which you can check with the ompi_info command -- look for components named "loadleveler"), Open MPI will automatically detect when it is running inside such jobs and will just "do the Right Thing."

Specifically, if you execute an mpirun command in a LoadLeveler job, it will automatically determine what nodes and how many slots on each node have been allocated to the current job. There is no need to specify what nodes to run on. Open MPI will then attempt to launch the job using whatever resource is available (on Linux rsh/ssh is used).

For example:

shell$ cat job
#@ output  = job.out
#@ error   = job.err
#@ job_type = parallel
#@ node = 3
#@ tasks_per_node = 4
mpirun a.out
shell$ llsubmit job

This will run 4 MPI processes per node on the 3 nodes which were allocated by LoadLeveler for this job.

For users of the Open MPI 1.1 series: version 1.1.0 has a problem that prevents Open MPI from determining what nodes are available to it if the job has more than 128 tasks. In the 1.1.x series, starting with version 1.1.1, this can be worked around by passing "-mca ras_loadleveler_priority 110" to mpirun. Version 1.2 and above work without any additional flags.
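
For example, with an affected 1.1.x installation, the workaround described above would be applied on the mpirun command line like this (sketch only):

shell$ mpirun -mca ras_loadleveler_priority 110 a.out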


141. How do I load libmpi at runtime?

If you want to load the shared library libmpi explicitly at runtime, either by using dlopen() from C/C++ or something like the ctypes package from Python, some extra care is required. The default configuration of Open MPI uses dlopen() internally to load its support components. These components rely on symbols available in libmpi. In order to make the symbols in libmpi available to the components loaded by Open MPI at runtime, libmpi must be loaded with the RTLD_GLOBAL option.

In C/C++, this option is specified as the second parameter to dlopen(). When using ctypes with Python, this can be done with the second (optional) parameter to CDLL(). For example (shown below on Mac OS X, where Open MPI's shared library name ends in ".dylib"; other operating systems use other suffixes, such as ".so"):

  from ctypes import *
  mpi = CDLL('libmpi.0.dylib', RTLD_GLOBAL)
  f = pythonapi.Py_GetArgcArgv
  argc = c_int()
  argv = POINTER(c_char_p)()
  f(byref(argc), byref(argv))
  mpi.MPI_Init(byref(argc), byref(argv))
  mpi.MPI_Finalize()

Other scripting languages should have similar options when dynamically loading shared libraries.


142. What MPI environmental variables exist?

Beginning with the 1.3 release, Open MPI provides the following environmental variables that will be defined on every MPI process:

  • OMPI_COMM_WORLD_SIZE - the number of processes in this process' MPI_COMM_WORLD
  • OMPI_COMM_WORLD_RANK - the MPI rank of this process
  • OMPI_COMM_WORLD_LOCAL_RANK - the relative rank of this process on this node within its job. For example, if four processes in a job share a node, they will each be given a local rank ranging from 0 to 3.
  • OMPI_UNIVERSE_SIZE - the number of process slots allocated to this job. Note that this may be different than the number of processes in the job.
  • OMPI_COMM_WORLD_LOCAL_SIZE - the number of ranks from this job that are running on this node.
  • OMPI_COMM_WORLD_NODE_RANK - the relative rank of this process on this node looking across ALL jobs.

Open MPI guarantees that these variables will remain stable throughout future releases.
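
As a quick illustration, a small wrapper script can print a few of these variables for each process. This is only a sketch; the script name and the hosts "a" and "b" are placeholders:

shell$ cat print_rank.sh
#!/bin/sh
# Report where this process landed and which ranks it was assigned
echo "host=$(hostname) rank=$OMPI_COMM_WORLD_RANK of $OMPI_COMM_WORLD_SIZE (local rank $OMPI_COMM_WORLD_LOCAL_RANK)"
shell$ chmod +x print_rank.sh
shell$ mpirun -np 2 --host a,b ./print_rank.sh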


143. How do I get my MPI job to wireup its MPI connections right away?

By default, Open MPI opens MPI connections between processes in a "lazy" fashion - i.e., the connections are only opened when the MPI process actually attempts to send a message to another process for the first time. This is done since (a) Open MPI has no idea what connections an application process will really use, and (b) creating the connections takes time. Once the connection is established, it remains "connected" until one of the two connected processes terminates, so the creation time cost is paid only once.

Applications that require a fully connected topology, however, can see improved startup time if they automatically "pre-connect" all their processes during MPI_Init. Accordingly, Open MPI provides the MCA parameter "mpi_preconnect_mpi" which directs Open MPI to establish a "mostly" connected topology during MPI_Init (note that this MCA parameter used to be named "mpi_preconnect_all" prior to Open MPI v1.5; in v1.5, it was deprecated and replaced with "mpi_preconnect_mpi"). This is accomplished in a somewhat scalable fashion to help minimize startup time.

Users can set this parameter in two ways:

  • in the environment as OMPI_MCA_mpi_preconnect_mpi=1
  • on the cmd line as mpirun -mca mpi_preconnect_mpi 1

See this FAQ entry for more details on how to set MCA parameters.


144. What kind of CUDA support exists in Open MPI?

Since Open MPI 1.7.0, there is support for sending and receiving CUDA device memory directly. Prior to this support, the programmer had to stage the data in host memory before making the MPI calls. Now, the Open MPI library will automatically detect that the pointer being passed in is a CUDA device memory pointer and do the right thing. This is referred to as CUDA-aware support.

The use of device pointers is supported in all of the send and receive APIs as well as most of the collective APIs. Neither the collective reduction APIs nor the one-sided APIs are supported. Here is the list of APIs that currently support sending and receiving CUDA device memory.

MPI_Send, MPI_Bsend, MPI_Ssend, MPI_Rsend, MPI_Isend, MPI_Ibsend, MPI_Issend, MPI_Irsend, MPI_Send_init, MPI_Bsend_init, MPI_Ssend_init, MPI_Rsend_init, MPI_Recv, MPI_Irecv, MPI_Recv_init, MPI_Sendrecv, MPI_Bcast, MPI_Gather, MPI_Gatherv, MPI_Allgather, MPI_Allgatherv, MPI_Alltoall, MPI_Alltoallv, MPI_Scatter, MPI_Scatterv

Open MPI depends on various features of CUDA 4.0, so one needs at least the CUDA 4.0 driver and toolkit. The key new feature is Unified Virtual Addressing (UVA), which ensures that all pointers within a program have unique addresses. There is also a new API that allows one to determine whether a pointer is a CUDA device pointer or a host memory pointer; this API is used by the library to decide what needs to be done with each buffer. In addition, CUDA 4.1 provides the ability to register host memory with the CUDA driver, which can improve performance, and adds CUDA IPC support for fast communication between GPUs on the same node.

Note that derived datatypes, both contiguous and non-contiguous, are supported. However, the non-contiguous datatypes currently have high overhead because of the many calls to cuMemcpy to copy all the pieces of the buffer into the intermediate buffer.

CUDA-aware support is only available in the sm, smcuda, tcp, and openib BTLs. The smcuda BTL is an optimized version of the sm BTL that takes advantage of the CUDA IPC support for fast GPU transfers. Most of the other optimizations are built into the openib BTL.

There is no CUDA-aware support in any of the MTLs.
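
If you want to be explicit about which transports a CUDA-aware run may use, you can restrict the BTL list to the CUDA-aware components named above plus the self BTL. A minimal sketch (adjust the list to the BTLs that actually exist on your system):

shell$ mpirun -np 2 --mca btl smcuda,openib,self a.out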

Open MPI 1.7.0, Open MPI 1.7.1, Open MPI 1.7.2

  • Basic GPU direct support.
  • Support for CUDA IPC between GPUs on a node, but an error would occur if the GPUs did not support CUDA IPC.

Open MPI 1.7.3 New Features

  • Support for asynchronous copies of larger GPU buffers over the openib BTL.
  • Dynamically loads the libcuda.so library so you can configure with CUDA-aware support, but run on machines that do not have CUDA installed.

Open MPI 1.7.4 New Features

  • Removed a synchronization point in CUDA IPC when running with CUDA 6.0 or later.
  • Utilizes GPU Direct RDMA if it is available. Requires CUDA 6.0 or later.
  • Dynamically enables CUDA IPC support between GPUs and falls back to copying through host memory if it is not available.

For best results, it is recommended that you use Open MPI 1.7.3 or later.

Additional Information about CUDA-aware support

Here are some relevant MCA parameters to extract extra information if you are having issues. For Open MPI 1.7.3 and later, you can see whether the library was built with CUDA-aware support.

 > ./ompi_info --parsable -l 9 -all | grep mpi_built_with_cuda_support:value
 mca:mpi:base:param:mpi_built_with_cuda_support:value:true

To get some extra information, there are some verbose flags. The opal_cuda_verbose parameter has only one level of verbosity. (Works on all versions)

--mca opal_cuda_verbose 10

This mpi_common_cuda_verbose flag provides additional information about CUDA-aware related activities. This can be set to a variety of different values. There is really no need to use these unless you have strange problems. (Works on all versions)

--mca mpi_common_cuda_verbose 10
--mca mpi_common_cuda_verbose 20
--mca mpi_common_cuda_verbose 100

There are three new MCA parameters introduced with Open MPI 1.7.4 related to the use of CUDA IPC. By default, CUDA IPC is used where possible. But the user can now turn it off if they want.

--mca btl_smcuda_use_cuda_ipc 0

In addition, it is assumed that CUDA IPC is possible when running on the same GPU and this is typically true. However, there is the ability to turn it off.

--mca btl_smcuda_use_cuda_ipc_same_gpu 0

Lastly, to get some insight into whether CUDA IPC is being used, you can turn on some verbosity that shows whether CUDA IPC gets enabled between two GPUs.

--mca btl_smcuda_cuda_ipc_verbose 100

GPU Direct RDMA Information

Open MPI 1.7.4 has added some support to take advantage of GPU Direct RDMA on Mellanox cards. However, the supporting driver has not been released yet, so these features cannot be used yet. Note that to get GPU Direct RDMA support, you also need to configure your Open MPI library with CUDA 6.0.

To see if you have GPU Direct RDMA compiled into your library, you can check like this:

> ompi_info --all | grep btl_openib_have_cuda_gdr
   MCA btl: informational "btl_openib_have_cuda_gdr" (current value: "true", data source: default, level: 4 tuner/basic, type: bool)

To see if your OFED stack has GPU Direct RDMA support, you can check like this.

> ompi_info -all | grep btl_openib_have_driver_gdr
   MCA btl: informational "btl_openib_have_driver_gdr" (current value: "true", data source: default, level: 4 tuner/basic, type: bool)

To run with GPU Direct RDMA support, you have to enable it as it is off by default.

--mca btl_openib_want_cuda_gdr 1

GPU Direct RDMA Implementation Details

With GPU Direct RDMA support selected, the eager protocol is unused. This is done to avoid the penalty of copying unexpected GPU messages into host memory. Instead, a rendezvous protocol is used where the sender and receiver both register their GPU buffers and make use of GPU Direct RDMA support to transfer the data. This is done for all messages that are less than 30,000 bytes in size. For larger messages, the openib BTL switches to using pipelined buffers as that has better performance at larger messages. So, by default, with GPU Direct RDMA enabled, the underlying protocol usage is like this:

0      < message size < 30,000      GPU Direct RDMA
30,000 < message size < infinity    Asynchronous copies through host memory

You can adjust the point where we switch to asynchronous copies with the --mca btl_openib_cuda_rdma_limit value. For example, if you want to increase the switchover point to 100,000 bytes, then set it like this.

--mca btl_openib_cuda_rdma_limit 100000

By default, if GPU Direct RDMA is available, it is used for messages from 1 byte up to the btl_openib_cuda_rdma_limit value. However, you can use the eager protocol for the smallest messages by setting the --mca btl_openib_cuda_eager_limit value. Note: the btl_openib_cuda_eager_limit value includes some overhead, so you cannot just set it to the payload size; it has to be set to the payload plus the extra upper-layer bytes. Currently, in Open MPI 1.7.4, this overhead is 44 bytes, so that is the minimum value. In the list below we are referring only to the size of the payload.
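
For example, if you wanted the eager protocol to be used for payloads of up to 1,000 bytes, you would add the 44 bytes of overhead described above and set the parameter like this (sketch only):

--mca btl_openib_cuda_eager_limit 1044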

The following shows how the various runtime parameters affect which protocol is used with GPU Direct RDMA:

  • 0 < message size < btl_openib_cuda_eager_limit (default=0): eager protocol (not used by default)
  • btl_openib_cuda_eager_limit (default=0) < message size < btl_openib_cuda_rdma_limit (default=30,000): rendezvous protocol utilizing GPU Direct RDMA
  • btl_openib_cuda_rdma_limit (default=30,000) < message size < infinity: pipelined transfers of size 128K through host memory

Performance Note: Registering GPU memory with the Mellanox driver is expensive, so it is best to reuse the same GPU buffer for communication.

NUMA Node Issues: When running on a node that has multiple GPUs, you may want to select the GPU that is closest to the process you are running on. One way to do this is to make use of the hwloc library. Following is a code snippet that can be used in your application to select a nearby GPU: it determines which CPU the process is running on and then looks for the closest GPU. Note that there could be multiple GPUs that are the same distance away. This depends on having hwloc available somewhere on your system.

/**
 * Test program to show the use of hwloc to select the GPU closest to the CPU
 * that the MPI program is running on.  Note that this works even without
 * any libpciaccess or libpci support, as it keys off the NVIDIA vendor ID.
 * There may be other ways to implement this but this is one way.
 * January 10, 2014
 */
#include <assert.h>
#include <stdio.h>
#include "cuda.h"
#include "mpi.h"
#include "hwloc.h"

#define ABORT_ON_ERROR(func)                          \
  { CUresult res;                                     \
    res = func;                                       \
    if (CUDA_SUCCESS != res) {                        \
        printf("%s returned error=%d\n", #func, res); \
        abort();                                      \
    }                                                 \
  }                             
static hwloc_topology_t topology = NULL;
static int gpuIndex = 0;
static hwloc_obj_t gpus[16] = {0};

/**
 * This function searches for all the GPUs that are hanging off a NUMA
 * node.  It walks through each of the PCI devices and looks for ones
 * with the NVIDIA vendor ID.  It then stores them into an array.
 * Note that there can be more than one GPU on the NUMA node.
 */

static void find_gpus(hwloc_topology_t topology, hwloc_obj_t parent, hwloc_obj_t child) {
    hwloc_obj_t pcidev;
    pcidev = hwloc_get_next_child(topology, parent, child);
    if (NULL == pcidev) {
        return;
    } else if (0 != pcidev->arity) {
        /* This device has children so need to look recursively at them */
        find_gpus(topology, pcidev, NULL);
        find_gpus(topology, parent, pcidev);
    } else {
        if (pcidev->attr->pcidev.vendor_id == 0x10de) {
            gpus[gpuIndex++] = pcidev;
        }
        find_gpus(topology, parent, pcidev);
    }
}
int main(int argc, char *argv[])
{
    int rank, retval, length;
    char procname[MPI_MAX_PROCESSOR_NAME+1];
    const unsigned long flags = HWLOC_TOPOLOGY_FLAG_IO_DEVICES | HWLOC_TOPOLOGY_FLAG_IO_BRIDGES;
    hwloc_cpuset_t newset;
    hwloc_obj_t node, bridge;
    char pciBusId[16];
    CUdevice dev;
    char devName[256];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (MPI_SUCCESS != MPI_Get_processor_name(procname, &length)) {
        strcpy(procname, "unknown");
    }

    /* Now decide which GPU to pick.  This requires hwloc to work properly.
     * We first see which CPU we are bound to, then try and find a GPU nearby.
     */
    retval = hwloc_topology_init(&topology);
    assert(retval == 0);
    retval = hwloc_topology_set_flags(topology, flags);
    assert(retval == 0);
    retval = hwloc_topology_load(topology);
    assert(retval == 0);
    newset = hwloc_bitmap_alloc();
    retval = hwloc_get_last_cpu_location(topology, newset, 0);
    assert(retval == 0);

    /* Get the object that contains the cpuset */
    node = hwloc_get_first_largest_obj_inside_cpuset(topology, newset);

    /* Climb up from that object until we find the HWLOC_OBJ_NODE */
    while (node->type != HWLOC_OBJ_NODE) {
        node = node->parent;
    }

    /* Now look for the HWLOC_OBJ_BRIDGE.  All PCI busses hanging off the
     * node will have one of these */
    bridge = hwloc_get_next_child(topology, node, NULL);
    while (bridge->type != HWLOC_OBJ_BRIDGE) {
        bridge = hwloc_get_next_child(topology, node, bridge);
    }

    /* Now find all the GPUs on this NUMA node and put them into an array */
    find_gpus(topology, bridge, NULL);

    ABORT_ON_ERROR(cuInit(0));
    /* Now select the first GPU that we find */
    if (gpus[0] == 0) {
        printf("No GPU found\n");
        exit(1);
    } else {
        sprintf(pciBusId, "%.2x:%.2x:%.2x.%x", gpus[0]->attr->pcidev.domain, gpus[0]->attr->pcidev.bus,
        gpus[0]->attr->pcidev.dev, gpus[0]->attr->pcidev.func);
        ABORT_ON_ERROR(cuDeviceGetByPCIBusId(&dev, pciBusId));
        ABORT_ON_ERROR(cuDeviceGetName(devName, 256, dev));
        printf("rank=%d (%s): Selected GPU=%s, name=%s\n", rank, procname, pciBusId, devName);
    }

    MPI_Finalize();
    return 0;
}

See this FAQ entry for details on how to configure the CUDA support into the library.
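
Until you get to that entry, note that a typical configure invocation for CUDA-aware support looks roughly like the following; the CUDA installation path is an assumption and should be adjusted for your system:

shell$ ./configure --with-cuda=/usr/local/cuda ...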


145. Open MPI tells me that it fails to load components with a "file not found" error -- but the file is there! Why does it say this?

Open MPI loads a lot of plugins at run time. It opens its plugins via the excellent GNU Libtool libltdl portability library. If a plugin fails to load, Open MPI queries libltdl to get a printable string indicating why the plugin failed to load.

Unfortunately, there is a well-known bug in libltdl that may cause a "file not found" error message to be displayed, even when the file is found. The "file not found" error usually masks the real, underlying cause of the problem. For example:

mca: base: component_find: unable to open /opt/openmpi/mca_ras_dash_host: file not found (ignored)

Note that Open MPI put in a libltdl workaround starting with version 1.5. This workaround should print the real reason the plugin failed to load instead of the erroneous "file not found" message.

There are two common underlying causes why a plugin fails to load:

  1. The plugin is for a different version of Open MPI. This FAQ entry has more information about this case.
  2. The plugin cannot find shared libraries that it requires. For example, if the openib plugin fails to load, ensure that libibverbs.so can be found by the linker at run time (e.g., check the value of your LD_LIBRARY_PATH environment variable). The same is true for any other plugin that has shared-library dependencies (e.g., the mx BTL and MTL plugins need to be able to find the libmyriexpress.so shared library at run time); one way to check is sketched below.
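
A quick way to perform that check is to run ldd on the plugin file itself and look for unresolved dependencies. A minimal sketch (the installation prefix below is a placeholder for wherever Open MPI is installed):

shell$ ldd /opt/openmpi/lib/openmpi/mca_btl_openib.so | grep "not found"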


146. I see strange messages about missing symbols in my application; what do these mean?

Open MPI loads a lot of plugins at run time. It opens its plugins via the excellent GNU Libtool libltdl portability library. Sometimes a plugin can fail to load because it can't resolve all the symbols that it needs. There are a few reasons why this can happen.

  • The plugin is for a different version of Open MPI. See this FAQ entry for an explanation of how Open MPI might try to open the "wrong" plugins.
  • An application is trying to manually dynamically open libmpi in a private symbol space. For example, if an application is not linked against libmpi, but rather calls something like this:

    /* This is a Linux example -- the issue is similar/the same on other
       operating systems */
    handle = dlopen("libmpi.so", RTLD_NOW | RTLD_LOCAL);
    

    This is due to some deep run time linker voodoo -- it is discussed towards the end of this post to the Open MPI developer's list. Briefly, the issue is this:

    1. The dynamic library libmpi is opened in a "local" symbol space.
    2. MPI_INIT is invoked, which tries to open Open MPI's plugins.
    3. Open MPI's plugins rely on symbols in libmpi (and other Open MPI support libraries); these symbols must be resolved when the plugin is loaded.
    4. However, since libmpi was opened in a "local" symbol space, its symbols are not available to the plugins that it opens.
    5. Hence, the plugin fails to load because it can't resolve all of its symbols, and displays a warning message to that effect.

    The ultimate fix for this issue is a bit bigger than Open MPI, unfortunately -- it's a POSIX issue (as briefly described in the devel posting, above).

    However, there are several common workarounds:

    • Dynamically open libmpi in a public / global symbol scope -- not a private / local scope. This will enable libmpi's symbols to be available for resolution when Open MPI dynamically opens its plugins.
    • If libmpi is opened as part of some underlying framework where it is not possible to change the private / local scope to a public / global scope, then dynamically open libmpi in a public / global scope before invoking the underlying framework. This sounds a little gross (and it is), but at least the run-time linker is smart enough to not load libmpi twice -- and it does keep libmpi in a public scope.
    • Use the --disable-dlopen or --disable-mca-dso options to Open MPI's configure script (see this FAQ entry for more details on these options). These options slurp all of Open MPI's plugins up into libmpi -- meaning that the plugins physically reside in libmpi and will not be dynamically opened at run time.
    • Build Open MPI as a static library by configuring Open MPI with --disable-shared and --enable-static. This has the same effect as --disable-dlopen, but it also makes libmpi.a (as opposed to a shared library).


147. What is mca_pml_teg.so? Why am I getting warnings about not finding the mca_ptl_base_modules_initialized symbol from it?

You may wonder why you see this warning message (put here verbatim so that it becomes web-searchable):

mca_pml_teg.so:undefined symbol:mca_ptl_base_modules_initialized

This happens when you upgrade to Open MPI v1.1 (or later) over an old installation of Open MPI v1.0.x without previously uninstalling v1.0.x. There are fairly uninteresting reasons why this problem occurs; the simplest, safest solution is to uninstall version 1.0.x and then re-install your newer version. For example:

shell# cd /path/to/openmpi-1.0
shell# make uninstall
[... lots of output ...]
shell# cd /path/to/openmpi-1.1
shell# make install

The above example shows changing into the Open MPI 1.1 directory to re-install, but the same concept applies to any version after Open MPI version 1.0.x.

Note that this problem is fairly specific to installing / upgrading Open MPI from the source tarball. Pre-packaged installers (e.g., RPM) typically do not incur this problem.


148. Can I build shared libraries on AIX with the IBM XL compilers?

Short answer: in older versions of Open MPI, maybe.

Add "LDFLAGS=-Wl,-brtl" to your configure command line:

shell$ ./configure LDFLAGS=-Wl,-brtl ...

This enables "runtimelinking", which will make GNU Libtool name the libraries properly (i.e., *.so). More importantly, runtimelinking will cause the runtime linker to behave more or less like an ELF linker would (with respect to symbol resolution).

Future versions of OMPI may not require this flag (and "runtimelinking" on AIX).

NOTE: As of OMPI v1.2, AIX is no longer supported.


149. Why am I getting a seg fault in libopal?

It is likely that you did not actually get a segv in libopal; rather, you are probably seeing a message like this (with OMPI v1.0 and v1.1):

[0] func:/opt/ompi/lib/libopal.so.0 [0x2a958de8a7]

or something like this (OMPI v1.2 and beyond; Linux output shown below -- looks slightly different on other OS's):

[0] func:/opt/ompi/lib/libopal.so.0(opal_backtrace_print+0x2b) [0x2a958de8a7]

This is actually the function that is printing out the stack trace message; it is not the function that caused the segv itself. The function that caused the problem will be a few below this. Future versions of OMPI will simply not display this libopal function in the segv reporting to avoid confusion.

Let's provide a concrete example. Take the following trivial MPI program that is guaranteed to cause a seg fault in MPI_COMM_WORLD rank 1:

shell$ cat segv.c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 1) {
        char *d = 0;
        /* This will cause a seg fault */
        *d = 3;
    }

    MPI_Finalize();
    return 0;
}

Running this code, you'll see something similar to the following:

shell$ mpicc segv.c -o segv -g
shell$ mpirun -np 2 --mca btl tcp,self segv
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:(nil)
[0] func:/opt/ompi/lib/libopal.so.0(opal_backtrace_print+0x2b) [0x2a958de8a7]
[1] func:/opt/ompi/lib/libopal.so.0 [0x2a958dd2b7]
[2] func:/lib64/tls/libpthread.so.0 [0x3be410c320]
[3] func:segv(main+0x3c) [0x400894]
[4] func:/lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x3be361c4bb]
[5] func:segv [0x4007ca]
*** End of error message ***

The real error was back up in main, which is #3 on the stack trace. But Open MPI's stack-tracing function (opal_backtrace_print, in this case) is what is displayed as #0, so it's an easy mistake to assume that libopal is the culprit.


150. Why am I getting seg faults / MPI parameter errors when compiling C++ applications with the Intel 9.1 C++ compiler?

Early versions of the Intel 9.1 C++ compiler series had problems with the Open MPI C++ bindings. Even trivial MPI applications that used the C++ MPI bindings could incur process failures (such as segmentation violations) or generate MPI-level errors complaining about invalid parameters.

Intel released a new version of their 9.1 series C++ compiler on October 5, 2006 (build 44) that seems to solve all of these issues. The Open MPI team recommends that all users needing the C++ MPI API upgrade to this version (or later) if possible. Since the problems are with the compiler, there is little that Open MPI can do to work around the issue; upgrading the compiler seems to be the only solution.


151. All my MPI applications segv! Why? (Intel Linux 12.1 compiler)

Users have reported on the Open MPI users mailing list multiple times that when they compile Open MPI with the Intel 12.1 compiler suite, Open MPI tools (such as the wrapper compilers, including mpicc) and MPI applications will seg fault immediately.

As far as we know, this affects both Open MPI v1.4.4 (and later) and v1.5.4 (and later).

Here's one example of a user reporting this to the Open MPI User's list.

The cause of the problem has turned out to be a bug in early versions of the Intel Linux 12.1 compiler series itself. If you upgrade your Intel compiler to the latest version of the Intel 12.1 compiler suite and rebuild Open MPI, the problem will go away.


152. Why can't I attach my parallel debugger (TotalView, DDT, fx2, etc.) to parallel jobs?

As noted in this FAQ entry, Open MPI supports parallel debuggers that utilize the TotalView API for parallel process attaching. However, attaching can sometimes fail if Open MPI is not installed correctly. Symptoms of this failure typically involve having the debugger hang (or crash) when attempting to attach to a parallel MPI application.

Parallel debuggers may rely on Open MPI's mpirun program being compiled without optimization. Open MPI's configure and build process therefore attempts to identify optimization flags and remove them when compiling mpirun, but it does not have knowledge of all optimization flags for all compilers. Hence, if you specify some esoteric optimization flags to Open MPI's configure script, some optimization flags may slip through the process and create an mpirun that cannot be read by TotalView and other parallel debuggers.

If you run into this problem, you can manually build mpirun without optimization flags. Go into the tree where you built Open MPI:

shell$ cd /path/to/openmpi/build/tree
shell$ cd orte/tools/orterun
shell$ make clean
[...output not shown...]
shell$ make all CFLAGS=-g
[...output not shown...]
shell$

This will build mpirun (also known as orterun) with just the "-g" flag. Once this completes, run make install, also from within the orte/tools/orterun directory, and possibly as root, depending on where you installed Open MPI. Using this new orterun (mpirun), your parallel debugger should be able to attach to MPI jobs.

Additionally, a user reported to us that setting some TotalView flags may be helpful with attaching. The user specifically cited the Open MPI v1.3 series compiled with the Intel 11 compilers and TotalView 8.6, but it may also be helpful for other versions, too:

shell$ export with_tv_debug_flags="-O0 -g -fno-inline-functions"


153. When launching large MPI jobs, I see messages like: mca_oob_tcp_peer_complete_connect: connection failed: Connection timed out (110) - retrying

This is a known issue in the Open MPI v1.2 series. Try the following:

  1. If you are using Linux-based systems, increase some of the limits on the node where mpirun is invoked (you must have administrator/root privileges to increase these limits):

    # The default is 128; increase it to 10,000
    shell# echo 10000 > /proc/sys/net/core/somaxconn
    
    # The default is 1,000; increase it to 100,000
    shell# echo 100000 > /proc/sys/net/core/netdev_max_backlog
    

  2. Set the oob_tcp_listen_mode MCA parameter to the string value listen_thread. This enables Open MPI's mpirun to respond much more quickly to incoming TCP connections during job launch, for example:

    shell$ mpirun --mca oob_tcp_listen_mode listen_thread -np 1024 my_mpi_program
    

    See this FAQ entry for more details on how to set MCA parameters.


154. How do I find out what MCA parameters are being seen/used by my job?

As described elsewhere, MCA parameters are the "life's blood" of Open MPI. MCA parameters are used to control both detailed and large-scale behavior of Open MPI and are present throughout the code base.

This raises an important question: since MCA parameters can be set from a file, the environment, the command line, and even internally within Open MPI, how do I actually know what MCA params my job is seeing, and their values?

One way, of course, is to use the ompi_info command, which is documented elsewhere (you can use "man ompi_info", or "ompi_info --help" to get more info on this command). However, this still doesn't fully answer the question since ompi_info isn't an MPI process.

To help relieve this problem, Open MPI (starting with the 1.3 release) provides the MCA parameter mpi_show_mca_params that directs the rank=0 MPI process to report the name of MCA parameters, their current value as seen by that process, and the source that set that value. The parameter can take several values that define which MCA parameters to report:

  • all: report all MCA params. Note that this typically generates a rather long list of parameters since it includes all of the default parameters defined inside Open MPI
  • default: MCA params that are at their default settings - i.e., all MCA params that are at the values set as default within Open MPI
  • file: MCA params that had their value set by a file
  • api: MCA params set using Open MPI's internal APIs, perhaps to override an incompatible set of conditions specified by the user
  • enviro: MCA params that obtained their value either from the local environment or the command line. Open MPI treats environmental and command line parameters as equivalent, so there currently is no way to separate these two sources

These options can be combined in any order by separating them with commas.

Here is an example of the output generated by this parameter:

$ mpirun -mca grpcomm basic -mca mpi_show_mca_params enviro ./hello
ess=env (environment or cmdline)
orte_ess_jobid=1016725505 (environment or cmdline)
orte_ess_vpid=0 (environment or cmdline)
grpcomm=basic (environment or cmdline)
mpi_yield_when_idle=0 (environment or cmdline)
mpi_show_mca_params=enviro (environment or cmdline)
Hello, World, I am 0 of 1

Note that several MCA parameters set by Open MPI itself for internal uses are displayed in addition to the ones actually set by the user.

Since the output from this option can be long, and since it can be helpful to have a more permanent record of the MCA parameters used for a job, a companion MCA parameter mpi_show_mca_params_file is provided. If mpi_show_mca_params is also set, the output listing of MCA parameters will be directed into the specified file instead of being printed to stdout.
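For example, the following hypothetical invocation (the output filename is only an illustration) records the environment-set MCA parameters of a job into a file:

shell$ mpirun --mca mpi_show_mca_params enviro \
    --mca mpi_show_mca_params_file /tmp/my-job-mca-params.txt -np 4 ./my_mpi_program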


155. How do I debug Open MPI processes in parallel?

This is a difficult question. Debugging in serial can be tricky: errors, uninitialized variables, stack smashing, ... etc. Debugging in parallel adds multiple different dimensions to this problem: a greater propensity for race conditions, asynchronous events, and the general difficulty of trying to understand N processes simultaneously executing -- the problem becomes quite formidable.

This FAQ section does not provide any definitive solutions to debugging in parallel. At best, it shows some general techniques and a few specific examples that may be helpful to your situation.

But there are various controls within Open MPI that can help with debugging. These are probably the most valuable entries in this FAQ section.


156. What tools are available for debugging in parallel?

There are two main categories of tools that can aid in parallel debugging:

  • Debuggers: Both serial and parallel debuggers are useful. Serial debuggers are what most programmers are used to (e.g., gdb), while parallel debuggers can attach to all the individual processes in an MPI job simultaneously, treating the MPI application as a single entity. This can be an extremely powerful abstraction, allowing the user to control every aspect of the MPI job, manually replicate race conditions, etc.
  • Profilers: Tools that analyze your usage of MPI and display statistics and meta information about your application's run. Some tools present the information "live" (as it occurs), while others collect the information and display it in a post mortem analysis.

Both freeware and commercial solutions are available for each kind of tool.


157. How do I run with parallel debuggers?

See these FAQ entries:


158. What controls does Open MPI have that aid in debugging?

Open MPI has a series of MCA parameters for the MPI layer itself that are designed to help with debugging. These parameters can be set in the usual ways. MPI-level MCA parameters can be displayed by invoking the following command:

shell$ ompi_info --param mpi all

Here is a summary of the debugging parameters for the MPI layer:

  • mpi_param_check: If set to true (any positive value), and when Open MPI is compiled with parameter checking enabled (the default), the parameters to each MPI function can be passed through a series of correctness checks. Problems such as passing illegal values (e.g., NULL or MPI_DATATYPE_NULL or other "bad" values) will be discovered at run time and an MPI exception will be invoked (the default of which is to print a short message and abort the entire MPI job). If set to 0, these checks are disabled, slightly increasing performance.
  • mpi_show_handle_leaks: If set to true (any positive value), OMPI will display lists of any MPI handles that were not freed before MPI_FINALIZE (e.g., communicators, datatypes, requests, etc.).
  • mpi_no_free_handles: If set to true (any positive value), do not actually free MPI objects when their corresponding MPI "free" function is invoked (e.g., do not free communicators when MPI_COMM_FREE is invoked). This can be helpful in tracking down applications that accidentally continue to use MPI handles after they have been freed.
  • mpi_show_mca_params: If set to true (any positive value), show a list of all MCA parameters and their values during MPI_INIT. This can be quite helpful for reproducibility of MPI applications.
  • mpi_show_mca_params_file: If set to a non-empty value, and if the value of mpi_show_mca_params is true, then output the list of MCA parameters to the filename specified by this value. If this parameter is empty, the list is sent to stderr.
  • mpi_keep_peer_hostnames: If set to a true value (any positive value), send the list of all hostnames involved in the MPI job to every process in the job. This can improve the specificity of error messages that Open MPI emits if a problem occurs (i.e., Open MPI can display the name of the peer host that it was trying to communicate with), but it can somewhat slow down the startup of large-scale MPI jobs.
  • mpi_abort_delay: If nonzero, print out an identifying message when MPI_ABORT is invoked showing the hostname and PID of the process that invoked MPI_ABORT, and then delay that many seconds before exiting. A negative value means to delay indefinitely. This allows a user to manually come in and attach a debugger when an error occurs. Remember that the default MPI error handler -- MPI_ERRORS_ARE_FATAL -- effectively invokes MPI_ABORT, so this parameter can be useful to discover problems identified by mpi_param_check.
  • mpi_abort_print_stack: If nonzero, print out a stack trace (on supported systems) when MPI_ABORT is invoked.
  • mpi_ddt_<foo>_debug, where <foo> can be one of pack, unpack, position, or copy: These are internal debugging features that are not intended for end users (but ompi_info will report that they exist).
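As an illustration (the application name is hypothetical), several of these parameters can be combined on a single mpirun command line:

shell$ mpirun --mca mpi_param_check 1 --mca mpi_show_handle_leaks 1 \
    --mca mpi_abort_delay 120 -np 4 ./my_mpi_program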


159. Do I need to build Open MPI with compiler/linker debugging flags (such as -g) to be able to debug MPI applications?

No.

If you build Open MPI without compiler/linker debugging flags (such as -g), you will not be able to step inside MPI functions when you debug your MPI applications. However, this is likely what you want -- the internals of Open MPI are quite complex and you probably don't want to start poking around in there.

You'll need to compile your own applications with -g (or whatever your compiler's equivalent is), but unless you have a need/desire to be able to step into MPI functions to see the internals of Open MPI, you do not need to build Open MPI with -g.


160. Can I use serial debuggers (such as gdb) to debug MPI applications?

Yes; the Open MPI developers do this all the time.

There are two common ways to use serial debuggers:

  1. Attach to individual MPI processes after they are running.

    For example, launch your MPI application as normal with mpirun. Then login to the node(s) where your application is running and use the --pid option to gdb to attach to your application.

    An inelegant-but-functional technique commonly used with this method is to insert the following code in your application where you want to attach:

    {
        int i = 0;
        char hostname[256];
        gethostname(hostname, sizeof(hostname));
        printf("PID %d on %s ready for attach\n", getpid(), hostname);
        fflush(stdout);
        while (0 == i)
            sleep(5);
    }
    

    This code will print a line to stdout showing the hostname where the process is running and the PID to attach to. It will then spin on the sleep() function forever waiting for you to attach with a debugger. Using sleep() as the inside of the loop means that the processor won't be pegged at 100% while waiting for you to attach.

    Once you attach with a debugger, go up the function stack until you are in this block of code (you'll likely attach during the sleep()) then set the variable i to a nonzero value. With GDB, the syntax is:

    (gdb) set var i = 7
    

    Then set a breakpoint after your block of code and continue execution until the breakpoint is hit. Now you have control of your live MPI application and can use the full functionality of the debugger.

    You can even add conditionals to only allow this "pause" in the application for specific MPI processes (e.g., MPI_COMM_WORLD rank 0, or whatever process is misbehaving); a hypothetical sketch of this appears after this list.

  2. Use mpirun to launch xterms (or equivalent) with serial debuggers.

    This technique launches a separate window for each MPI process in MPI_COMM_WORLD, each one running a serial debugger (such as gdb) that will launch and run your MPI application. Having a separate window for each MPI process can be quite handy for low process-count MPI jobs, but requires a bit of setup and configuration that is outside of Open MPI to work properly. A naive approach would be to assume that the following would immediately work:

    shell$ mpirun -np 4 xterm -e gdb my_mpi_application
    

    Unfortunately, it likely won't work. Several factors must be considered:

    1. What launcher is Open MPI using? In an rsh/ssh environment, Open MPI will default to using ssh when it is available, falling back to rsh when ssh cannot be found in the $PATH. But note that Open MPI closes the ssh (or rsh) sessions when the MPI job starts for scalability reasons. This means that the built-in SSH X forwarding tunnels will be shut down before the xterms can be launched. Although it is possible to force Open MPI to keep its SSH connections active (to keep the X tunneling available), we recommend using non-SSH-tunneled X connections, if possible (see below).
    2. In non-rsh/ssh environments (such as when using resource managers), the environment of the process invoking mpirun may be copied to all nodes. In this case, the DISPLAY environment variable may not be suitable.
    3. Some operating systems default to disabling the X11 server from listening for remote/network traffic. For example, see this post on the user's mailing list, describing how to enable network access to the X11 server on Fedora Linux.
    4. There may be intermediate firewalls or other network blocks that prevent X traffic from flowing between the hosts where the MPI processes (and xterms) are running and the host connected to the output display.

    The easiest way to get remote X applications (such as xterm) to display on your local screen is to forego the security of SSH-tunneled X forwarding. In a closed environment such as an HPC cluster, this may be an acceptable practice (indeed, you may not even have the option of using SSH X forwarding if SSH logins to cluster nodes are disabled), but check with your security administrator to be sure.

    If using non-encrypted X11 forwarding is permissible, we recommend the following:

    1. For each non-local host where you will be running an MPI process, add it to your X server's permission list with the xhost command. For example:

      shell$ cat my_hostfile
      inky
      blinky
      stinky
      clyde
      shell$ for host in `cat my_hostfile` ; do xhost +$host ; done
      

    2. Use the -x option to mpirun to export an appropriate DISPLAY variable so that the launched X applications know where to send their output. An appropriate value is usually (but not always) the hostname containing the display where you want the output and the :0 (or :0.0) suffix. For example:

      shell$ hostname
      arcade.example.com
      shell$ mpirun -np 4 --hostfile my_hostfile \
          -x DISPLAY=arcade.example.com:0 xterm -e gdb my_mpi_application
      

      Note that X traffic is fairly "heavy" -- if you are operating over a slow network connection, it may take some time before the xterm windows appear on your screen.

    3. If your xterm supports it, the -hold option may be useful. -hold tells xterm to stay open even when the application has completed. This means that if something goes wrong (e.g., gdb fails to execute, or unexpectedly dies, or ...), the xterm window will stay open allowing you to see what happened, instead of closing immediately and losing whatever error message may have been output.
    4. When you have finished, you may wish to disable X11 network permissions from the hosts that you were using. Use xhost again to disable these permissions:

      shell$ for host in `cat my_hostfile` ; do xhost -$host ; done
      

    Note that mpirun will not complete until all the xterms complete.
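Returning to the first technique (attaching to running processes), here is a minimal, hypothetical sketch of how the "pause for attach" block shown above can be restricted to MPI_COMM_WORLD rank 0. The function name and message format are illustrative only:

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: only rank 0 pauses and waits for a debugger */
static void wait_for_debugger_attach(void)
{
    volatile int i = 0;
    int rank;
    char hostname[256];

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (0 != rank) {
        return;   /* all other ranks continue immediately */
    }
    gethostname(hostname, sizeof(hostname));
    printf("PID %d on %s (rank %d) ready for attach\n",
           (int) getpid(), hostname, rank);
    fflush(stdout);
    while (0 == i) {
        sleep(5); /* avoid pegging the CPU while waiting */
    }
}

Call such a function (after MPI_INIT) at the point in your application where you want to attach.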


161. My process dies without any output. Why?

There may be many reasons for this; the Open MPI Team strongly encourages the use of tools (such as debuggers) whenever possible.

One of the reasons, however, may come from inside Open MPI itself. If your application fails due to memory corruption, Open MPI may subsequently fail to output an error message before dying. Specifically, starting with v1.3, Open MPI attempts to aggregate error messages from multiple processes in an attempt to show unique error messages only once (vs. one for each MPI process -- which can be unwieldy, especially when running large MPI jobs).

However, this aggregation process requires allocating memory in the MPI process when it displays the error message. If the process' memory is already corrupted, Open MPI's attempt to allocate memory may fail and the process will simply die, possibly silently. When Open MPI does not attempt to aggregate error messages, most of its setup work is done during MPI_INIT and no memory is allocated during the "print the error" routine. It therefore almost always successfully outputs error messages in real time -- but at the expense that you'll potentially see the same error message for each MPI process that encountered the error.

Hence, the error message aggregation is usually a good thing, but sometimes it can mask a real error. You can disable Open MPI's error message aggregation with the orte_base_help_aggregate MCA parameter. For example:

shell$ mpirun --mca orte_base_help_aggregate 0 ...


162. What is Memchecker?

The Memchecker MCA component allows MPI-semantic checking of your application (as well as of Open MPI's internals) with the help of memory-checking tools such as the Memcheck tool of the Valgrind suite (http://www.valgrind.org/).

The Memchecker component is included in Open MPI v1.3 and later.


163. What kind of errors can Memchecker find?

Memchecker is implemented on top of Valgrind's Memcheck tool, so it inherits all of Memcheck's capabilities: it checks all reads and writes of memory, and intercepts calls to malloc/new/free/delete. Most importantly, Memchecker is able to detect user buffer errors in both non-blocking and one-sided communications, e.g., reading or writing to buffers of active non-blocking receive operations and writing to buffers of active non-blocking send operations.

Here are some examples of errors that Memchecker can detect:

Accessing buffer under control of non-blocking communication:

int buf;
MPI_Request req;
MPI_Status status;
MPI_Irecv(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
/* The following line will produce a memchecker warning */
buf = 4711;
MPI_Wait(&req, &status);

Wrong input parameters, e.g. wrongly sized send buffers:

char *send_buffer;
send_buffer = malloc(5);
memset(send_buffer, 0, 5);
/* The following line will produce a memchecker warning */
MPI_Send(send_buffer, 10, MPI_CHAR, 1, 0, MPI_COMM_WORLD);

Accessing window under control of one-sided communication:

MPI_Get(A, 10, MPI_INT, 1, 0, 1, MPI_INT, win);
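/* The following line will produce a memchecker warning */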
A[0] = 4711;
MPI_Win_fence(0, win); 

Uninitialized input buffers:

char *buffer;
buffer = malloc(10);
/* The following line will produce a memchecker warning */
MPI_Send(buffer, 10, MPI_CHAR, 1, 0, MPI_COMM_WORLD);

Usage of the uninitialized MPI_ERROR field of the MPI_Status structure (the MPI-1 standard defines the MPI_ERROR field to be undefined for single-completion calls such as MPI_Wait or MPI_Test; see MPI-1 p. 22):

MPI_Wait(&request, &status);
/* The following line will produce a memchecker warning */
if (status.MPI_ERROR != MPI_SUCCESS)
    return ERROR;


164. How can I use Memchecker?

To use Memchecker, you need Open MPI 1.3 or later, and Valgrind 3.2.0 or later.

As this functionality is off by default, you need to turn it on with the configure flag --enable-memchecker. configure will then check for a recent Valgrind distribution and include the compilation of ompi/opal/mca/memchecker. You can verify that the component is being built by using the ompi_info application. Please note that all of this only makes sense together with --enable-debug, which Valgrind requires in order to output messages pointing directly to the relevant source code lines; without debugging info, the messages from Valgrind are nearly useless.

Here is a configuration example to enable Memchecker:

shell$ ./configure --prefix=/path/to/openmpi --enable-debug \
    --enable-memchecker --with-valgrind=/path/to/valgrind

To check if Memchecker is successfully enabled after installation, simply run this command:

shell$ ompi_info | grep memchecker

You should see output like this:

    MCA memchecker: valgrind (MCA v1.0, API v1.0, Component v1.3)

Otherwise, you probably didn't configure and install Open MPI correctly.


165. How do I run my MPI application with Memchecker?

First of all, you have to make sure that Valgrind 3.2.0 or later is installed, and that Open MPI is compiled with Memchecker enabled. Then simply run your application with Valgrind, e.g.:

shell$ mpirun -np 2 valgrind ./my_app

Or if you enabled Memchecker, but you don't want to check the application at this time, then just run your application as usual. E.g.:

shell$ mpirun -np 2 ./my_app


166. Does Memchecker cause performance degradation to my application?

The configure option --enable-memchecker (together with --enable-debug) does cause performance degradation, even if not running under Valgrind. The following explains the mechanism and may help in making the decision whether to provide a cluster-wide installation with --enable-memchecker.

There are two cases:

Further information and performance data with the NAS Parallel Benchmarks may be found in the paper Memory Debugging of MPI-Parallel Applications in Open MPI.


167. Is Open MPI 'Valgrind-clean' or how can I identify real errors?

This issue has been raised many times on the Open MPI mailing lists.

There are many situations where Open MPI purposefully does not initialize memory that it subsequently communicates, e.g., by calling writev. Furthermore, several cases are known where memory is not properly freed upon MPI_Finalize.

This certainly does not help in distinguishing real errors from false positives. Valgrind, however, provides functionality to suppress errors and warnings from certain function contexts.

In an attempt to ease debugging using Valgrind, starting with v1.5, Open MPI provides a so-called Valgrind suppression file that can be passed on the command line:

shell$ mpirun -np 2 valgrind --suppressions=$PREFIX/share/openmpi/openmpi-valgrind.supp ./my_app

More information on suppression-files and how to generate them can be found in Valgrind's Documentation.


168. Can I make Open MPI use rsh instead of ssh?

Yes. The method to do this has changed over the different versions of Open MPI.

  1. v1.3 series: The orte_rsh_agent MCA parameter accepts a colon-delimited list of programs to search for in your path to use as the remote startup agent (the MCA parameter name plm_rsh_agent also works, but it is deprecated). The default value is "ssh : rsh", meaning that it will look for ssh first, and if it doesn't find it, use rsh. You can change the value of this parameter as relevant to your environment, such as simply changing it to rsh or rsh : ssh if you have a mixture.
  2. v1.1 and v1.2 series: The v1.1 and v1.2 method is exactly the same as the v1.3 method, but the MCA parameter name is slightly different: pls_rsh_agent ("pls" vs. "plm"). Using the old "pls" name will continue to work in the v1.3 series, but it is now officially deprecated -- you'll receive a warning if you use it.
  3. v1.0 series: In the 1.0.x series, Open MPI defaults to using ssh for remote startup of processes in unscheduled environments. You can change this to rsh by setting the MCA parameter pls_rsh_agent to rsh.

See this FAQ entry for details on how to set MCA parameters -- particularly with multi-word values.
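For example, to force the use of rsh on the command line with the v1.3-style parameter name (the application name is hypothetical):

shell$ mpirun --mca orte_rsh_agent rsh -np 4 ./my_mpi_program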


169. What pre-requisites are necessary for running an Open MPI job under rsh/ssh?

In general, they are the same for running Open MPI jobs in other environments (see this FAQ category for more general information).


170. How can I make ssh not ask me for a password?

There are multiple ways. (If you are using rsh rather than ssh to launch processes on remote nodes, see the FAQ entry on .rhosts files instead.)

Note that there are multiple versions of ssh available. References to ssh in this text refer to OpenSSH.

This documentation provides an overview for using user keys and the OpenSSH 2.x key management agent (if your OpenSSH only supports 1.x key management, you should upgrade). See the OpenSSH documentation for more details and a more thorough description. The process is essentially the same for other versions of SSH, but the command names and filenames may be slightly different. Consult your SSH documentation for more details.

Normally, when you use ssh to connect to a remote host, it will prompt you for your password. However, for mpirun (and mpiexec, which, in Open MPI, is identical to mpirun) to work properly, you need to be able to execute jobs on remote nodes without typing in a password. In order to do this, you will need to set up an SSH key pair with a passphrase. We recommend using RSA keys, as they are generally "better" (i.e., more secure) than DSA keys. As such, this text describes the process for RSA setup.

NOTE: This text briefly shows the steps involved, but the ssh documentation is authoritative on these matters and should be consulted for more information.

The first thing that you need to do is generate an RSA key pair to use with ssh-keygen:

shell$ ssh-keygen -t rsa

Accept the default value for the file in which to store the key ([$HOME/.ssh/id_rsa]) and enter a passphrase for your key pair. You may choose not to enter a passphrase and thereby obviate the need for using the ssh-agent. However, this greatly weakens the authentication that is possible, because your secret key is unencrypted and therefore potentially vulnerable to compromise. It has been compared to the moral equivalent of leaving a plain text copy of your password in your $HOME directory. See the ssh documentation for more details.

Next, copy the $HOME/.ssh/id_rsa.pub file generated by ssh-keygen to $HOME/.ssh/authorized_keys (or add it to the end of authorized_keys if that file already exists):

shell$ cd $HOME/.ssh
shell$ cp id_rsa.pub authorized_keys

In order for RSA authentication to work, you need to have the $HOME/.ssh directory in your home directory on all the machines you are running Open MPI. If your home directory is on a common filesystem, this may be already taken care of. If not, you will need to copy the $HOME/.ssh directory to your home directory on all Open MPI nodes (be sure to do this in a secure manner -- perhaps using the scp command -- particularly if your secret key is not encrypted).

ssh is very particular about file permissions. Ensure that your home directory on all your machines is set to at least mode 755, your $HOME/.ssh directory is also set to at least mode 755, and that the following files inside $HOME/.ssh have at least the following permissions:

-rw-r--r--  authorized_keys
-rw-------  id_rsa
-rw-r--r--  id_rsa.pub
-rw-r--r--  known_hosts

The phrase "at least" in the above paragraph means the following:

  • The files need to be readable by you
  • The files should only be writable by you
  • The files should not be executable
  • Aside from id_rsa, the files can be readable by others, but do not need to be
  • Your $HOME and $HOME/.ssh directories can be readable by others, but do not need to be
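As an illustration (a sketch only; adapt to your site's policies), the following commands set permissions consistent with the listing above:

shell$ chmod 755 $HOME $HOME/.ssh
shell$ chmod 644 $HOME/.ssh/authorized_keys $HOME/.ssh/id_rsa.pub $HOME/.ssh/known_hosts
shell$ chmod 600 $HOME/.ssh/id_rsa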

You are now set up to use RSA authentication. However, when you ssh to a remote host, you will still be asked for your RSA passphrase (as opposed to your normal password). This is where the ssh-agent program comes in. It allows you to type in your RSA passphrase once, and then have all successive invocations of ssh automatically authenticate you against the remote host. See the ssh-agent(1) documentation for more details than what are provided here.

Additionally, check the documentation and setup of your local environment; ssh-agent may already be setup for you (e.g., see if the shell environment variable $SSH_AUTH_SOCK exists; if so, ssh-agent is likely already running). If ssh-agent is not already running, you can start it manually with the following:

shell$ eval `ssh-agent`

Note the specific invocation method: ssh-agent outputs some shell commands to its output (e.g., setting the SSH_AUTH_SOCK environment variable).

You will probably want to start the ssh-agent before you start your graphics / windowing system so that all your windows will inherit the environment variables set by this command. Note that some sites invoke ssh-agent for each user upon login automatically; be sure to check and see if there is an ssh-agent running for you already.

Once the ssh-agent is running, you can tell it your passphrase by running the ssh-add command:

shell$ ssh-add $HOME/.ssh/id_rsa

At this point, if you ssh to a remote host that has the same $HOME/.ssh directory as your local one, you should not be prompted for a password or passphrase. If you are, a common problem is that the permissions in your $HOME/.ssh directory are not as they should be.

Note that this text has covered the ssh commands in very little detail. Please consult the ssh documentation for more information.


171. What is a .rhosts file? Do I need it?

If you are using rsh to launch processes on remote nodes, you will probably need to have a $HOME/.rhosts file.

This file allows you to execute commands on remote nodes without being prompted for a password. The permissions on this file usually must be 0644 ([rw-r--r--]). It must exist in your home directory on every node that you plan to use Open MPI with.

Each line in the .rhosts file indicates a machine and user that programs may be launched from. For example, if the user steve wishes to launch programs from the machine stevemachine to the machines alpha, beta, and gamma, there must be a .rhosts file on each of the three remote machines ([alpha], beta, and gamma) with at least the following line in it:

stevemachine steve

The first field indicates the name of the machine where jobs may originate from; the second field indicates the user ID who may originate jobs from that machine. It is better to supply a fully-qualified domain name for the machine name (for security reasons -- there may be many machines named stevemachine on the internet). So the above example should be:

stevemachine.example.com steve

The Open MPI Team strongly discourages the use of "+" in the .rhosts file. This is always a huge security hole.

If rsh does not find a matching line in the $HOME/.rhosts file, it will prompt you for a password. Open MPI requires the password-less execution of commands; if rsh prompts for a password, mpirun will fail.

NOTE: Some implementations of rsh are very picky about the format of text in the .rhosts file. In particular, some do not allow leading white space on each line in the .rhosts file, and will give a misleading "permission denied" error if you have white space before the machine name.

NOTE: It should be noted that rsh is not considered "secure" or "safe" -- .rhosts authentication is considered fairly weak. The Open MPI Team recommends that you use ssh ("Secure Shell") to launch remote programs as it uses a much stronger authentication system.


172. Should I use + in my .rhosts file?

No!

While there are a very small number of cases where using "+" in your .rhosts file may be acceptable, the Open MPI Team highly recommends that you do not.

Using a "+" in your .rhosts file indicates that you will allow any machine and/or any user to connect as you. This is extremely dangerous, especially on machines that are connected to the internet. Consider the fact that anyone on the internet can connect to your machine (as you) -- it should strike fear into your heart.

The + should not be used for either field of the .rhosts file.

Instead, you should use the full and proper hostname and username of accounts that are authorized to remotely login as you to that machine (or machines). This is usually just a list of your own username on a list of machines that you wish to run Open MPI with. See this FAQ entry for further details, as well as your local rsh documentation.

Additionally, the Open MPI Team strongly recommends that rsh not be used in unscheduled environments (especially those connected to the internet) -- it is considered weak remote authentication. Instead, we recommend the use of ssh -- the secure remote shell. See this FAQ entry for more details.


173. What versions of BProc does Open MPI work with?

BProc support was dropped from Open MPI in the Open MPI v1.3 series.

The last version of Open MPI to include BProc support was Open MPI 1.2.9, which was released in February of 2009.

Prior to that (as of December 2005), Open MPI supported recent versions of BProc, such as those found in Clustermatic. We did not test older forks of the BProc project, such as those from Scyld (now defunct). Since Open MPI's BProc support used some advanced features from recent BProc versions, it is somewhat doubtful (but untested) whether it would work on Scyld systems.


174. What pre-requisites are necessary for running an Open MPI job under BProc?

In general, they are the same for running Open MPI jobs in other environments (see this FAQ category for more general information).

However, it is worth noting that BProc may not bring all necessary dynamic libraries along with a process when it is migrated to a back-end compute node. In addition, Open MPI opens components on the fly (i.e., after the process has started), so if these components are unavailable on the back-end compute nodes, Open MPI applications may fail.

In general, the Open MPI team recommends one of the following two solutions when running on BProc clusters (in order of preference):

  1. Compile Open MPI statically, meaning that Open MPI's libraries produce static ".a" libraries and all components are included in the library (as opposed to dynamic ".so" libraries, and separate ".so" files for each component that are found and loaded at run-time) so that applications do not need to find any shared libraries or components when they are migrated to back-end compute nodes. This can be accomplished by specifying --enable-static --disable-shared to the configure script when building Open MPI (see the example after this list).
  2. If you do not wish to use static compilation, ensure that Open MPI is fully installed on all nodes (i.e., the head node and all compute nodes) in the same directory location. For example, if Open MPI is installed in /opt/openmpi-1.8.1 on the head node, ensure that it is also installed in that same directory on all the compute nodes.
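As a hedged example of option 1 above (the installation prefix is illustrative), a static build could be configured as:

shell$ ./configure --prefix=/opt/openmpi --enable-static --disable-shared
shell$ make all install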


175. How do I run jobs under Torque / PBS Pro?

The short answer is just to use mpirun as normal.

Open MPI automatically obtains both the list of hosts and how many processes to start on each host from Torque / PBS Pro directly. Hence, it is unnecessary to specify the --hostfile, --host, or -np options to mpirun. Open MPI will use PBS/Torque-native mechanisms to launch and kill processes ([rsh] and/or ssh are not required).

For example:

# Allocate a PBS job with 4 nodes
shell$ qsub -I -lnodes=4
# Now run an Open MPI job on all the nodes allocated by PBS/Torque
# (starting with Open MPI v1.2; you need to specify -np for the 1.0
# and 1.1 series).
shell$ mpirun my_mpi_application

This will run the 4 MPI processes on the nodes that were allocated by PBS/Torque. Or, if submitting a script:

shell$ cat my_script.sh
#!/bin/sh
mpirun my_mpi_application
shell$ qsub -l nodes=4 my_script.sh


176. Does Open MPI support Open PBS?

As of this writing, Open PBS is so ancient that we are not aware of any sites running it. As such, we have never tested Open MPI with Open PBS and therefore do not know if it would work or not.


177. How does Open MPI get the list of hosts from Torque / PBS Pro?

Open MPI has changed how it obtains hosts from Torque / PBS Pro over time:

  • v1.0 and v1.1 series: The list of hosts allocated to a Torque / PBS Pro job is obtained directly from the scheduler using the internal TM API.
  • v1.2 series: Due to scalability limitations in how the TM API was used in the v1.0 and v1.1 series, Open MPI was modified to read the $PBS_NODEFILE to obtain hostnames. Specifically, reading the $PBS_NODEFILE is much faster at scale than how the v1.0 and v1.1 series used the TM API.

It is possible that future versions of Open MPI may switch back to using the TM API in a more scalable fashion, but there isn't currently a huge demand for it (reading the $PBS_NODEFILE works just fine).

Note that the TM API is used to launch processes in all versions of Open MPI; the only thing that has changed over time is how Open MPI obtains hostnames.


178. What happens if $PBS_NODEFILE is modified?

Bad Things will happen.

We've had reports from some sites that system administrators modify the $PBS_NODEFILE in each job according to local policies. This will currently cause Open MPI to behave in an unpredictable fashion. As long as no new hosts are added to the hostfile, it usually means that Open MPI will incorrectly map processes to hosts, but in some cases it can cause Open MPI to fail to launch processes altogether.

The best course of action is to not modify the $PBS_NODEFILE.


179. Can I specify a hostfile or use the --host option to mpirun when running in a Torque / PBS environment?

As of version v1.2.1, no.

Open MPI will fail to launch processes properly when a hostfile is specified on the mpirun command line, or if the mpirun [--host] option is used.

We're working on correcting the error. A future version of Open MPI will likely launch on the hosts specified either in the hostfile or via the --host option as long as they are a proper subset of the hosts allocated to the Torque / PBS Pro job.


180. How do I run with the SGE launcher?

Support for SGE is included in Open MPI version 1.2 and later.

NOTE: To build SGE support in v1.3, you will need to explicitly request the SGE support with the "--with-sge" command line switch to Open MPI's configure script.

See this FAQ entry for a description of how to correctly build Open MPI with SGE support.

To verify if support for SGE is configured into your Open MPI installation, run ompi_info as shown below and look for gridengine. The components you will see are slightly different between v1.2 and v1.3.

For Open MPI 1.2:

shell$ ompi_info | grep gridengine
                 MCA ras: gridengine (MCA v1.0, API v1.0, Component v1.2)
                 MCA pls: gridengine (MCA v1.0, API v1.0, Component v1.2)

For Open MPI 1.3:

shell$ ompi_info | grep gridengine
                 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)

Open MPI will automatically detect when it is running inside SGE and will just "do the Right Thing."

Specifically, if you execute an mpirun command in a SGE job, it will automatically use the SGE mechanisms to launch and kill processes. There is no need to specify what nodes to run on -- Open MPI will obtain this information directly from SGE and default to a number of processes equal to the slot count specified. For example, this will run 4 MPI processes on the nodes that were allocated by SGE:

# Get the environment variables for SGE
# (Assuming SGE is installed at /opt/sge and $SGE_CELL is 'default' in your environment)
# C shell settings
shell% source /opt/sge/default/common/settings.csh

# bourne shell settings
shell$ . /opt/sge/default/common/settings.sh

# Allocate an SGE interactive job with 4 slots from a parallel
# environment (PE) named 'orte' and run a 4-process Open MPI job
shell$ qrsh -pe orte 4 -b y mpirun -np 4 a.out

There are also other ways to submit jobs under SGE:

# Submit a batch job with the 'mpirun' command embedded in a script
shell$ qsub -pe orte 4 my_mpirun_job.csh

# Submit an SGE and OMPI job and mpirun in one line
shell$ qrsh -V -pe orte 4 mpirun hostname

# Use qstat(1) to show the status of SGE jobs and queues
shell$ qstat -f

In reference to the setup, be sure you have a Parallel Environment (PE) defined for submitting parallel jobs. You don't have to name your PE "orte". The following example shows what a PE named 'orte' might look like:

% qconf -sp orte
   pe_name            orte
   slots              99999
   user_lists         NONE
   xuser_lists        NONE
   start_proc_args    NONE
   stop_proc_args     NONE
   allocation_rule    $fill_up
   control_slaves     TRUE
   job_is_first_task  FALSE
   urgency_slots      min
   accounting_summary FALSE
   qsort_args         NONE

   "qsort_args" is necessary with the Son of Grid Engine distribution,
   version 8.1.1 and later, and probably only applicable to it.  For
   very old versions of SGE, omit "accounting_summary" too.

   You may want to alter other parameters, but the important one is
   "control_slaves", specifying that the environment has "tight
   integration".  Note also the lack of a start or stop procedure.
   The tight integration means that mpirun automatically picks up the
   slot count to use as a default in place of the '-np' argument,
   picks up a host file, spawns remote processes via 'qrsh' so that
   SGE can control and monitor them, and creates and destroys a
   per-job temporary directory ($TMPDIR), in which Open MPI's
   directory will be created (by default).
 
   Be sure the queue will make use of the PE that you specified:


% qconf -sq all.q
...
pe_list               make cre orte
...

To determine whether the SGE parallel job is successfully launched to the remote nodes, you can pass in the MCA parameter "--mca plm_base_verbose 1" to mpirun.

This will add a -verbose flag to the qrsh -inherit command that is used to send parallel tasks to the remote SGE execution hosts. It will show whether the connections to the remote hosts are established successfully or not.
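For example (the application name is hypothetical):

shell$ mpirun --mca plm_base_verbose 1 -np 4 ./my_mpi_program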

Various SGE documentation, with pointers to more, is available at the Son of Grid Engine site, and configuration instructions can be found at the Son of Grid Engine configuration how-to site.


181. Does the SGE tight integration support the -notify flag to qsub?

If you are running SGE6.2 Update 3 or later, then the -notify flag is supported. If you are running earlier versions, then the -notify flag will not work and using it will cause the job to be killed.

To use -notify, one has to be careful. First, let us review what -notify does. Here is an excerpt from the qsub man page for the -notify flag.

-notify
This flag, when set causes Sun Grid Engine to send
warning signals to a running job prior to sending the
signals themselves. If a SIGSTOP is pending, the job
will receive a SIGUSR1 several seconds before the SIGSTOP.
If a SIGKILL is pending, the job will receive a SIGUSR2
several seconds before the SIGKILL. The amount of time
delay is controlled by the notify parameter in each
queue configuration.

Let us assume the reason you want to use the -notify flag is to get the SIGUSR1 signal prior to getting the SIGTSTP signal. As mentioned in this FAQ entry, one could run the job as shown in this batch script.

#! /bin/bash
#$ -S /bin/bash
#$ -V
#$ -cwd
#$ -N Job1
#$ -pe orte 16
#$ -j y
#$ -l h_rt=00:20:00
mpirun -np 16 -mca orte_forward_job_control 1 a.out

However, one has to make one of two changes to this script for things to work properly. By default, a SIGUSR1 signal will kill a shell script. So we have to make sure that does not happen. Here is one way to handle it.

#! /bin/bash
#$ -S /bin/bash
#$ -V
#$ -cwd
#$ -N Job1
#$ -pe orte 16
#$ -j y
#$ -l h_rt=00:20:00
exec mpirun -np 16 -mca orte_forward_job_control 1 a.out

Alternatively, one can catch the signals in the script instead of doing an exec on the mpirun.

#! /bin/bash
#$ -S /bin/bash
#$ -V
#$ -cwd
#$ -N Job1
#$ -pe orte 16
#$ -j y
#$ -l h_rt=00:20:00
function sigusr1handler()
{
        echo "SIGUSR1 caught by shell script" 1>&2
}
function sigusr2handler()
{
        echo "SIGUSR2 caught by shell script" 1>&2
}
trap sigusr1handler SIGUSR1
trap sigusr2handler SIGUSR2
mpirun -np 16 -mca orte_forward_job_control 1 a.out


182. Can I suspend and resume my job?

A new feature was added into Open MPI 1.3.1 that supports suspend/resume of an MPI job. To suspend the job, you send a SIGTSTP (not SIGSTOP) signal to mpirun. mpirun will catch this signal and forward it to the a.outs as a SIGSTOP signal. To resume the job, you send a SIGCONT signal to mpirun which will be caught and forwarded to the a.outs.

By default, this feature is not enabled. This means that both the SIGTSTP and SIGCONT signals will simply be consumed by the mpirun process. To have them forwarded, you have to run the job with --mca orte_forward_job_control 1. Here is an example on Solaris.

shell$ mpirun -mca orte_forward_job_control 1 -np 2 a.out

In another window, we suspend and continue the job.

shell$ prstat -p 15301,15303,15305
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 15305 rolfv     158M   22M cpu1     0    0   0:00:21 5.9% a.out/1
 15303 rolfv     158M   22M cpu2     0    0   0:00:21 5.9% a.out/1
 15301 rolfv    8128K 5144K sleep   59    0   0:00:00 0.0% orterun/1

shell$ kill -TSTP 15301
shell$ prstat -p 15301,15303,15305
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 15303 rolfv     158M   22M stop    30    0   0:01:44  21% a.out/1
 15305 rolfv     158M   22M stop    20    0   0:01:44  21% a.out/1
 15301 rolfv    8128K 5144K sleep   59    0   0:00:00 0.0% orterun/1

shell$ kill -CONT 15301
shell$ prstat -p 15301,15303,15305
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 15305 rolfv     158M   22M cpu1     0    0   0:02:06  17% a.out/1
 15303 rolfv     158M   22M cpu3     0    0   0:02:06  17% a.out/1
 15301 rolfv    8128K 5144K sleep   59    0   0:00:00 0.0% orterun/1

Note that all this does is stop the a.outs. It does not, for example, free any pinned memory when the job is in the suspended state.

To get this to work under the SGE environment, you have to change the suspend_method entry in the queue. It has to be set to SIGTSTP. Here is an example of what a queue should look like.

shell$ qconf -sq all.q
qname                 all.q
[...snip...]
starter_method        NONE
suspend_method        SIGTSTP
resume_method         NONE 

Note that if you need to suspend other types of jobs with SIGSTOP (instead of SIGTSTP) in this queue then you need to provide a script that can implement the correct signals for each job type.


183. How do I run jobs under SLURM?

The short answer is to just use mpirun as normal, provided you configured OMPI --with-slurm. You can also directly launch your application using srun if OMPI is configured per this FAQ entry.

The longer answer is that Open MPI supports launching parallel jobs in all three methods that SLURM supports:

  1. Launching via "salloc ...": supported (older versions of SLURM used "srun -A ...")
  2. Launching via "sbatch ...": supported (older versions of SLURM used "srun -B ...")
  3. Launching via "srun -n X my_mpi_application"

Specifically, you can launch Open MPI's mpirun in an interactive SLURM allocation (via the salloc command) or you can submit a script to SLURM (via the sbatch command), or you can "directly" launch MPI executables via srun.

Open MPI automatically obtains both the list of hosts and how many processes to start on each host from SLURM directly. Hence, it is unnecessary to specify the --hostfile, --host, or -np options to mpirun. Open MPI will also use SLURM-native mechanisms to launch and kill processes ([rsh] and/or ssh are not required).

For example:

# Allocate a SLURM job with 4 nodes
shell$ salloc -N 4 sh
# Now run an Open MPI job on all the nodes allocated by SLURM
# (Note that you need to specify -np for the 1.0 and 1.1 series;
# the -np value is inferred directly from SLURM starting with the 
# v1.2 series)
shell$ mpirun my_mpi_application

This will run the 4 MPI processes on the nodes that were allocated by SLURM. Equivalently, you can do this:

# Allocate a SLURM job with 4 nodes and run your MPI application in it
shell$ salloc -N 4 mpirun my_mpi_application

Or, if submitting a script:

shell$ cat my_script.sh
#!/bin/sh
mpirun my_mpi_application
shell$ sbatch -N 4 my_script.sh
srun: jobid 1234 submitted
shell$


184. Does Open MPI support "srun -n X my_mpi_application"?

Yes, if you have configured OMPI --with-pmi=foo, where foo is the path to the directory where pmi.h/pmi2.h is located. Slurm (> 2.6, > 14.03) installs PMI-2 support by default.

Older versions of Slurm install PMI-1 by default. If you desire PMI-2, Slurm requires that you manually install that support. When the --with-pmi option is given, OMPI will automatically determine if PMI-2 support was built and use it in place of PMI-1.
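As a hedged example (the Slurm installation path is an assumption; use whatever directory contains pmi.h/pmi2.h on your system):

shell$ ./configure --prefix=/path/to/openmpi --with-slurm --with-pmi=/opt/slurm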


185. I use SLURM on a cluster with the OpenFabrics network stack. Do I need to do anything special?

Yes. You need to ensure that SLURM sets up the locked memory limits properly. Be sure to see this FAQ entry about locked memory and this FAQ entry for references about SLURM.


186. Any issues with Slurm 2.6.3?

Yes. The Slurm 2.6.3 and 14.03 releases have a bug in their PMI-2 support.

For the slurm-2.6 branch, it is recommended to use the latest version (2.6.9 as of 2014/4), which is known to work properly with pmi2.

For the slurm-14.03 branch, the fix will be in 14.03.1.


187. How do I reduce startup time for jobs on large clusters?

There are several ways to reduce the startup time on large clusters. Some of them are described on this page. We continue to work on making startup even faster, especially on the large clusters coming in future years.

Open MPI 1.3 is significantly faster and more robust than its predecessors. We recommend that anyone running large jobs and/or on large clusters make the upgrade to the 1.3 series.


188. Where should I put my libraries: Network vs. local filesystems?

Open MPI itself doesn't really care where its libraries are stored. However, where they are stored does have an impact on startup times, particularly for large clusters, which can be mitigated somewhat through use of Open MPI's configuration options.

Startup times will always be minimized by storing the libraries local to each node, either on local disk or in RAM-disk. The latter is sometimes problematic since the libraries do consume some space, thus potentially reducing memory that would have been available for MPI processes.

There are two main considerations for large clusters that need to place the Open MPI libraries on networked file systems:

  • While DSO's are more flexible, you definitely do not want to use them when the Open MPI libraries will be mounted on a network file system! Doing so will lead to significant network traffic and delayed start times, especially on clusters with a large number of nodes. Instead, be sure to configure your build with --disable-dlopen. This will include the DSO's in the main libraries, resulting in much faster startup times.
  • Many networked file systems use automount for user level directories, as well as for some locally administered system directories. There are many reasons why system administrators may choose to automount such directories. MPI jobs, however, tend to launch very quickly, thereby creating a situation wherein a large number of nodes will nearly simultaneously demand automount of a specific directory. This can overload NFS servers, resulting in delayed response or even failed automount requests.

    Note that this applies to both automount of directories containing Open MPI libraries as well as directories containing user applications. Since these are unlikely to be the same location, multiple automount requests from each node are possible, thus increasing the level of traffic.


189. Static vs shared libraries?

It is perfectly fine to use either shared or static libraries. Shared libraries will save memory when operating multiple processes per node, especially on clusters with high numbers of cores on a node, but can also take longer to launch on networked file systems (see network vs. local filesystem FAQ entry for suggestions on how to mitigate such problems).


190. How do I reduce the time to wireup OMPI's out-of-band communication system?

Open MPI's run-time uses an out-of-band (OOB) communication subsystem to pass messages during the launch, initialization, and termination stages for the job. These messages allow mpirun to tell its daemons what processes to launch, and allow the daemons in turn to forward stdio to mpirun, update mpirun on process status, etc.

The OOB uses TCP sockets for its communication, with each daemon opening a socket back to mpirun upon startup. In a large cluster, this can mean thousands of connections being formed on the node where mpirun resides, and requires that mpirun actually process all these connection requests. Mpirun defaults to processing connection requests sequentially - so on large clusters, a backlog can be created that can cause remote daemons to timeout waiting for a response.

Fortunately, Open MPI provides an alternative mechanism for processing connection requests that helps alleviate this problem. Setting the MCA parameter oob_tcp_listen_mode to listen_thread causes mpirun to startup a separate thread dedicated to responding to connection requests. Thus, remote daemons receive a quick response to their connection request, allowing mpirun to deal with the message as soon as possible.

This parameter can be included in the default MCA parameter file, placed in the user's environment, or added to the mpirun command line. See this FAQ entry for more details on how to set MCA parameters.
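For example, a sketch of setting it in the default MCA parameter file (commonly $prefix/etc/openmpi-mca-params.conf; adjust for your installation):

# In the default MCA parameter file
oob_tcp_listen_mode = listen_thread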


191. Why is my job failing because of file descriptor limits?

This is a known issue in Open MPI releases prior to the v1.3 series. The problem lies in the connection topology for Open MPI's out-of-band (OOB) communication subsystem. Prior to the 1.3 series, a fully-connected topology was used that required every process to open a connection to every other process in the job. This can rapidly overwhelm the usual system limits.

There are two methods you can use to circumvent the problem. First, upgrade to the v1.3 series if you can - this would be our recommended approach as there are considerable improvements in that series vs. the 1.2 one.

If you cannot upgrade and must stay with the v1.2 series, then you need to increase the number of file descriptors in your system limits. This commonly requires that your system administrator increase the number of file descriptors allowed by the system itself. The number required depends both on the number of nodes in your cluster and the maximum number of processes you plan to run on each node. Assuming you want to allow jobs that fully occupy the cluster, then the minimum number of file descriptors you will need is roughly (#procs_on_a_node + 1) * #procs_in_the_job.
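As a purely illustrative calculation, a cluster that allows 16 processes per node running a 1,024-process job would need roughly (16 + 1) * 1024 = 17,408 file descriptors.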

It is always wise to have a few extra just in case :-)

Note that this only covers the file descriptors needed for the out-of-band communication subsystem. It specifically does not address file descriptors needed to support the MPI TCP transport, if that is being used on your system. If it is, then additional file descriptors will be required for those TCP sockets. Unfortunately, a simple formula cannot be provided for that value as it depends completely on the number of point-to-point TCP connections being made. If you believe that users may want to fully connect an MPI job via TCP, then it would be safest to simply double the number of file descriptors calculated above.

This can, of course, get to be a really big number...which is why you might want to consider upgrading to the v1.3 series, where OMPI only opens #nodes OOB connections on each node. We are currently working on even more sparsely connected topologies for very large clusters, with the goal of constraining the number of connections opened on a node to an arbitrary number as specified by an MCA parameter.


192. I know my cluster's configuration - how can I take advantage of that knowledge?

Clusters rarely change from day-to-day, and large clusters rarely change at all. If you know your cluster's configuration, there are several steps you can take to both reduce Open MPI's memory footprint and reduce the launch time of large-scale applications. These steps use a combination of build-time configuration options to eliminate components - thus eliminating their libraries and avoiding unnecessary component open/close operations - as well as run-time MCA parameters to specify what modules to use by default for most users.

One way to save memory is to avoid building components that will actually never be selected by the system. Unless MCA parameters specify which components to open, built components are always opened and tested as to whether or not they should be selected for use. If you know that a component can build on your system, but due to your cluster's configuration will never actually be selected, then it is best to simply configure OMPI to not build that component by using the --enable-mca-no-build configure option.

For example, if you know that your system will only utilize the "ob1" component of the PML framework, then you can no_build all the others. This not only reduces memory in the libraries, but also reduces memory footprint that is consumed by Open MPI opening all the built components to see which of them can be selected to run.
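As a hedged sketch (the component name given here is only an example; list whichever framework-component pairs do not apply to your cluster):

shell$ ./configure --prefix=/path/to/openmpi --enable-mca-no-build=pml-cm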

In some cases, however, a user may optionally choose to use a component other than the default. For example, you may want to build all of the routed framework components, even though the vast majority of users will simply use the default binomial component. This means you have to allow the system to build the other components, even though they may rarely be used.

You can still save launch time and memory, though, by setting the routed=binomial MCA parameter in the default MCA parameter file. This causes OMPI to not open the other components during startup, but allows users to override this on their command line or in their environment so no functionality is lost - you just save some memory and time.
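
For example, assuming the system-wide default parameter file is $prefix/etc/openmpi-mca-params.conf (see the FAQ entry on setting MCA parameter values in files), the entry is just one line:

# Pre-select the binomial routed component; users can still override
# this on their command line or in their environment
routed = binomial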

Rather than have to figure this all out by hand, we are working on a new OMPI tool called ompi-profiler. When run on a cluster, it will tell you the selection results of all frameworks - i.e., for each framework on each node, which component was selected to run - and a variety of other information that will help you tailor Open MPI for your cluster. Stay tuned for more info as we continue to work on ways to improve your performance...


193. What is the Modular Component Architecture (MCA)?

The Modular Component Architecture (MCA) is the backbone for much of Open MPI's functionality. It is a series of frameworks, components, and modules that are assembled at run-time to create an MPI implementation.

Frameworks: An MCA framework manages zero or more components at run time and is targeted at a specific task (e.g., provide MPI collective operation functionality). Each MCA framework supports a single component type, but may support multiple versions of that type. The framework uses the services from the MCA base functionality to find and/or load components.

Components: An MCA component is an implementation of a framework's interface. It is a standalone collection of code that can be bundled into a plugin that can be inserted into the Open MPI code base at run-time and/or compile-time.

Modules: An MCA module is an instance of a component (in the C++ sense of the word "instance"; an MCA component is analogous to a C++ class). For example, if a node running an Open MPI application has multiple ethernet NICs, the Open MPI application will contain one TCP MPI point-to-point component, but two TCP point-to-point modules.

Frameworks, components, and modules can be dynamic or static. That is, they can be available as plugins or they may be compiled statically into libraries (e.g., libmpi).


194. What are MCA parameters?

MCA parameters are the basic unit of run-time tuning for Open MPI. They are simple "key = value" pairs that are used extensively throughout the code base. The general rules of thumb that the developers use are:

  • Instead of using a constant for an important value, make it an MCA parameter
  • If a task can be implemented in multiple, user-discernible ways, implement as many as possible and make the choice between them an MCA parameter

For example, an easy MCA parameter to describe is the boundary between short and long messages in TCP wire-line transmissions. "Short" messages are sent eagerly whereas "long" messages use a rendezvous protocol. The decision point between these two protocols is the overall size of the message (in bytes). By making this value an MCA parameter, it can be changed at run-time by the user or system administrator to use a sensible value for a particular environment or set of hardware (e.g., a value suitable for 100 Mbps Ethernet is probably not suitable for Gigabit Ethernet, and may require a different value for 10 Gigabit Ethernet).
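
As an illustration, the TCP BTL exposes this boundary as the btl_tcp_eager_limit MCA parameter (verify the exact name and its default with "ompi_info --param btl tcp" on your installation); the value used here is purely an example:

shell$ mpirun --mca btl_tcp_eager_limit 65536 -np 4 a.out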

Note that MCA parameters may be set in several different ways (described in another FAQ entry). This allows, for example, system administrators to fine-tune the Open MPI installation for their hardware / environment such that normal users can simply use the default values.

More specifically, HPC environments -- and the applications that run on them -- tend to be unique. Providing extensive run-time tuning capabilities through MCA parameters allows the customization of Open MPI to each system's / user's / application's particular needs.


195. What frameworks are in Open MPI?

There are three types of frameworks in Open MPI: those in the MPI layer (OMPI), those in the run-time layer (ORTE), and those in the operating system / platform layer (OPAL).

The specific list of frameworks varies between each major release series of Open MPI. See the links below to FAQ entries for specific versions of Open MPI:


196. What frameworks are in Open MPI v1.2 (and prior)?

The comprehensive list of frameworks in Open MPI is continually being augmented. As of August 2005, here is the current list:

OMPI frameworks

  • allocator: Memory allocator
  • bml: BTL management layer (managing multiple devices)
  • btl: Byte transfer layer (point-to-point byte movement)
  • coll: MPI collective algorithms
  • io: MPI-2 I/O functionality
  • mpool: Memory pool management
  • pml: Point-to-point management layer (fragmenting, reassembly, top-layer protocols, etc.)
  • osc: MPI-2 one-sided communication
  • ptl: (outdated / deprecated) MPI point-to-point transport layer
  • rcache: Memory registration management
  • topo: MPI topology information

ORTE frameworks

  • errmgr: Error manager
  • gpr: General purpose registry
  • iof: I/O forwarding
  • ns: Name server
  • oob: Out-of-band communication
  • pls: Process launch subsystem
  • ras: Resource allocation subsystem
  • rds: Resource discovery subsystem
  • rmaps: Resource mapping subsystem
  • rmgr: Resource manager (upper meta layer for all other Resource frameworks)
  • rml: Remote messaging layer (routing of OOB messages)
  • schema: Name schemas
  • sds: Startup discovery services
  • soh: State of health

OPAL frameworks

  • maffinity: Memory affinity
  • memory: Memory hooks
  • paffinity: Processor affinity
  • timer: High-resolution timers


197. What frameworks are in Open MPI v1.3?

The comprehensive list of frameworks in Open MPI is continually being augmented. As of November 2008, here is the current list in the Open MPI v1.3 series:

OMPI frameworks

  • allocator: Memory allocator
  • bml: BTL management layer
  • btl: MPI point-to-point Byte Transfer Layer, used for MPI point-to-point messages on some types of networks
  • coll: MPI collective algorithms
  • crcp: Checkpoint/restart coordination protocol
  • dpm: MPI-2 dynamic process management
  • io: MPI-2 I/O
  • mpool: Memory pooling
  • mtl: Matching transport layer, used for MPI point-to-point messages on some types of networks
  • osc: MPI-2 one-sided communications
  • pml: MPI point-to-point management layer
  • pubsub: MPI-2 publish/subscribe management
  • rcache: Memory registration cache
  • topo: MPI topology routines

ORTE frameworks

  • errmgr: RTE error manager
  • ess: RTE environment-specific services
  • filem: Remote file management
  • grpcomm: RTE group communications
  • iof: I/O forwarding
  • odls: OpenRTE daemon local launch subsystem
  • oob: Out of band messaging
  • plm: Process lifecycle management
  • ras: Resource allocation system
  • rmaps: Resource mapping system
  • rml: RTE message layer
  • routed: Routing table for the RML
  • snapc: Snapshot coordination

OPAL frameworks

  • backtrace: Debugging call stack backtrace support
  • carto: Cartography (host/network mapping) support
  • crs: Checkpoint and restart service
  • installdirs: Installation directory relocation services
  • maffinity: Memory affinity
  • memchecker: Run-time memory checking
  • memcpy: Memcpy copy support
  • memory: Memory management hooks
  • paffinity: Processor affinity
  • timer: High-resolution timers


198. How do I know what components are in my Open MPI installation?

The ompi_info command, in addition to providing a wealth of configuration information about your Open MPI installation, will list all components (and the frameworks that they belong to) that are available. These include system-provided components as well as user-provided components.


199. How do I install my own components into an Open MPI installation?

By default, Open MPI looks in two places for components at run-time (in order):

  1. $prefix/lib/openmpi/: This is the system-provided components directory, part of the installation tree of Open MPI itself.
  2. $HOME/.openmpi/components/: This is where users can drop their own components that will automatically be "seen" by Open MPI at run-time. This is ideal for developmental, private, or otherwise unstable components.

Note that the directories and search ordering used for finding components in Open MPI are, themselves, controlled by an MCA parameter. Setting the mca_component_path MCA parameter changes this value (a colon-delimited list of directories).
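
For example (the /opt/openmpi and /shared/ompi-components directories shown here are illustrative; note that the value you set replaces the default list, so include the system components directory as well if you still need it searched):

shell$ mpirun --mca mca_component_path /opt/openmpi/lib/openmpi:/shared/ompi-components ...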

Note also that components are only used on nodes where they are "visible." Hence, if your $prefix/lib/openmpi/ directory is on a local disk that is not shared via a network filesystem to other nodes where you run MPI jobs, then components that are installed to that directory will only be used by MPI jobs running on the local node.

More specifically: components have the same visibility as normal files. If you need a component to be available to all nodes where you run MPI jobs, then you need to ensure that it is visible on all nodes (typically either by installing it on all nodes for non-networked filesystem installs, or by installing it in a directory that is visible to all nodes via a networked filesystem). Open MPI does not automatically send components to remote nodes when MPI jobs are run.


200. How do I know what MCA parameters are available?

The ompi_info command can list the parameters for a given component, all the parameters for a specific framework, or all parameters. Most parameters contain a description of the parameter; all will show the parameter's current value.

For example:

shell$ ompi_info --param all all

Shows all the MCA parameters for all components that ompi_info finds, whereas:

shell$ ompi_info --param btl all

Shows all the MCA parameters for all BTL components that ompi_info finds. Finally:

shell$ ompi_info --param btl tcp

Shows all the MCA parameters for the TCP BTL component.


201. How do I set the value of MCA parameters?

There are several ways to set MCA parameters; they are searched in the order listed below.

  1. Command line: The highest-precedence method is setting MCA parameters on the command line. For example:

    shell$ mpirun --mca mpi_show_handle_leaks 1 -np 4 a.out
    

    This sets the MCA parameter mpi_show_handle_leaks to the value of 1 before running a.out with four processes. In general, the format used on the command line is "--mca <param_name> <value>".

    Note that when setting multi-word values, you need to use quotes to ensure that the shell and Open MPI understand that they are a single value. For example:

    shell$ mpirun --mca param "value with multiple words" ...
    

  2. Environment variable: Next, environment variables are searched. Any environment variable named OMPI_MCA_<param_name> will be used. For example, the following has the same effect as the previous example (for sh-flavored shells):

    shell$ OMPI_MCA_mpi_show_handle_leaks=1
    shell$ export OMPI_MCA_mpi_show_handle_leaks
    shell$ mpirun -np 4 a.out
    

    Or, for csh-flavored shells:

    shell% setenv OMPI_MCA_mpi_show_handle_leaks 1
    shell% mpirun -np 4 a.out
    

    Note that setting environment variables to values with multiple words requires quoting, such as:

    # sh-flavored shells
    shell$ OMPI_MCA_param="value with multiple words"
    
    # csh-flavored shells
    shell% setenv OMPI_MCA_param "value with multiple words"
    

  3. Aggregate MCA parameter files: Simple text files can be used to set MCA parameter values for a specific application. See this FAQ entry (Open MPI version 1.3 and higher).
  4. Files: Finally, simple text files can be used to set MCA parameter values. Parameters are set one per line (comments are permitted). For example:

    # This is a comment
    # Set the same MCA parameter as in previous examples
    mpi_show_handle_leaks = 1
    

    Note that quotes are not necessary for setting multi-word values in MCA parameter files. Indeed, if you use quotes in the MCA parameter file, they will be used as part of the value itself. For example:

    # The following two values are different:
    param1 = value with multiple words
    param2 = "value with multiple words"
    

    By default, two files are searched (in order):

    1. $HOME/.openmpi/mca-params.conf: The user-supplied set of values takes the highest precedence.
    2. $prefix/etc/openmpi-mca-params.conf: The system-supplied set of values has a lower precedence.

    More specifically, the MCA parameter mca_param_files specifies a colon-delimited path of files to search for MCA parameters. Files to the left have lower precedence; files to the right are higher precedence.

    Keep in mind that, just like components, these parameter files are only relevant where they are "visible" (see this FAQ entry). Specifically, Open MPI does not read all the values from these files during startup and then send them to all nodes in the job -- the files are read on each node during each process' startup. This is intended behavior: it allows for per-node customization, which is especially relevant in heterogeneous environments.


202. What are Aggregate MCA (AMCA) parameter files?

Starting with version 1.3, aggregate MCA (AMCA) parameter files contain MCA parameter key/value pairs similar to the $HOME/.openmpi/mca-params.conf file described in this FAQ entry.

The motivation behind AMCA parameter sets came from the realization that for certain applications a large number of MCA parameters are required for the application to run well and/or as the user expects. Since these MCA parameters are application specific (or even application run specific) they should not be set in a global manner, but only pulled in as determined by the user.

MCA parameters set in AMCA parameter files will override any MCA parameters supplied in global parameter files (e.g., $HOME/.openmpi/mca-params.conf), but not command line or environment parameters.

AMCA parameter files are typically supplied on the command line via the -am option.

For example, consider an AMCA parameter file called foo.conf placed in the same directory as the application a.out. A user will typically run the application as:

shell$ mpirun -np 2 a.out

To use the foo.conf AMCA parameter file this command line changes to:

shell$ mpirun -np 2 -am foo.conf a.out

If the user wants to override a parameter set in foo.conf they can add it to the command line as seen below.

shell$ mpirun -np 2 -am foo.conf -mca btl tcp,self a.out

AMCA parameter files can be coupled if more than one file is to be used. If we have another AMCA parameter file called bar.conf that we want to use we add it to the command line as follows:

shell$ mpirun -np 2 -am foo.conf:bar.conf a.out

AMCA parameter files are loaded in priority order. This means that the foo.conf AMCA file has priority over the bar.conf file. So if the bar.conf file sets the MCA parameter mpi_leave_pinned=0 and the foo.conf file sets it to mpi_leave_pinned=1, then the value from foo.conf (mpi_leave_pinned=1) will be used.

The locations of AMCA parameter files are resolved in a similar way as the shell. If no path operator is provided (e.g., foo.conf), then Open MPI will search the $SYSCONFDIR/amca-param-sets directory and then the current working directory. If a relative path is specified, then only that path will be searched (e.g., ./foo.conf, baz/foo.conf). If an absolute path is specified, then only that path will be searched (e.g., /bip/boop/foo.conf).

Though the typical use case for AMCA parameter files is to be specified on the command line, they can also be set as MCA parameters in the environment. The MCA parameter (mca_base_param_file_prefix) contains a ':' separated list of AMCA parameter files exactly as they would be passed to the -am command line option. The MCA parameter (mca_base_param_file_path) specifies the path to search for AMCA files with relative paths. By default this is $SYSCONFDIR/amca-param-sets/:$CWD.
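
For example, the same foo.conf:bar.conf selection shown above could be made through the environment rather than the -am option (sh-flavored shells):

shell$ OMPI_MCA_mca_base_param_file_prefix=foo.conf:bar.conf
shell$ export OMPI_MCA_mca_base_param_file_prefix
shell$ mpirun -np 2 a.out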


203. How do I select which components are used?

Each MCA framework has a top-level MCA parameter that helps guide which components are selected to be used at run-time. Specifically, there is an MCA parameter of the same name as each MCA framework that can be used to include or exclude components from a given run.

For example, the btl MCA parameter is used to control which BTL components are used (i.e., MPI point-to-point communications; see this FAQ entry for a full list of MCA frameworks). It can take as a value a comma-separated list of components with the optional prefix "^". For example:

# Tell Open MPI to exclude the tcp and openib BTL components
# and implicitly include all the rest
shell$ mpirun --mca btl ^tcp,openib ...

# Tell Open MPI to include *only* the components listed here and
# implicitly ignore all the rest (i.e., the loopback, shared memory,
# and OpenFabrics (a.k.a., "OpenIB") MPI point-to-point components):
shell$ mpirun --mca btl self,sm,openib ...

Note that ^ can only be the prefix of the entire value because the inclusive and exclusive behavior are mutually exclusive. Specifically, since the exclusive behavior means "use all components except these," it does not make sense to mix it with the inclusive behavior of not specifying it (i.e., "use all of these components"). Hence, something like this:

shell$ mpirun --mca btl self,sm,openib,^tcp ...

does not make sense because it says both "use only the self, sm, and openib components" and "use all components except tcp" and will result in an error.

Just as with all MCA parameters, the btl parameter (and all framework parameters) can be set in multiple different ways.
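
For example, a system administrator could make the inclusive selection above the default for all users by putting it in the system-wide parameter file described in the FAQ entry on setting MCA parameters (the component list shown is just an example):

# In $prefix/etc/openmpi-mca-params.conf
btl = self,sm,openib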


204. What is processor affinity? Does Open MPI support it?

Open MPI supports processor affinity on a variety of systems through process binding, in which each MPI process, along with its threads, is "bound" to a specific subset of processing resources (cores, sockets, etc.). That is, the operating system will constrain that process to run on only that subset. (Other processes might be allowed on the same resources.)

Affinity can improve performance by inhibiting excessive process movement -- for example, away from "hot" caches or NUMA memory. Judicious bindings can improve performance by reducing resource contention (by spreading processes apart from one another) or improving interprocess communications (by placing processes close to one another). Binding can also improve performance reproducibility by eliminating variable process placement. Unfortunately, binding can also degrade performance by inhibiting the OS's ability to balance loads.

You can run the "ompi_info" command and look for "paffinity" components to see if your system is supported. For example:

$ ompi_info | grep paffinity
           MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.0)

Note that processor affinity probably should not be used when a node is over-subscribed (i.e., more processes are launched than there are processors). This can lead to a serious degradation in performance (even more than simply oversubscribing the node). Open MPI will usually detect this situation and automatically disable the use of processor affinity (and display run-time warnings to this effect).

Also see this FAQ entry for how to use processor and memory affinity in Open MPI.


205. What is memory affinity? Does Open MPI support it?

Memory affinity is only relevant for Non-Uniform Memory Access (NUMA) machines, such as "big iron" SGI and Cray machines, or many models of multi-processor Opteron machines. In a NUMA architecture, memory is physically distributed throughout the machine even though it is virtually treated as a single address space. That is, memory may be physically local to one or more processors -- and therefore remote to other processors.

Simply put: some memory will be faster to access (for a given process) than others.

Open MPI supports general and specific memory affinity, meaning that it generally tries to allocate all memory local to the processor that asked for it. When shared memory is used for communication, Open MPI uses memory affinity to make certain pages local to specific processes in order to minimize memory network/bus traffic.

Open MPI supports memory affinity on a variety of systems. You can run the "ompi_info" command and look for "maffinity" components to see if your system is supported. For example:

$ ompi_info | grep maffinity
           MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.0)

Note that memory affinity support is enabled only when processor affinity is enabled. Specifically: using memory affinity does not make sense if processor affinity is not enabled because processes may allocate local memory and then move to a different processor, potentially remote from the memory that it just allocated.

Also see this FAQ entry for how to use processor and memory affinity in Open MPI.


206. How do I tell Open MPI to use processor and/or memory affinity?

Assuming that your system supports processor and memory affinity (check ompi_info for "paffinity" and "maffinity" components), you can explicitly tell Open MPI to use them when running MPI jobs.

Note that memory affinity support is enabled only when processor affinity is enabled. Specifically: using memory affinity does not make sense if processor affinity is not enabled because processes may allocate local memory and then move to a different processor, potentially remote from the memory that it just allocated.

Also note that processor and memory affinity is meaningless (but harmless) on uniprocessor machines.

How to enable / use processor and memory affinity in Open MPI depends on which version you are using:


207. How do I tell Open MPI to use processor and/or memory affinity in Open MPI v1.2.x? (What is mpi_paffinity_alone?)

Open MPI 1.2 offers only crude control, with the MCA parameter "mpi_paffinity_alone". For example:

$ mpirun --mca mpi_paffinity_alone 1 -np 4 a.out

(Just like any other MCA parameter, mpi_paffinity_alone can be set via any of the normal MCA parameter mechanisms.)

On each node where your job is running, your job's MPI processes will be bound, one-to-one, in the order of their global MPI ranks, to the lowest-numbered processing units (for example, cores or hardware threads) on the node as identified by the OS. Further, memory affinity will also be enabled if it is supported on the node, as described in a different FAQ entry.

If multiple jobs are launched on the same node in this manner, they will compete for the same processing units and severe performance degradation will likely result. Therefore, this MCA parameter is best used when you know your job will be "alone" on the nodes where it will run.

Since each process is bound to a single processing unit, performance will likely suffer catastrophically if processes are multi-threaded.

Depending on how processing units on your node are numbered, the binding pattern may be good, bad, or even disastrous. For example, performance might be best if processes are spread out over all processor sockets on the node. The processor ID numbering, however, might lead to mpi_paffinity_alone filling one socket before moving to another. Indeed, on nodes with multiple hardware threads per core (e.g., "HyperThreads", "SMT", etc.), the numbering could lead to multiple processes being bound to a core before the next core is considered. In such cases, you should probably upgrade to a newer version of Open MPI or use a different, external mechanism for processor binding.

Note that Open MPI will automatically disable processor affinity on any node that is oversubscribed (i.e., where more Open MPI processes are launched in a single job on a node than it has processors) and will print out warnings to that effect.

Also note, however, that processor affinity is not mutually exclusive with Degraded performance mode. Degraded mode is usually only used when oversubscribing nodes (i.e., running more processes on a node than it has processors -- see this FAQ entry for more details about oversubscribing, as well as a definition of Degraded performance mode). It is possible to manually select Degraded performance mode and use processor affinity as long as you are not oversubscribing.


208. How do I tell Open MPI to use processor and/or memory affinity in Open MPI v1.3.x? (What are rank files?)

Open MPI 1.3 supports the mpi_paffinity_alone MCA parameter that is described in this FAQ entry.

Open MPI 1.3 (and higher) also allows a different binding to be specified for each process via a rankfile. Consider the following example:

shell$ cat rankfile
rank 0=host0 slot=2
rank 1=host1 slot=4-7,0
rank 2=host2 slot=1:0
rank 3=host3 slot=1:2-3
shell$ mpirun -np 4 -hostfile hostfile --rankfile                 rankfile ./my_mpi_application
  or
shell$ mpirun -np 4 -hostfile hostfile --mca rmaps_rank_file_path rankfile ./my_mpi_application

The rank file specifies a host node and slot list binding for each MPI process in your job. Note:

  • Typically, the slot list is a comma-delimited list of ranges. The numbering is OS/BIOS-dependent and refers to the finest grained processing units identified by the OS -- for example, cores or hardware threads.
  • Alternatively, a colon can be used in the slot list for socket:core designations. For example, 1:2-3 means cores 2-3 of socket 1.
  • It is strongly recommended that you provide a full rankfile when using such affinity settings, otherwise there would be a very high probability of processor oversubscription and performance degradation.
  • The hosts specified in the rankfile must be known to mpirun, for example via a list of hosts in a hostfile or as obtained from a resource manager.
  • The number of processes np must be provided on the mpirun cmd line.
  • If some processing units are not available -- e.g., due to unpopulated sockets, idled cores, or BIOS settings -- the syntax assumes a logical numbering in which numbers are contiguous despite the physical gaps. You may refer to actual physical numbers with a "p" prefix. For example, rank 4=host3 slot=p3:2 will bind rank4 to the physical socket3 : physical core2 pair.

Rank files are also discussed on the mpirun man page.

If you want to use the same "slot list" binding for each process, presumably in cases where there is only one process per node, you can specify this slot list on the command line rather than having to use a rank file:

shell$ mpirun -np 4 -hostfile hostfile --slot-list 0:1 ./my_mpi_application

Remember, every process will use the same slot list. If multiple processes run on the same host, they will bind to the same resources -- in this case, socket0:core1, presumably oversubscribing that core and ruining performance.

Slot lists can be used to bind to multiple slots, which would be helpful for multi-threaded processes. For example:

  • Two threads per process: rank 0=host1 slot=0,1
  • Four threads per process: rank 0=host1 slot=0,1,2,3

Note that no thread will be bound to a specific slot within the list. OMPI only supports process level affinity; each thread will be bound to all of the slots within the list.


209. How do I tell Open MPI to use processor and/or memory affinity in Open MPI v1.4.x? (How do I use the --by* and --bind-to-* options?)

Open MPI 1.4 supports all the same processor affinity controls as Open MPI v1.3, but also supports additional command-line binding switches to mpirun:

  • --bind-to-none: Do not bind processes. (Default)
  • --bind-to-core: Bind each MPI process to a core.
  • --bind-to-socket: Bind each MPI process to a processor socket.
  • --report-bindings: Report how the launched processes were bound by Open MPI.

In the case of cores with multiple hardware threads (e.g., HyperThreads or SMT), only the first hardware thread on each core is used with the --bind-to-* options. This will hopefully be fixed in the Open MPI v1.5 series.

The above options are typically most useful when used with the following switches that indicate how processes are to be laid out in the MPI job. To be clear: if the following options are used without a --bind-to-* option, they only have the effect of deciding which node a process will run on. Only the --bind-to-* options actually bind a process to a specific (set of) hardware resource(s).

  • --byslot: Alias for --bycore.
  • --bycore: When laying out processes, put sequential MPI processes on adjacent processor cores. (Default)
  • --bysocket: When laying out processes, put sequential MPI processes on adjacent processor sockets.
  • --bynode: When laying out processes, put sequential MPI processes on adjacent nodes.

Note that --bycore and --bysocket lay processes out in terms of the actual hardware rather than by some node-dependent numbering, which is what mpi_paffinity_alone does as described in this FAQ entry.
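
For example, to lay out sequential MPI processes on adjacent processor sockets and bind each process to its socket, while printing the resulting bindings:

shell$ mpirun -np 4 --bysocket --bind-to-socket --report-bindings ./a.out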

Finally, there is a poorly-named "combination" option that affects both process layout counting and binding: --cpus-per-proc (and an even more poorly-named alias --cpus-per-rank).

Editor's note: I feel that these options are poorly named for two reasons: 1) "cpu" is not consistently defined (e.g., it may be a core, or may be a hardware thread, or it may be something else), and 2) even though many users use the terms "rank" and "MPI process" interchangeably, they are NOT the same thing.

This option does the following:

  • Takes an integer argument (ncpus) that indicates how many operating system processor IDs (which may be cores or may be hardware threads) should be bound to each MPI process.
  • Allocates and binds ncpus OS processor IDs to each MPI process. For example, on a machine with 4 processor sockets, each with 4 processor cores, each with one hardware thread:

    shell$ mpirun -np 8 --cpus-per-proc 2 my_mpi_process
    

    This command will bind each MPI process to ncpus=2 cores. All cores on the machine will be used.
  • Note that ncpus cannot be more than the number of OS processor IDs in a single processor socket. Put loosely: --cpus-per-proc only allows binding to multiple cores/threads within a single socket.

The --cpus-per-proc option can also be used with the --bind-to-* options in some cases, but this code is not well tested and may result in unexpected binding behavior. Test carefully to see where processes actually get bound before relying on the behavior for production runs. The --cpus-per-proc and other affinity-related command line options are likely to be revamped some time during the Open MPI v1.5 series.


210. How do I tell Open MPI to use processor and/or memory affinity in Open MPI v1.5.x?

Open MPI 1.5 currently has the same processor affinity controls as Open MPI v1.4. This FAQ entry is a placemarker for future enhancements to the 1.5 series' processor and memory affinity features.

Stay tuned!


211. Does Open MPI support calling fork(), system(), or popen() in MPI processes?

It depends on a lot of factors, including (but not limited to) the operating system, the underlying network stack, and the interconnect support libraries that your Open MPI installation uses.

In some cases, Open MPI will determine that it is not safe to fork(). In these cases, Open MPI will register a pthread_atfork() callback to print a warning when the process forks.

This warning is helpful for legacy MPI applications where the current maintainers are unaware that system() or popen() is being invoked from an obscure subroutine nestled deep in millions of lines of Fortran code (we've seen this kind of scenario many times).

However, this atfork handler can be dangerous because there is no way to unregister an atfork handler. Hence, packages that dynamically open Open MPI's libraries (e.g., Python bindings for Open MPI) may fail if they finalize and unload libmpi, but later call fork. The atfork system will try to invoke Open MPI's atfork handler; nothing good can come of that.

For such scenarios, or if you simply want to disable printing the warning, Open MPI can be set to never register the atfork handler with the mpi_warn_on_fork MCA parameter. For example:

shell$ mpirun --mca mpi_warn_on_fork 0 ...

Of course, systems that dlopen libmpi may not use Open MPI's mpirun, and therefore may need to use a different mechanism to set MCA parameters.
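
In such cases, the same parameter can be set through the environment instead (sh-flavored shells; the Python invocation below is just an illustration of a program that dynamically loads libmpi):

shell$ OMPI_MCA_mpi_warn_on_fork=0
shell$ export OMPI_MCA_mpi_warn_on_fork
shell$ python my_mpi_script.py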


212. I want to run some performance benchmarks with Open MPI. How do I do that?

Running benchmarks is an extremely difficult task to do correctly. There are many, many factors to take into account; it is not as simple as just compiling and running a stock benchmark application. This FAQ entry is by no means a definitive guide, but it does try to offer some suggestions for generating accurate, meaningful benchmarks.

  1. Decide exactly what you are benchmarking and set up your system accordingly. For example, if you are trying to benchmark maximum performance, then many of the suggestions listed below are extremely relevant (be the only user on the systems and network in question, be the only software running, use processor affinity, etc.). If you're trying to benchmark average performance, some of the suggestions below may be less relevant. Regardless, it is critical to know exactly what you're trying to benchmark, and know (not guess) both your system and the benchmark application itself well enough to understand what the results mean.

    To be specific, many benchmark applications are not well understood for exactly what they are testing. There have been many cases where users run a given benchmark application and wrongfully conclude that their system's performance is bad -- solely on the basis of a single benchmark that they did not understand. Read the documentation of the benchmark carefully, and possibly even look into the code itself to see exactly what it is testing.

    Case in point: not all ping-pong benchmarks are created equal. Most users assume that a ping-pong benchmark is a ping-pong benchmark is a ping-pong benchmark. But this is not true; the common ping-pong benchmarks tend to test subtly different things (e.g., NetPIPE, TCP bench, IMB, OSU, etc.). Make sure you understand what your benchmark is actually testing.

  2. Make sure that you are the only user on the systems where you are running the benchmark to eliminate contention from other processes.
  3. Make sure that you are the only user on the entire network / interconnect to eliminate network traffic contention from other processes. This is usually somewhat difficult to do, especially in larger, shared systems. But your most accurate, repeatable results will be achieved when you are the only user on the entire network.
  4. Disable all services and daemons that are not being used. Even "harmless" daemons consume system resources (such as RAM) and cause "jitter" by occasionally waking up, consuming CPU cycles, reading or writing to disk, etc. The optimum benchmark system has an absolute minimum number of system services running.
  5. Use processor affinity on multi-processor/core machines to prevent the operating system from moving MPI processes between processors (and causing unnecessary cache thrashing, for example).

    On NUMA architectures, having processes bumped from one socket to another is more expensive in terms of cache locality (with all of the cache coherency overhead that comes with the loss of it) than in terms of HyperTransport routing (see below).

    Non-NUMA architectures such as the Intel Woodcrest have a flat access time to the South Bridge, but cache locality is still important so CPU affinity is always a good thing to do.

  6. Be sure to understand your system's architecture, particularly with respect to the memory, disk, and network characteristics, and test accordingly. For example, on NUMA architectures (the most common being Opteron), the South Bridge is connected through a HyperTransport link to one CPU on one socket. Which socket depends on the motherboard, but it should be described in the motherboard documentation (it's not always socket 0!). If a process on the other socket needs to write something to a NIC on a PCIE bus behind the South Bridge, it needs to first hop through the first socket. On modern machines (circa late 2006), this hop usually costs something like 100ns (i.e., 0.1 us). If the socket is further away, like in a 4- or 8-socket configuration, there could potentially be more hops, leading to more latency.
  7. Compile your benchmark with the appropriate compiler optimization flags. With some MPI implementations, the compiler wrappers (like mpicc, mpif90, etc.) add optimization flags automatically. Open MPI does not. Add -O or other flags explicitly.
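
    For example (the -O3 level shown here is just an illustration; use whatever optimization flags are appropriate for your compiler and benchmark):

    shell$ mpicc -O3 mybenchmark.c -o mybenchmark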

  8. Make sure your benchmark runs for a sufficient amount of time. Short-running benchmarks are generally less accurate because they take fewer samples; longer-running jobs tend to take more samples.

  9. If your benchmark is trying to benchmark extremely short events (such as the time required for a single ping-pong of messages):

    • Perform some "warmup" events first. Many MPI implementations (including Open MPI) -- and other subsystems that MPI uses -- may use "lazy" semantics to set up and maintain streams of communications. Hence, the first event (or first few events) may well take significantly longer than subsequent events.
    • Use a high-resolution timer if possible -- gettimeofday() only returns millisecond precision (sometimes on the order of several microseconds).
    • Run the event many, many times (hundreds or thousands, depending on the event and the time it takes). Not only does this provide more samples, it may also be necessary, especially when the precision of the timer you're using may be several orders of magnitude less than the event you're trying to benchmark.

  10. Decide whether you are reporting minimum, average, or maximum numbers, and have good reasons why.
  11. Accurately label and report all results. Reproducibility is a major goal of benchmarking; benchmark results are effectively useless if they are not precisely labeled as to exactly what they are reporting. Keep a log and detailed notes about the exact system configuration that you are benchmarking. Note, for example, all hardware and software characteristics (to include hardware, firmware, and software versions as appropriate).


213. I am getting a MPI_Win_free error from IMB-EXT -- what do I do?

When you run IMB-EXT with Open MPI, you'll see a message like this:

[node01.example.com:2228] *** An error occurred in MPI_Win_free
[node01.example.com:2228] *** on win 
[node01.example.com:2228] *** MPI_ERR_RMA_SYNC: error while executing rma sync
[node01.example.com:2228] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)

This is due to a bug in the Intel MPI Benchmarks, known to be in at least versions v3.1 and v3.2. Intel was notified of this bug in May of 2009, but there hasn't been a new IMB release since then.

Here is a small patch that fixes the bug in IMB v3.2:

diff -u imb-3.2-orig/src/IMB_window.c imb-3.2-fixed/src/IMB_window.c
--- imb-3.2-orig/src/IMB_window.c     2008-10-21 04:17:31.000000000 -0400
+++ imb-3.2-fixed/src/IMB_window.c      2009-07-20 09:02:45.000000000 -0400
@@ -140,6 +140,9 @@
                          c_info->rank, 0, 1, c_info->r_data_type,
                          c_info->WIN);
           MPI_ERRHAND(ierr);
           }
+          /* Added a call to MPI_WIN_FENCE, per MPI-2.1 11.2.1 */
+          ierr = MPI_Win_fence(0, c_info->WIN);
+          MPI_ERRHAND(ierr);
           ierr = MPI_Win_free(&c_info->WIN);
           MPI_ERRHAND(ierr);
           }

And here is the corresponding patch for IMB v3.1:

Index: IMB_3.1/src/IMB_window.c
===================================================================
--- IMB_3.1/src/IMB_window.c(revision 1641)
+++ IMB_3.1/src/IMB_window.c(revision 1642)
@@ -140,6 +140,10 @@
                          c_info->rank, 0, 1, c_info->r_data_type, c_info->WIN);
           MPI_ERRHAND(ierr);
           }
+          /* Added a call to MPI_WIN_FENCE here, per MPI-2.1
+             11.2.1 */
+          ierr = MPI_Win_fence(0, c_info->WIN);
+          MPI_ERRHAND(ierr);
           ierr = MPI_Win_free(&c_info->WIN);
           MPI_ERRHAND(ierr);
 }


214. What is the sm BTL?

The sm BTL (shared-memory Byte Transfer Layer) is a low-latency, high-bandwidth mechanism for transferring data between two processes via shared memory. This BTL can only be used between processes executing on the same node.

The sm BTL has high exclusivity. That is, if one process can reach another process via sm, then no other BTL will be considered for that connection.

Note that with OMPI 1.3.2, the sm so-called "FIFOs" were reimplemented and the sizing of the shared-memory area was changed. So, much of this FAQ will distinguish between releases up to OMPI 1.3.1 and releases starting with OMPI 1.3.2.


215. How do I specify use of sm for MPI messages?

Typically, it is unnecessary to do so; OMPI will use the best BTL available for each communication.

Nevertheless, you may use the MCA parameter btl. You should also specify the self BTL for communications between a process and itself. Further, if not all processes in your job will run on the same, single node, then you also need to specify a BTL for internode communications. For example:

shell$ mpirun --mca btl self,sm,tcp -np 16 ./a.out


216. How does the sm BTL work?

A point-to-point user message is broken up by the PML into fragments. The sm BTL only has to transfer individual fragments. The steps are:

  • The sender pulls a shared-memory fragment out of one of its free lists. Each process has one free list for smaller (e.g., 4Kbyte) eager fragments and another free list for larger (e.g., 32Kbyte) max fragments.
  • The sender packs the user-message fragment into this shared-memory fragment, including any header information.
  • The sender posts a pointer to this shared fragment into the appropriate FIFO (first-in-first-out) queue of the receiver.
  • The receiver polls its FIFO(s). When it finds a new fragment pointer, it unpacks data out of the shared-memory fragment and notifies the sender that the shared fragment is ready for reuse (to be returned to the sender's free list).

On each node where an MPI job has two or more processes running, the job creates a file that each process mmaps into its address space. Shared-memory resources that the job needs -- such as FIFOs and fragment free lists -- are allocated from this shared-memory area.


217. Why does my MPI job no longer start when there are too many processes on one node?

If you are using OMPI 1.3.1 or earlier, it is possible that the shared-memory area set aside for your job was not created large enough. Make sure you're running in 64-bit mode (compiled with -m64) and set the MCA parameter mpool_sm_max_size to be very large -- even several Gbytes. Exactly how large is discussed further below.

Regardless of which OMPI release you're using, make sure that there is sufficient space for a large file to back the shared memory -- typically in /tmp.


218. How do I know what MCA parameters are available for tuning MPI performance?

The ompi_info command can display all the parameters available for the sm BTL and sm mpool:

shell$ ompi_info --param  btl  sm
shell$ ompi_info --param mpool sm


219. How can I tune these parameters to improve performance?

Mostly, the default values of the MCA parameters have already been chosen to give good performance. To improve performance further is a little bit of an art. Sometimes, it's a matter of trading off performance for memory.

btl_sm_eager_limit: If message data plus header information fits within this limit, the message is sent "eagerly" -- that is, a sender attempts to write its entire message to shared buffers without waiting for a receiver to be ready. Above this size, a sender will only write the first part of a message, then wait for the receiver to acknowledge that it is ready before continuing. Eager sends can improve performance by decoupling senders from receivers.

btl_sm_max_send_size: Large messages are sent in fragments of this size. Larger fragments can lead to greater efficiency, though they could perhaps also inhibit pipelining between sender and receiver.

btl_sm_num_fifos: Starting in OMPI 1.3.2, this is the number of FIFOs per receiving process. By default, there is only one FIFO per process. Conceivably, if many senders are all sending to the same process and contending for a single FIFO, there will be congestion. If there are many FIFOs, however, the receiver must poll more FIFOs to find incoming messages. Therefore, you might try increasing this parameter slightly if you have many (at least dozens of) processes all sending to the same process. For example, if 100 senders are all contending for a single FIFO for a particular receiver, it may suffice to increase btl_sm_num_fifos from 1 to 2.
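
On the command line, that would look like (the process count shown is illustrative):

shell$ mpirun --mca btl_sm_num_fifos 2 -np 32 ./a.out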

btl_sm_fifo_size: Starting in OMPI 1.3.2, FIFOs no longer grow dynamically. If you believe a FIFO is getting congested because a process falls far behind in reading incoming message fragments, increase this size manually.

btl_sm_free_list_num: This is the initial number of fragments on each (eager and max) free list. The free lists can grow in response to resource congestion, but you can increase this parameter to pre-reserve space for more fragments.

mpool_sm_min_size: You can reserve headroom for the shared-memory area to grow by increasing this parameter.


220. Where is the file that sm will mmap in?

The file will be in the OMPI session directory, which is typically something like /tmp/openmpi-sessions-myusername@mynodename* . The file itself will have the name shared_mem_pool.mynodename. For example, the full path could be /tmp/openmpi-sessions-myusername@node0_0/1543/1/shared_mem_pool.node0.

To place the session directory in a non-default location, use the MCA parameter orte_tmpdir_base.
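
For example (the directory shown is illustrative; make sure it is on a local filesystem):

shell$ mpirun --mca orte_tmpdir_base /local/scratch -np 4 a.out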


221. Why am I seeing incredibly poor performance with the sm BTL?

The most common problem with the shared memory BTL is when the Open MPI session directory is placed on a network filesystem (e.g., if /tmp is not a local disk). This is because the shared memory BTL places a memory-mapped file in the Open MPI session directory (see this entry for more details). If the session directory is located on a network filesystem, the shared memory BTL latency will be extremely high.

Try not mounting /tmp as a network filesystem, and/or moving the Open MPI session directory to a local filesystem.

Some users have reported success and possible performance optimizations with having /tmp mounted as a "tmpfs" filesystem (i.e., a RAM-based filesystem). However, before configuring your system this way, be aware of a few items:

  1. Open MPI writes a few small meta data files into /tmp and may therefore consume some extra memory that could have otherwise been used for application instruction or data state.
  2. If you use the "filem" system in Open MPI for moving executables between nodes, these files are stored under /tmp.
  3. Open MPI's checkpoint / restart files can also be saved under /tmp.
  4. If the Open MPI job is terminated abnormally, there are some circumstances where files (including memory-mapped shared memory files) can be left in /tmp. This can happen, for example, when a resource manager forcibly kills an Open MPI job and does not give it the chance to clean up /tmp files and directories.

Some users have reported success with configuring their resource manager to run a script between jobs to forcibly empty the /tmp directory.


222. Can I use SysV instead of mmap?

In the 1.3 and 1.4 Open MPI series, shared memory is established via mmap. In future releases, there may be an option for using SysV shared memory.


223. How much shared memory will my job use?

Your job will create a shared-memory area on each node where it has two or more processes. This area will be fixed during the lifetime of your job. Shared-memory allocations (for FIFOs and fragment free lists) will be made in this area. Here, we look at the size of that shared-memory area.

If you want just one, hard number, then go with approximately 128 Mbytes per node per job, shared by all the job's processes on that node. That is, an OMPI job will need more than a few Mbytes per node, but typically less than a few Gbytes.

Better yet, read on.

Up through OMPI 1.3.1, the shared-memory file would basically be sized:

  nbytes = n * mpool_sm_per_peer_size
  if ( nbytes < mpool_sm_min_size ) nbytes = mpool_sm_min_size
  if ( nbytes > mpool_sm_max_size ) nbytes = mpool_sm_max_size

where n is the number of processes in the job running on that particular node and the mpool_sm_* are MCA parameters. For small n, this size is typically excessive. For large n (e.g., 128 MPI processes on the same node), this size may not be sufficient for the job to start.

Starting in OMPI 1.3.2, a more sophisticated formula was introduced to model more closely how much memory was actually needed. That formula is somewhat complicated and subject to change. It guarantees that there will be at least enough shared memory for the program to start up and run. See this FAQ item to see how much is needed. Alternatively, the motivated user can examine the OMPI source code to see the formula used -- for example, here is the formula in OMPI revision SVN r20906.

OMPI 1.3.2 also uses the MCA parameter mpool_sm_min_size to set a minimum size -- e.g., so that there is not only enough shared memory for the job to start, but additionally headroom for further shared-memory allocations (e.g., of more eager or max fragments).

Once the shared-memory area is established, it will not grow further during the course of the MPI job's run.


224. How much shared memory do I need?

In most cases, OMPI will start your job with sufficient shared memory.

Nevertheless, if OMPI doesn't get you enough shared memory (e.g., you're using OMPI 1.3.1 or earlier with roughly 128 processes or more on a single node) or you want to trim shared-memory consumption, you may want to know how much shared memory is really needed.

As we saw earlier, the shared memory area contains:

  • FIFOs
  • eager fragments
  • max fragments

In general, you need only enough shared memory for the FIFOs and fragments that are allocated during MPI_Init().

Beyond that, you might want additional shared memory for performance reasons, so that FIFOs and fragment lists can grow if your program's message traffic encounters resource congestion. Even if there is no room to grow, however, your correctly written MPI program should still run to completion in the face of congestion; performance simply degrades somewhat. Note that while shared-memory resources can grow after MPI_Init(), they cannot shrink.

So, how much shared memory is needed during MPI_Init() ? You need approximately the total of:

  • FIFOs:
    • (≤ OMPI 1.3.1):     3 × n × n × pagesize
    • (≥ OMPI 1.3.2):     n × btl_sm_num_fifos × btl_sm_fifo_size × sizeof(void *)
  • eager fragments:     n × ( 2 × n + btl_sm_free_list_inc ) × btl_sm_eager_limit
  • max fragments:     n × btl_sm_free_list_num × btl_sm_max_send_size

where

  • n is the number of MPI processes in your job on the node
  • pagesize is the OS page size (4K for Linux and 8K for Solaris)
  • btl_sm_* are MCA parameters


225. How can I decrease my shared-memory usage?

There are two parts to this question.

First, how does one reduce how big the mmap file is? The answer is:

  • up to OMPI 1.3.1: reduce mpool_sm_per_peer_size, mpool_sm_min_size, and mpool_sm_max_size
  • starting with OMPI 1.3.2: reduce mpool_sm_min_size

Second, how does one reduce how much shared memory is needed? (Just making the mmap file smaller doesn't help if then your job won't start up.) The answers are:

  • For small values of n -- that is, for few processes per node -- shared-memory usage during MPI_Init() is predominantly for max free lists. So, you can reduce the MCA parameter btl_sm_max_send_size. Alternatively, you could reduce btl_sm_free_list_num, but it is already pretty small by default.
  • For large values of n -- that is, for many processes per node -- there are two cases:
    • up to OMPI 1.3.1: shared-memory usage is dominated by the FIFOs, which consume a certain number of pages. Usage is high and cannot be reduced much via MCA parameter tuning.
    • starting with OMPI 1.3.2: shared-memory usage is dominated by the eager free lists. So, you can reduce the MCA parameter btl_sm_eager_limit.


226. How do I specify to use the TCP network for MPI messages?

In general, you specify that the tcp BTL component should be used. However, note that you should also specify that the self BTL component should be used. self is for loopback communication (i.e., when an MPI process sends to itself), and is technically a different communication channel than TCP. For example:

shell$ mpirun --mca btl tcp,self ...

Failure to specify the self BTL may result in Open MPI being unable to complete send-to-self scenarios (meaning that your program will run fine until a process tries to send to itself).

Note that if the tcp BTL is available at run time (which it should be on most POSIX-like systems), Open MPI should automatically use it by default (ditto for self). Hence, it's usually unnecessary to specify these options on the mpirun command line. They are typically only used when you want to be absolutely positively definitely sure to use the specific BTL.

If you are using a high-speed network (such as Myrinet or InfiniBand), be sure to also see this FAQ entry.


227. But wait -- I'm using a high-speed network. Do I have to disable the TCP BTL?

No. Following the so-called "Law of Least Astonishment", Open MPI assumes that if you have both a TCP network and at least one high-speed network (such as Myrinet or InfiniBand), you will likely only want to use the high-speed network(s) for MPI message passing. Hence, the tcp BTL component will sense this and automatically deactivate itself.

That being said, Open MPI may still use TCP for setup and teardown information -- so you'll see traffic across your TCP network during startup and shutdown of your MPI job. This is normal and does not affect the MPI message passing channels.


228. How do I know what MCA parameters are available for tuning MPI performance?

The ompi_info command can display all the parameters available for the tcp BTL component (i.e., the component that uses TCP for MPI communications):

shell$ ompi_info --param btl tcp

NOTE: Starting with the Open MPI 1.7 series, ompi_info will only show a few MCA parameters by default. You will need to specify --level 9 (or --all) to show all MCA parameters. For example:

shell$ ompi_info --param btl tcp --level 9

or

shell$ ompi_info --all


229. Does Open MPI use the TCP loopback interface?

Usually not.

In general message passing usage, there are two scenarios where the TCP loopback interface could be used:

  1. Sending a message from one process to itself
  2. Sending a message from one process to another process on the same machine

The TCP BTL does not handle "send-to-self" scenarios in Open MPI; indeed, it is not even capable of doing so. Instead, the self BTL component is used for all send-to-self MPI communications (this allows all Open MPI BTL components to avoid special case code for send-to-self scenarios). The self component uses its own mechanisms for send-to-self scenarios; it does not use network interfaces.

When sending to processes on the same machine, Open MPI will default to using the shared memory (sm) BTL. If the user has deactivated this BTL, depending on what other BTL components are available, it is possible that the TCP BTL will be chosen for message passing to processes on the same node, in which case the TCP loopback device will likely be used. But this is not the default; either shared memory has to fail to start up properly or the user must specifically request not to use the shared memory BTL.


230. I have multiple TCP networks on some/all of my cluster nodes. Which ones will Open MPI use?

In general, Open MPI will greedily use all TCP networks that it finds per its reachability computations.

To change this behavior, you can either specifically include certain networks or specifically exclude certain networks. See this FAQ entry for more details.


231. I'm getting TCP-related errors. What do they mean?

TCP-related errors are usually reported by Open MPI in a message similar to these:

btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
mca_btl_tcp_frag_send: writev failed with errno=104 

If an error number is displayed with no explanation string, you can see what that specific error number means on your operating system with the following command (the following example was run on Linux; results may be different on other operating systems):

shell$ perl -e 'die$!=113'
No route to host at -e line 1.
shell$ perl -e 'die$!=104'
Connection reset by peer at -e line 1.

Two types of errors are commonly reported to the Open MPI user's mailing list:

  1. "No route to host" (errno 113, shown above): this usually means that Open MPI tried to make a TCP connection over an interface that is not actually reachable from the peer (for example, a local-only or private interface); see the FAQ entry below about telling Open MPI which TCP interfaces to use.
  2. "Connection reset by peer" (errno 104, shown above): this usually occurs after MPI_INIT has completed and typically indicates that a peer MPI process died or exited unexpectedly.


232. How do I tell Open MPI which TCP interfaces / networks to use?

In some parallel environments, it is not uncommon to have multiple TCP interfaces on each node -- for example, one TCP network may be "slow" and used for control information such as a batch scheduler, a networked filesystem, and/or interactive logins. Another TCP network (or networks) may be "fast" and be intended for parallel applications to use during their runs. As another example, some operating systems may also have virtual interfaces for communicating with virtual machines.

Unless otherwise specified, Open MPI will greedily use all "up" TCP networks that it can find and try to connect to all peers upon demand (i.e., Open MPI does not open sockets to all of its MPI peers during MPI_INIT -- see this FAQ entry for more details). Hence, if you want MPI jobs to not use specific TCP networks -- or not use any TCP networks at all -- then you need to tell Open MPI.

NOTE: Aggressively using all "up" interfaces can cause problems in some cases. For example, if you have a machine with a local-only interface (e.g., the loopback device, or a virtual-machine bridge device that can only be used on that machine, and cannot be used to communicate with MPI processes on other machines), you will likely need to tell Open MPI to ignore these networks. Open MPI usually ignores loopback devices by default, but other local-only devices must be manually ignored. Users have reported cases where RHEL6 automatically installed a "virbr0" device for Xen virtualization. This interface was automatically given an IP address in the 192.168.1.0/24 subnet and marked as "up". Since Open MPI saw this 192.168.1.0/24 "up" interface in all MPI processes on all nodes, it assumed that that network was usable for MPI communications. This is obviously incorrect, and it led to MPI applications hanging when they tried to send or receive MPI messages.

  1. To disable Open MPI from using TCP for MPI communications, set the btl MCA parameter accordingly. You can either exclude the tcp component or include only the other components that you want to use. Specifically:

    # This says to exclude the TCP BTL component 
    # (implicitly including all others)
    shell$ mpirun --mca btl ^tcp ...
    
    # This says to include only the listed BTL components
    # (tcp is not listed, and therefore will not be used)
    shell$ mpirun --mca btl self,sm,openib ...
    

  2. If you want to use TCP for MPI communications, but want to restrict it from certain networks, use the btl_tcp_if_include or btl_tcp_if_exclude MCA parameters (only one of the two should be set). The values of these parameters can be a comma-delimited list of network interfaces. For example:

    # This says to not use the eth0 and lo interfaces
    # (and implicitly use all the rest).  Per the description
    # above, the TCP loopback and all local-only devices *must*
    # be listed in the exclude list if that list is specified at all.
    shell$ mpirun --mca btl_tcp_if_exclude lo,eth0 ...
    
    # This says to only use the eth1 and eth2 interfaces
    # (and implicitly ignore the rest)
    shell$ mpirun --mca btl_tcp_if_include eth1,eth2 ...
    

  3. Starting in the Open MPI v1.5 series, you can specify subnets in the include or exclude lists in CIDR notation. For example:

    # Only use the 192.168.1.0/24 and 10.10.0.0/16 subnets for MPI
    # communications:
    shell$ mpirun --mca btl_tcp_if_include 192.168.1.0/24,10.10.0.0/16 ...
    

    NOTE: If you use the btl_tcp_if_include and btl_tcp_if_exclude MCA parameters to shape the behavior of the TCP BTL for MPI communications, you may also need/want to investigate the corresponding MCA parameters oob_tcp_if_include and oob_tcp_if_exclude, which are used to shape non-MPI TCP-based communication (e.g., communications setup and coordination during MPI_INIT and MPI_FINALIZE).

Note that Open MPI will still use TCP for control messages, such as data between mpirun and the MPI processes, rendezvous information during MPI_INIT, etc. To disable TCP altogether, you also need to disable the tcp component from the OOB framework.
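
For example, a sketch that restricts both the MPI (BTL) traffic and the non-MPI control (OOB) traffic to a single network (the interface name is illustrative):

shell$ mpirun --mca btl_tcp_if_include eth1 --mca oob_tcp_if_include eth1 ...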


233. Does Open MPI open a bunch of sockets during MPI_INIT?

Although Open MPI is likely to open multiple TCP sockets during MPI_INIT, the tcp BTL component does not open one socket per MPI peer process during MPI_INIT. Open MPI opens sockets as they are required -- so the first time a process sends a message to a peer and there is not yet a TCP connection between the two, Open MPI will automatically open a new socket.

Hence, you should not have scalability issues with running large numbers of processes (e.g., running out of per-process file descriptors) if your parallel application is sparse in its communication with peers.


234. Are there any Linux kernel TCP parameters that I should set?

Everyone has different opinions on this, and it also depends on your exact hardware and environment. Below are general guidelines that some users have found helpful.

  1. net.ipv4.tcp_syn_retries: Some Linux systems have very large initial connection timeouts -- they retry sending SYN packets many times before determining that a connection cannot be made. If MPI is going to fail to make socket connections, it is better for it to fail reasonably quickly (minutes vs. hours). You might want to reduce this value to a smaller one; YMMV.
  2. net.ipv4.tcp_keepalive_time: Some MPI applications send an initial burst of MPI messages (over TCP) and then send nothing for long periods of time (e.g., embarrassingly parallel applications). Linux may decide that these dormant TCP sockets are dead because it has seen no traffic on them for long periods of time. You might therefore need to lengthen the TCP inactivity timeout. Many Linux systems default to 7,200 seconds; increase it if necessary.
  3. Increase TCP buffering for 10G or 40G. Many Linux distributions come with good buffering presets for 1G Ethernet. In a datacenter/HPC cluster with 10G or 40G Ethernet NICs, this amount of kernel buffering is typically insufficient. Here's a set of parameters that some have used for good 10G/40G TCP bandwidth:
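
    As an illustrative sketch only (the values below are examples, not
    recommendations -- tune them for your own NICs, kernel, and workload),
    such settings typically raise the kernel's TCP buffer limits:

    # Example values only; adjust for your environment
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216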

Each of the above items is a Linux kernel parameter that can be set in multiple different ways.

  1. You can change the running kernel via the /proc filesystem:

    shell# cat /proc/sys/net/ipv4/tcp_syn_retries
    5
    shell# echo 6 > /proc/sys/net/ipv4/tcp_syn_retries
    

  2. You can also use the sysctl command:

    shell# sysctl net.ipv4.tcp_syn_retries
    net.ipv4.tcp_syn_retries = 5
    shell# sysctl -w net.ipv4.tcp_syn_retries=6
    net.ipv4.tcp_syn_retries = 6
    

  3. Or you can set them by adding entries in /etc/sysctl.conf, which are persistent across reboots:

    shell# grep tcp_syn_retries /etc/sysctl.conf
    net.ipv4.tcp_syn_retries = 6
    

  4. Your Linux distro may also support placing individual files in /etc/sysctl.d (even if that directory does not yet exist), which is actually better practice than putting them in /etc/sysctl.conf. For example:

    shell# cat /etc/sysctl.d/my-tcp-settings
    net.ipv4.tcp_syn_retries = 6
    


235. How does Open MPI know which TCP addresses are routable to each other in Open MPI 1.2?

This is a fairly complicated question -- there can be ambiguity when hosts have multiple TCP NICs and/or there are multiple TCP networks that are not routable to each other in a single MPI job.

It is important to note that Open MPI's atomic unit of routing is a process -- not an IP address. Hence, Open MPI makes connections between processes, not nodes (these processes are almost always on remote nodes, but it's still better to think in terms of processes, not nodes).

Specifically, since OMPI can span multiple TCP networks, each MPI process may be able to use multiple IP addresses to reach each other MPI process (and vice versa). So for each process, Open MPI needs to determine which IP address -- if any -- to use to connect to a peer MPI process.

For example, say that you have a cluster with 16 nodes on a private ethernet network. One of these nodes doubles as the head node for the cluster and therefore has 2 ethernet NICs -- one to the external network and one to the internal cluster network. But since 16 is a nice number, you also want to use the head node for computation. So when you mpirun spanning all 16 nodes, OMPI has to figure out not to use the external NIC on the head node and to use only the internal NIC.

To explain what happens, we need to explain some of what happens in MPI_INIT. Even though Open MPI only makes TCP connections between peer MPI processes upon demand (see this FAQ entry), each process publishes its TCP contact information which is then made available to all processes. Hence, every process knows the TCP address(es) and corresponding port number(s) to contact every other process.

But keep in mind that these addresses may span multiple TCP networks and/or not be routable to each other. So when a connection is requested, the TCP BTL component in Open MPI creates pairwise combinations of all the TCP addresses of the localhost to all the TCP addresses of the peer process, looking for a match.

A "match" is defined by the following rules:

  1. If the two IP addresses match after the subnet mask is applied, assume that they are mutually routable and allow the connection
  2. If the two IP addresses are public, assume that they are mutually routable and allow the connection
  3. Otherwise, the connection is disallowed (this is not an error -- we just disallow this connection on the hope that some other device can be used to make a connection)

These rules tend to cover the following scenarios:

  • A cluster on a private network with a head node that has a NIC on the private network and the public network
  • Clusters that have all public addresses

These rules do not cover the following cases:

  • Running an MPI job that spans public and private networks
  • Running an MPI job that spans a bunch of private networks with narrowly-scoped netmasks, such as nodes that have IP addresses 192.168.1.10 and 192.168.2.10 with netmasks of 255.255.255.0 (i.e., the network fabric makes these two nodes be routable to each other, even though the netmask implies that they are on different subnets).


236. How does Open MPI know which TCP addresses are routable to each other in Open MPI 1.3 (and beyond)?

The 1.3 series makes very different assumptions about routability than the 1.2 series. In the 1.3 series, we assume that all interfaces are routable to each other as long as they have the same address family, IPv4 or IPv6. We use graph theory and give each possible connection a weight depending on the quality of the connection. This allows the library to select the best connections between nodes. This method also supports striping, but prevents more than one connection to any single interface.

The quality of the connection is defined as follows, with a higher number meaning a better connection. Note that a connection consisting of a private address and a public address is given the weight of PRIVATE_DIFFERENT_NETWORK.

            NO_CONNECTION = 0
PRIVATE_DIFFERENT_NETWORK = 1
PRIVATE_SAME_NETWORK      = 2
PUBLIC_DIFFERENT_NETWORK  = 3
PUBLIC_SAME_NETWORK       = 4

At this point, an example will best illustrate how two processes on two different nodes would connect up. Here we have two nodes with a variety of interfaces.

       
        NodeA                NodeB
   ---------------       ---------------
  |     lo0       |     |     lo0       |
  |  127.0.0.1    |     |  127.0.0.1    |
  |  255.0.0.0    |     |  255.0.0.0    |
  |               |     |               |
  |     eth0      |     |    eth0       |
  |   10.8.47.1   |     |   10.8.47.2   |
  | 255.255.255.0 |     | 255.255.255.0 |
  |               |     |               |
  |     ibd0      |     |     ibd0      |
  |  192.168.1.1  |     |  192.168.1.2  |
  | 255.255.255.0 |     | 255.255.255.0 |
  |               |     |               |
  |     ibd1      |     |               |
  |  192.168.2.2  |     |               |
  | 255.255.255.0 |     |               |
   ---------------       ---------------

From these two nodes, the software builds up a bipartite graph that shows all the possible connections with all the possible weights. The lo0 interfaces are excluded as the btl_tcp_if_exclude mca parameter is set to lo by default. Here is what all the possible connections with their weights look like.

     NodeA         NodeB
eth0 --------- 2 -------- eth0
    \
     \
      \------- 1 -------- ibd0

ibd0 --------- 1 -------- eth0
    \
     \
      \------- 2 -------- ibd0

ibd1 --------- 1 -------- eth0
    \
     \
      \------- 1 -------- ibd0

The library then examines all the connections and picks the optimal ones. This leaves us with two connections being established between the two nodes.

If you are curious about the actual connect() calls being made by the processes, then you can run with --mca btl_base_verbose 30. This can be useful if you notice your job hanging and believe it may be the library trying to make connections to unreachable hosts.

# Here is an example with some of the output deleted for clarity.
# One can see the connections that are attempted.
shell$ mpirun --mca btl self,sm,tcp --mca btl_base_verbose 30 -np 2 -host NodeA,NodeB a.out
[...snip...]
[NodeA:18003] btl: tcp: attempting to connect() to address 10.8.47.2 on port 59822
[NodeA:18003] btl: tcp: attempting to connect() to address 192.168.1.2 on port 59822
[NodeB:16842] btl: tcp: attempting to connect() to address 192.168.1.1 on port 44500
[...snip...]

In case you want more details about the theory behind the connection code, you can find the background story in a brief IEEE paper.


237. Does Open MPI ever close TCP sockets?

As of v1.2, no.

Although TCP sockets are opened "lazily" (meaning that MPI connections / TCP sockets are only opened upon demand -- as opposed to opening all possible sockets between MPI peer processes during MPI_INIT), they are never closed.


238. Does Open MPI support IP interfaces that have more than one IP address?

As of v1.6, no.

For example, if the output from your ifconfig has a single IP device with multiple IP addresses like this:

0: eth0:  mtu 1500 qdisc mq state UP qlen 1000
   link/ether 00:18:ae:f4:d2:29 brd ff:ff:ff:ff:ff:ff
   inet 192.168.0.3/24 brd 192.168.0.255 scope global eth0:1
   inet 10.10.0.3/24 brd 10.10.0.255 scope global eth0
   inet6 fe80::218:aef2:29b4:2c4/64 scope link 
      valid_lft forever preferred_lft forever

(note the two "inet" lines in there)

Then Open MPI will be unable to use this device.


239. Does Open MPI support virtual IP interfaces?

As of v1.6.2, no.

For example, if the output of your ifconfig has both "eth0" and "eth0:0", Open MPI will get confused if you use the TCP BTL, and will likely hang.

Note that using btl_tcp_if_include or btl_tcp_if_exclude to avoid using the virtual interface will not solve the issue.

This may get fixed in a future release. See Trac bug #3339 to follow the progress on this issue.


240. What Myrinet-based components does Open MPI have?

Some versions of Open MPI support both GM and MX for MPI communications.

Open MPI series      GM supported   MX supported
v1.0 series          Yes            Yes
v1.1 series          Yes            Yes
v1.2 series          Yes            Yes (BTL and MTL)
v1.3 / v1.4 series   Yes            Yes (BTL and MTL)
v1.5 / v1.6 series   No             Yes (BTL and MTL)
v1.7 / v1.8 series   No             Yes (MTL only)
v1.9 and beyond      No             No


241. How do I specify to use the Myrinet GM network for MPI messages?

In general, you specify that the gm BTL component should be used. However, note that you should also specify that the self BTL component should be used. self is for loopback communication (i.e., when an MPI process sends to itself). This is technically a different communication channel than Myrinet. For example:

shell$ mpirun --mca btl gm,self ...

Failure to specify the self BTL may result in Open MPI being unable to complete send-to-self scenarios (meaning that your program will run fine until a process tries to send to itself).

To use Open MPI's shared memory support for on-host communication instead of GM's shared memory support, simply include the sm BTL. For example:

shell$ mpirun --mca btl gm,sm,self ...

Finally, note that if the gm component is available at run time, Open MPI should automatically use it by default (ditto for self and sm). Hence, it's usually unnecessary to specify these options on the mpirun command line. They are typically only used when you want to be absolutely positively definitely sure to use the specific BTL.


242. How do I specify to use the Myrinet MX network for MPI messages?

As of version 1.2, Open MPI has two different components to support Myrinet MX: the mx BTL and the mx MTL. Only one of them can be used at a time. Prior versions only have the mx BTL.

If available, the mx BTL is used by default. However, to be sure it is selected you can specify it. Note that you should also specify the self BTL component (for loopback communication) and the sm BTL component (for on-host communication). For example:

shell$ mpirun --mca btl mx,sm,self ...

To use the mx MTL component, it must be specified. Also, you must use the cm PML component. For example:

shell$ mpirun --mca mtl mx --mca pml cm ...

Note that one cannot use both the mx MTL and the mx BTL components at once. Deciding which to use largely depends on the application being run.


243. But wait -- I also have a TCP network. Do I need to explicitly disable the TCP BTL?

No. See this FAQ entry for more details.


244. How do I know what MCA parameters are available for tuning MPI performance?

The ompi_info command can display all the parameters available for the gm and mx BTL components and the mx MTL component:

# Show the gm BTL parameters
shell$ ompi_info --param btl gm

# Show the mx BTL parameters
shell$ ompi_info --param btl mx

# Show the mx MTL parameters
shell$ ompi_info --param mtl mx


245. I'm experiencing a problem with Open MPI on my Myrinet-based network; how do I troubleshoot and get help?

In order for us to help you, it is most helpful if you can perform a few steps before sending an e-mail, both to do some basic troubleshooting and to provide us with enough information about your environment. Please include answers to the following questions in your e-mail:

  1. Which Myricom software stack are you running: GM or MX? Which version?
  2. Are you using "fma", the "gm_mapper", or the "mx_mapper"?
  3. If running GM, include the output from running gm_board_info on a known "good" node and a known "bad" node.
    If running MX, include the output from running mx_info on a known "good" node and a known "bad" node.
    • Is the "Map version" value from this output the same across all nodes?
    • NOTE: If the map version is not the same, ensure that you are not running a mixture of FMA on some nodes and the mapper on others. Also check the connectivity of nodes that seem to have an inconsistent map version.

  4. What are the contents of the file /var/run/fms/fma.log?

Gather up this information and see this page about how to submit a help request to the user's mailing list.


246. How do I adjust the MX first fragment size? Are there constraints?

The MX library limits the maximum message fragment size for both on-node and off-node messages. As of MX v1.0.3, the inter-node maximum fragment size is 32k, and the intra-node maximum fragment size is 16k -- fragments sent larger than these sizes will fail.

Open MPI automatically fragments large messages; it currently limits its first fragment size on MX networks to the lower of these two values -- 16k. As such, increasing the value of the MCA parameter named btl_mx_first_frag_size larger than 16k may cause failures in some cases (i.e., when using MX to send large messages to processes on the same node); it will cause failures in all cases if it is set above 32k.

Note that this only affects the first fragment of messages; later fragments do not have this size restriction. The MCA parameter btl_mx_max_send_size can be used to vary the maximum size of subsequent fragments.
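
For example, a hedged illustration of raising the subsequent-fragment size (the value shown is arbitrary):

shell$ mpirun --mca btl_mx_max_send_size 65536 ...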


247. What versions of Open MPI contain support for uDAPL?

The following versions of Open MPI contain support for uDAPL:

Open MPI series      uDAPL supported
v1.0 series          No
v1.1 series          No
v1.2 series          Yes
v1.3 / v1.4 series   Yes
v1.5 / v1.6 series   Yes
v1.7 and beyond      No


248. What is different between Sun Microsystems ClusterTools 7 and Open MPI with regard to the uDAPL BTL?

Sun's ClusterTools is based on Open MPI, with one significant difference: Sun's ClusterTools includes uDAPL RDMA capabilities in the uDAPL BTL. The Open MPI v1.2 uDAPL BTL does not include the RDMA capabilities. These improvements do exist today in the Open MPI trunk and will be included in future Open MPI releases.


249. What values are expected to be used by the btl_udapl_if_include and btl_udapl_if_exclude MCA parameters?

The uDAPL BTL looks for a match from the uDAPL static registry, which is contained in the dat.conf file. Each line that is neither a comment nor blank is considered an interface. The first field of each interface entry is the value that must be supplied to the MCA parameter in question.

Solaris Example:

shell% datadm -v
ibd0  u1.2  nonthreadsafe  default  udapl_tavor.so.1  SUNW.1.0  " "  "driver_name=tavor"
shell% mpirun --mca btl_udapl_if_include ibd0 ...

Linux Example:

shell% cat /etc/dat.conf
OpenIB-cma u1.2 nonthreadsafe default /usr/local/ofed/lib64/libdaplcma.so dapl.1.2 "ib0 0" ""
OpenIB-bond u1.2 nonthreadsafe default /usr/local/ofed/lib64/libdaplcma.so dapl.1.2 "bond0 0" ""
shell% mpirun --mca btl_udapl_if_exclude OpenIB-bond ...


250. Where is the static uDAPL Registry found?

Solaris: /etc/dat/dat.conf

Linux: /etc/dat.conf


251. How come the value reported by "ifconfig" is not accepted by the btl_udapl_if_include/btl_udapl_if_exclude MCA parameter?

uDAPL queries a static registry defined in the dat.conf file to find available interfaces which can be used. As such, the uDAPL BTL needs to match the names found in the registry and these may differ from what is reported by "ifconfig".


252. I get a warning message about not being able to register memory and possibly being out of privileged memory while running on Solaris; what can I do?

The error message probably looks something like this:

WARNING: The uDAPL BTL is not able to register memory. Possibly out of
allowed privileged memory (i.e. memory that can be pinned). Increasing
the allowed privileged memory may alleviate this issue.

One thing to do is increase the amount of available privileged memory. On Solaris, your system administrator can increase the amount of available privileged memory by editing the /etc/project file on the nodes. For more information, see the Solaris "project" man page.

shell% man project

As an example of increasing the privileged memory, first determine the amount available (a typical value is 978 MB):

shell% prctl -n project.max-device-locked-memory -i project default
NAME    PRIVILEGE       VALUE    FLAG   ACTION          RECIPIENT
project.max-device-locked-memory
        privileged       978MB      -   deny            -
        system          16.0EB    max   deny            -

To increase the amount of privileged memory edit /etc/project file:

Default /etc/project file.

system:0::::
user.root:1::::
noproject:2::::
default:3::::
group.staff:10::::

Change it to, for example, 4 GB:

system:0::::
user.root:1::::
noproject:2::::
default:3::::project.max-device-locked-memory=(priv, 4294967296, deny) 
group.staff:10::::


253. What is special about MPI performance analysis?

The synchronization among the MPI processes can be a key performance concern. For example, if a serial program spends a lot of time in function foo(), you should optimize foo(). In contrast, if an MPI process spends a lot of time in MPI_Recv(), not only is the optimization target probably not MPI_Recv(), but you should in fact probably be looking at some other process altogether. You should ask, "What is happening on other processes when this process has the long wait?"

Another issue is that a parallel program (in the case of MPI, a multi-process program) can generate much more performance data than a serial program due to the greater number of execution threads. Managing that data volume can be a challenge.


254. What are "profiling" and "tracing"?

These terms are sometimes used to refer to two different kinds of performance analysis.

In profiling, one aggregates statistics at run time -- e.g., total amount of time spent in MPI, total number of messages or bytes sent, etc. Data volumes are small.

In tracing, an event history is collected. It is common to display such event history on a timeline display. Tracing data can provide much interesting detail, but data volumes are large.


255. How do I sort out busy wait time from idle wait, user time from system time, and so on?

Don't.

MPI synchronization delays, which are key performance inhibitors you will probably want to study, can show up as user or system time, all depending on the MPI implementation, the type of wait, what run-time settings you have chosen, etc. In many cases, it makes most sense for you just to distinguish between time spent inside MPI from time spent outside MPI. Elapsed wallclock time will probably be your key metric. Exactly how the MPI implementation spends time waiting is less important.


256. What is PMPI?

PMPI refers to the MPI standard profiling interface.

Each standard MPI function can be called with an MPI_ or PMPI_ prefix. For example, you can call either MPI_Send() or PMPI_Send(). This feature of the MPI standard allows one to write functions with the MPI_ prefix that call the equivalent PMPI_ function. Specifically, a function so written has the behavior of the standard function plus any other behavior one would like to add. This is important for MPI performance analysis in at least two ways.

First, many performance analysis tools take advantage of PMPI. They capture the MPI calls made by your program. They perform the associated message-passing calls by calling PMPI functions, but also capture important performance data.

Second, you can use such wrapper functions to customize MPI behavior. E.g., you can add barrier operations to collective calls, write out diagnostic information for certain MPI calls, etc.
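
As a minimal sketch (illustrative only, not code from Open MPI itself), a C profiling wrapper that times MPI_Barrier() might look like the following; it records a measurement and then calls the PMPI_ routine to do the real work:

#include <stdio.h>
#include <mpi.h>

/* Illustrative profiling wrapper: intercept MPI_Barrier(), time it,
   and let the PMPI_ interface perform the actual operation. */
int MPI_Barrier(MPI_Comm comm)
{
    double start = MPI_Wtime();
    int rc = PMPI_Barrier(comm);   /* the real barrier */
    double elapsed = MPI_Wtime() - start;
    printf("MPI_Barrier took %f seconds\n", elapsed);
    return rc;
}

Such a wrapper is typically compiled into a shared library (see the note below about making the library dynamic) and linked ahead of the MPI library; a plausible -- but purely illustrative -- build command is something like mpicc -shared -fPIC wrapper.c -o libmpiwrap.so.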

OMPI generally layers the various function interfaces as follows:

  • Fortran MPI_ interfaces are weak symbols for ...
  • Fortran PMPI_ interfaces, which call ...
  • C MPI_ interfaces, which are weak symbols for ...
  • C PMPI_ interfaces, which provide the specified functionality.

Since OMPI generally implements MPI functionality for all languages in C, you only need to provide profiling wrappers in C, even if your program is in another programming language. Alternatively, you may write the wrappers in your program's language, but if you provide wrappers in both languages then both sets will be invoked.

There are a handful of exceptions. For example, MPI_ERRHANDLER_CREATE() in Fortran does not call MPI_Errhandler_create(). Instead, it calls some other low-level function. Thus, to intercept this particular Fortran call, you need a Fortran wrapper.

Be sure you make the library dynamic. A static library can experience the linker problems described in the Complications section of the Profiling Interface chapter of the MPI standard.

See the section on Profiling Interface in the MPI standard for more details.


257. Should I use those switches --enable-mpi-profile and --enable-trace when I configure OMPI?

Probably not.

The --enable-mpi-profile switch enables building of the PMPI interfaces. While this is important for performance analysis, this setting is already turned on by default.

The --enable-trace switch enables internal tracing of OMPI/ORTE/OPAL calls. It is used only for developer debugging, not MPI application performance tracing.


258. What support does OMPI have for performance analysis?

The OMPI source base has some instrumentation to capture performance data, but that data must be analyzed by other non-OMPI tools.

PERUSE was a proposed MPI standard that gives information about low-level behavior of MPI internals. Check the PERUSE web site for any information about analysis tools. When you configure OMPI, be sure to use --enable-peruse. Information is available describing its integration with OMPI.

Unfortunately, PERUSE didn't win standardization, so it didn't really go anywhere. Open MPI may drop PERUSE support at some point in the future.

MPI-3 standardized the MPIT tools interface API (see Chapter 14 in the MPI-3.0 specification). As of v1.6.3, Open MPI does not yet support this interface, but it is actively being developed. It is expected that Open MPI will include a full implementation of MPIT in a future release.

VampirTrace traces the entry to and exit from the MPI layer, along with important performance data, writing data using the open OTF format. VT is available freely and can be used with any MPI. Information is available describing its integration with OMPI.


259. How do I view VampirTrace output?

While OMPI includes VampirTrace instrumentation, it does not provide a tool for viewing OTF trace data. There is simply a primitive otfdump utility in the same directory where other OMPI commands (mpicc, mpirun, etc.) are located.

Another simple utility, otfprofile, comes with OTF software and allows you to produce a short profile in LaTeX format from an OTF trace.

The main way to view OTF data is with the Vampir tool. Evaluation licenses are available.


260. Are there MPI performance analysis tools for OMPI that I can download for free?

The OMPI distribution includes no such tools, but some general MPI tools can be used with OMPI.

...we used to maintain a list of links here. But the list changes over time; projects come, and projects go. Your best bet these days is simply to use Google to find MPI tracing and performance analysis tools.


261. Any other kinds of tools I should know about?

Well, there are other tools you should consider. Part of performance analysis is not just analyzing performance per se, but generally understanding the behavior of your program.

As such, debugging tools can help you step through or pry into the execution of your MPI program. Popular tools include TotalView, which can be downloaded for free trial use, and Allinea DDT which also provides evaluation copies.

The command-line job inspection tool padb has been ported to ORTE and OMPI.


262. How does Open MPI handle HFS+ / UFS filesystems?

Generally, Open MPI does not care whether it is running from an HFS+ or UFS filesystem. However, the C++ wrapper compiler historically has been called mpiCC, which of course is the same file as mpicc when running on HFS+. During the configure process, Open MPI will attempt to determine whether the build filesystem is case sensitive, and it assumes that the install filesystem behaves the same way. Generally, this is all that is needed to deal with HFS+.

However, if you are building on UFS and installing to HFS+, you should specify --without-cs-fs to configure to make sure Open MPI does not build the mpiCC wrapper. Likewise, if you build on HFS+ and install to UFS, you may want to specify --with-cs-fs to ensure that mpiCC is installed.
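
For example (a sketch of the flag usage only):

shell$ ./configure --without-cs-fs ...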


263. How do I use the Open MPI wrapper compilers in XCode?

XCode has a non-public interface for adding compilers to XCode. A friendly Open MPI user sent in a configuration file for XCode 2.3, MPICC.pbcompspec, which will add support for the Open MPI wrapper compilers. The file should be placed in /Library/Application Support/Apple/Developer Tools/Specifications/. Upon starting XCode, this file is loaded and added to the list of known compilers.

To use the mpicc compiler, open the project, get info on the target, click the rules tab, and add a new entry. Change the process rule for "C source files" and select using MPICC.

Before moving the file, the ExecPath parameter should be set to the location of the Open MPI install. The BasedOn parameter should be updated to refer to the compiler version that mpicc will invoke -- generally gcc-4.0 on OS X 10.4 machines.

Thanks to Karl Dockendorf for this information.


264. How do I run jobs under XGrid?

XGrid support is included in Open MPI and will be built if the XGrid tools are installed.

We unfortunately have little documentation on how to run with XGrid at this point other than a fairly lengthy e-mail that Brian Barrett wrote on the Open MPI user's mailing list:

Since Open MPI 1.1.2, we also support authentication using Kerberos. The process is essentially the same, but there is no need to specify the XGRID_PASSWORD field. Open MPI applications will then run as the authenticated user, rather than nobody.


265. Where do I get more information about running under XGrid?

Please write to us on the user's mailing list. Hopefully any replies that we send will contain enough information to create proper FAQs about how to use Open MPI with XGrid.


266. Is Open MPI included in OS X?

Open MPI v1.2.3 was included in OS X starting with version 10.5 (Leopard). Note that Leopard does not include a Fortran compiler, so the OS X-shipped version of Open MPI does not include Fortran support.

If you need/want Fortran support, you will need to build your own copy of Open MPI (presumably after you have a Fortran compiler installed). The Open MPI team strongly recommends not overwriting the OS X-installed version of Open MPI, but rather installing it somewhere else (e.g., /opt/openmpi).


267. How do I not use the OS X-bundled Open MPI?

There are a few reasons you might not want to use the OS X-bundled Open MPI, such as wanting Fortran support, upgrading to a new version, etc.

If you wish to use a community version of Open MPI, you can download and build Open MPI on OS X just like any other supported platform. We strongly recommend not replacing the OS X-installed Open MPI, but rather installing to an alternate location (such as /opt/openmpi).

Once you have successfully installed Open MPI, be sure to prefix your PATH with Open MPI's bindir. This will ensure that you are using your newly-installed Open MPI, not the OS X-installed Open MPI. For example:

# Not showing the complete URL/tarball name because it changes over time :-)
shell$ wget http://www.open-mpi.org/.../open-mpi....
shell$ tar zxf openmpi-...gz
shell$ cd openmpi-...
shell$ ./configure --prefix=/opt/openmpi 2>&1 | tee config.out
[...lots of output...]
shell$ make -j 4 2>&1 | tee make.out
[...lots of output...]
shell$ sudo make install 2>&1 | tee install.out
[...lots of output...]
shell$ export PATH=/opt/openmpi/bin:$PATH
shell$ ompi_info
[...see output from newly-installed Open MPI...]

Of course, you'll want to make your PATH changes permanent. One way to do this is to edit your shell startup files.

Note that there is no need to add Open MPI's libdir to LD_LIBRARY_PATH; Open MPI's shared library build process uses the "rpath" mechanism to automatically find the correct shared libraries (i.e., the ones associated with this build, vs., for example, the OS X-shipped OMPI shared libraries). Also note that we specifically do not recommend adding Open MPI's libdir to DYLD_LIBRARY_PATH.

If you build static libraries for Open MPI, there is an ordering problem such that /usr/lib/libmpi.dylib will be found before $libdir/libmpi.a, and therefore user-linked MPI applications that use mpicc (and friends) will use the "wrong" libmpi. This can be fixed by editing OMPI's wrapper compilers to force the use of the Right libraries, such as with the following flag when configuring Open MPI:

shell$ ./configure --with-wrapper-ldflags="-Wl,-search_paths_first" ...


268. Is AIX a supported operating system for Open MPI?

No. AIX used to be supported, but none of the current Open MPI developers has any platforms that require AIX support for Open MPI.

Since Open MPI is an open source project, its features and requirements are driven by the union of its developers. Hence, AIX support has fallen away because none of us currently use AIX. All this means is that we do not develop or test on AIX; there is no fundamental technology reason why Open MPI couldn't be supported on AIX.

AIX support could certainly be re-instated if someone who wanted AIX support joins the core group of developers and contributes the development and testing to support AIX.


269. Does Open MPI work on AIX?

There have been reports from random users that a small number of changes are required to the Open MPI code base to make it work under AIX. For example, see the following post on the Open MPI user's list, reported by Ricardo Fonseca:


270. What is VampirTrace?

VampirTrace is a program tracing package that can collect a very fine grained event trace of your sequential or parallel program. The traces can be visualized by the Vampir tool and a number of other tools that read the Open Trace Format (OTF).

Tracing is interesting for performance analysis and optimization of parallel and HPC (High Performance Computing) applications in general and MPI programs in particular. In fact, that's where the letters 'mpi' in Vampir come from. Therefore, it is integrated into Open MPI for convenience.

VampirTrace is included in Open MPI v1.3 and later.

VampirTrace consists of two main components. First, the instrumentation part slightly modifies the target program so that it is notified about run-time events of interest. Simply replace the compiler wrappers to activate it: mpicc becomes mpicc-vt, mpicxx becomes mpicxx-vt, and so on (note that the *-vt variants of the wrapper compilers are unavailable before Open MPI v1.3). Second, the run-time measurement part is responsible for data collection. This can only be effective when the first part has been performed -- otherwise there will be no effect on your program at all.

VampirTrace has been developed at ZIH, TU Dresden in collaboration with the KOJAK project from JSC/FZ Juelich and is available as open source software under BSD license, see ompi/contrib/vt/vt/COPYING.

The software is also available as a stand-alone source code package. The latest version can always be found at http://www.tu-dresden.de/zih/vampirtrace/.


271. Where can I find the complete documentation of VampirTrace?

A complete documentation of VampirTrace comes with the Open MPI software package as PDF and HTML (in Open MPI v1.3 and later). You can find it in the Open MPI source tree ompi/contrib/vt/vt/doc/ or after installing Open MPI in $(install-prefix)/share/vampirtrace/doc/.


272. How do I instrument my MPI application with VampirTrace?

All the necessary instrumentation of user functions as well as MPI and OpenMP events is handled by special compiler wrappers ( mpicc-vt, mpicxx-vt, mpif77-vt, mpif90-vt ). Unlike the normal wrappers ( mpicc and friends) these wrappers call VampirTrace's compiler wrappers ( vtcc, vtcxx, vtf77, vtf90 ) instead of the native compilers. The vt* wrappers use underlying platform compilers to perform the necessary instrumentation of the program and link the suitable VampirTrace library.

Original:

shell$ mpicc -c hello.c -o hello

With instrumentation:

shell$ mpicc-vt -c hello.c -o hello

For your application, simply change the compiler definitions in your Makefile(s):

# original definitions in Makefile
## CC=mpicc
## CXX=mpicxx
## F90=mpif90

# replace with
CC=mpicc-vt
CXX=mpicxx-vt
F90=mpif90-vt


273. Does VampirTrace cause overhead to my application?

By using the default MPI compiler wrappers ( mpicc etc.) your application will be run without any changes at all. The VampirTrace compiler wrappers ( mpicc-vt etc.) link the VampirTrace library, which intercepts MPI calls and some user-level function/subroutine calls. This adds a certain amount of run-time overhead to applications. Usually, the overhead is reasonably small (0.x% - 5%), and VampirTrace by default enables precautions to avoid excessive overhead. However, it can be configured to produce very substantial overhead using non-default settings.


274. How can I change the underlying compiler of the mpi*-vt wrappers?

Unlike the standard MPI compiler wrappers ( mpicc etc.), the environment variables OMPI_CC, OMPI_CXX, OMPI_F77, and OMPI_F90 do not affect the VampirTrace compiler wrappers. Please use the environment variables VT_CC, VT_CXX, VT_F77, and VT_F90 instead. In addition, you can set the compiler with the wrapper's option -vt:[cc|cxx|f77|f90].

The following two are equivalent, setting the underlying compiler to gcc:

shell$ VT_CC=gcc mpicc-vt -c hello.c -o hello
shell$ mpicc-vt -vt:cc gcc -c hello.c -o hello

Furthermore, you can modify the default settings in /share/openmpi/mpi*-wrapper-data.txt.


275. How can I pass VampirTrace related configure options through the Open MPI configure?

To give options to the VampirTrace configure script you can add these to the configure option --with-contrib-vt-flags.

The following example passes the options --with-papi-lib-dir and --with-papi-lib to the VampirTrace configure script to specify the location and the name of the PAPI library:

shell$ ./configure --with-contrib-vt-flags='--with-papi-lib-dir=/usr/lib64 --with-papi-lib=-lpapi64' ...


276. How do I disable the integrated VampirTrace completely?

By default, the VampirTrace part of Open MPI will be built and installed. If you would like to disable building and installing of VampirTrace add the value vt to the configure option --enable-contrib-no-build.

shell$ ./configure --enable-contrib-no-build=vt ...


277. v1.7 Series

  1. 1.7.3