
FAQ:
Rollup of ALL FAQ categories and questions


Table of contents:

  1. What is MPI? What is Open MPI?
  2. Where can I learn about MPI? Are there tutorials available?
  3. What are the goals of the Open MPI Project?
  4. Will you allow external involvement?
  5. How is this software licensed?
  6. I want to redistribute Open MPI. Can I?
  7. Preventing forking is a goal; how will you enforce that?
  8. How are 3rd party contributions handled?
  9. Is this just YAMPI (yet another MPI implementation)?
  10. But I love [FT-MPI | LA-MPI | LAM/MPI | PACX-MPI]! Why should I use Open MPI?
  11. What will happen to the prior projects?
  12. What operating systems does Open MPI support?
  13. What hardware platforms does Open MPI support?
  14. What network interconnects does Open MPI support?
  15. What run-time environments does Open MPI support?
  16. Does Open MPI support LSF?
  17. How much MPI does Open MPI support?
  18. Is Open MPI thread safe?
  19. Does Open MPI support 64 bit environments?
  20. Does Open MPI support execution in heterogeneous environments?
  21. Does Open MPI support parallel debuggers?
  22. Can I contribute to Open MPI?
  23. I found a bug! How do I report it?
  24. What license is Open MPI distributed under?
  25. How do I contribute code to Open MPI?
  26. I can't submit an Open MPI Third Party Contribution Agreement; how can I contribute to Open MPI?
  27. What if I don't want my contribution to be free / open source?
  28. I want to fork the Open MPI code base. Can I?
  29. Rats! My contribution was not accepted into the main Open MPI code base. What now?
  30. Open MPI terminology
  31. How do I get a copy of the most recent source code?
  32. Ok, I got a Subversion checkout. Now how do I build it?
  33. What is the main tree layout of the Open MPI source tree? Are there directory name conventions?
  34. Is there more information available?
  35. More coming...
  36. I'm a sysadmin; what do I care about Open MPI?
  37. What hardware / software / run-time environments / networks does Open MPI support?
  38. Do I need multiple Open MPI installations?
  39. What are MCA Parameters? Why would I set them?
  40. Do my users need to have their own installation of Open MPI?
  41. I have power users who will want to override my global MCA parameters; is this possible?
  42. What MCA parameters should I, the system administrator, set?
  43. I just added a new plugin to my Open MPI installation; do I need to recompile all my MPI apps?
  44. I just upgraded my Myrinet|Infiniband network; do I need to recompile all my MPI apps?
  45. We just upgraded our version of Open MPI; do I need to recompile all my MPI apps?
  46. I have an MPI application compiled for another MPI; will it work with Open MPI?
  47. What is "fault tolerance"?
  48. What fault tolerance techniques does Open MPI plan on supporting?
  49. Does Open MPI support checkpoint and restart of parallel jobs (similar to LAM/MPI)?
  50. Where can I find the fault tolerance development work?
  51. Does Open MPI support end-to-end data reliability in MPI message passing?
  52. How do I build Open MPI?
  53. Wow -- I see a lot of errors during configure. Is that normal?
  54. What are the default build options for Open MPI?
  55. Open MPI was pre-installed on my machine; should I overwrite it with a new version?
  56. Where should I install Open MPI?
  57. Should I install a new version of Open MPI over an old version?
  58. Can I disable Open MPI's use of plugins?
  59. How do I build an optimized version of Open MPI?
  60. Are VPATH and/or parallel builds supported?
  61. Do I need any special tools to build Open MPI?
  62. How do I build Open MPI as a static library?
  63. When I run 'make', it looks very much like the build system is going into a loop.
  64. Configure issues warnings about sed and unterminated commands
  65. Open MPI configured ok, but I get "Makefile:602: *** missing separator" kinds of errors when building
  66. Open MPI seems to default to building with the GNU compiler set. Can I use other compilers?
  67. Can I pass specific flags to the compilers / linker used to build Open MPI?
  68. I'm trying to build with the Intel compilers, but Open MPI eventually fails to compile with really long error messages. What do I do?
  69. When I build with the Intel compiler suite, linking user MPI applications with the wrapper compilers results in warning messages. What do I do?
  70. I'm trying to build with the IBM compilers, but Open MPI eventually fails to compile. What do I do?
  71. I'm trying to build with the Oracle Solaris Studio (Sun) compilers on Linux, but Open MPI eventually fails to compile. What do I do?
  72. What configure options should I use when building with the Oracle Solaris Studio (Sun) compilers?
  73. When building with the Oracle Solaris Studio 12 Update 1 (Sun) compilers on x86 Linux, the compiler loops on btl_sm.c. Is there a workaround?
  74. How do I build Open MPI on IBM QS22 cell blade machines with GCC and XLC/XLF compilers?
  75. I'm trying to build with the PathScale 3.0 and 3.1 compilers on Linux, but all Open MPI commands seg fault. What do I do?
  76. All MPI C++ API functions return errors (or otherwise fail) when Open MPI is compiled with the PathScale compilers. What do I do?
  77. How do I build Open MPI with support for Open IB (Infiniband), mVAPI (Infiniband), GM (Myrinet), and/or MX (Myrinet)?
  78. How do I build Open MPI with support for SLURM / XGrid?
  79. How do I build Open MPI with support for SGE?
  80. How do I build Open MPI with support for PBS Pro / Open PBS / Torque?
  81. How do I build Open MPI with support for LoadLeveler?
  82. How do I build Open MPI with support for Platform LSF?
  83. How do I build Open MPI with processor affinity support?
  84. How do I build Open MPI with memory affinity / NUMA support (e.g., libnuma)?
  85. How do I build Open MPI with CUDA-aware support?
  86. How do I not build a specific plugin / component for Open MPI?
  87. What other options to [configure] exist?
  88. Why does compiling the Fortran 90 bindings take soooo long?
  89. Does Open MPI support MPI_REAL16 and MPI_COMPLEX32?
  90. Can I re-locate my Open MPI installation without re-configuring/re-compiling/re-installing from source?
  91. I'm still having problems / my problem is not listed here. What do I do?
  92. In general, how do I build MPI applications with Open MPI?
  93. Wait -- what is mpifort? Shouldn't I use mpif77 and mpif90?
  94. I can't / don't want to use Open MPI's wrapper compilers. What do I do?
  95. How do I override the flags specified by Open MPI's wrapper compilers? (v1.0 series)
  96. How do I override the flags specified by Open MPI's wrapper compilers? (v1.1 series and beyond)
  97. How can I tell what the wrapper compiler default flags are?
  98. Why does "mpicc --showme <some flags>" not show any MPI-relevant flags?
  99. Are there ways to just add flags to the wrapper compilers?
  100. Why don't the wrapper compilers add "-rpath" (or similar) flags by default?
  101. Can I build 100% static MPI applications?
  102. Can I build 100% static OpenFabrics / OpenIB / OFED MPI applications on Linux?
  103. Why does it take soooo long to compile F90 MPI applications?
  104. How do I build BLACS with Open MPI?
  105. How do I build ScaLAPACK with Open MPI?
  106. How do I build PETSc with Open MPI?
  107. How do I build VASP with Open MPI?
  108. Are other language / application bindings available for Open MPI?
  109. What pre-requisites are necessary for running an Open MPI job?
  110. What ABI guarantees does Open MPI provide?
  111. Do I need a common filesystem on all my nodes?
  112. How do I add Open MPI to my PATH and LD_LIBRARY_PATH?
  113. What if I can't modify my PATH and/or LD_LIBRARY_PATH?
  114. How do I launch Open MPI parallel jobs?
  115. How do I run a simple SPMD MPI job?
  116. How do I run an MPMD MPI job?
  117. How do I specify the hosts on which my MPI job runs?
  118. I can run ompi_info and launch MPI jobs on a single host, but not across multiple hosts. Why?
  119. How can I diagnose problems when running across multiple hosts?
  120. When I build Open MPI with the Intel compilers, I get warnings about "orted" or my MPI application not finding libimf.so. What do I do?
  121. When I build Open MPI with the PGI compilers, I get warnings about "orted" or my MPI application not finding libpgc.so. What do I do?
  122. When I build Open MPI with the Pathscale compilers, I get warnings about "orted" or my MPI application not finding libmv.so. What do I do?
  123. Can I run non-MPI programs with mpirun / mpiexec?
  124. Can I run GUI applications with Open MPI?
  125. Can I run ncurses-based / curses-based / applications with funky input schemes with Open MPI?
  126. What other options are available to mpirun?
  127. How do I use the --hostfile option to mpirun?
  128. How do I use the --host option to mpirun?
  129. How do I control how my processes are scheduled across nodes?
  130. I'm not using a hostfile. How are slots calculated?
  131. Can I run multiple parallel processes on a uniprocessor machine?
  132. Can I oversubscribe nodes (run more processes than processors)?
  133. Can I force Aggressive or Degraded performance modes?
  134. How do I run with the TotalView parallel debugger?
  135. How do I run with the DDT parallel debugger?
  136. What launchers are available?
  137. How do I specify to the rsh launcher to use rsh or ssh?
  138. How do I run with the SLURM and PBS/Torque launchers?
  139. Can I suspend and resume my job?
  140. How do I run with LoadLeveler?
  141. How do I load libmpi at runtime?
  142. What MPI environmental variables exist?
  143. How do I get my MPI job to wireup its MPI connections right away?
  144. What kind of CUDA support exists in Open MPI?
  145. Open MPI tells me that it fails to load components with a "file not found" error -- but the file is there! Why does it say this?
  146. I see strange messages about missing symbols in my application; what do these mean?
  147. What is mca_pml_teg.so? Why am I getting warnings about not finding the mca_ptl_base_modules_initialized symbol from it?
  148. Can I build shared libraries on AIX with the IBM XL compilers?
  149. Why am I getting a seg fault in libopal?
  150. Why am I getting seg faults / MPI parameter errors when compiling C++ applications with the Intel 9.1 C++ compiler?
  151. All my MPI applications segv! Why? (Intel Linux 12.1 compiler)
  152. Why can't I attach my parallel debugger (TotalView, DDT, fx2, etc.) to parallel jobs?
  153. When launching large MPI jobs, I see messages like: mca_oob_tcp_peer_complete_connect: connection failed: Connection timed out (110) - retrying
  154. How do I find out what MCA parameters are being seen/used by my job?
  155. How do I debug Open MPI processes in parallel?
  156. What tools are available for debugging in parallel?
  157. How do I run with parallel debuggers?
  158. What controls does Open MPI have that aid in debugging?
  159. Do I need to build Open MPI with compiler/linker debugging flags (such as -g) to be able to debug MPI applications?
  160. Can I use serial debuggers (such as gdb) to debug MPI applications?
  161. My process dies without any output. Why?
  162. What is Memchecker?
  163. What kind of errors can Memchecker find?
  164. How can I use Memchecker?
  165. How to run my MPI application with Memchecker?
  166. Does Memchecker cause performance degradation to my application?
  167. Is Open MPI 'Valgrind-clean' or how can I identify real errors?
  168. Can I make Open MPI use rsh instead of ssh?
  169. What pre-requisites are necessary for running an Open MPI job under rsh/ssh?
  170. How can I make ssh not ask me for a password?
  171. What is a .rhosts file? Do I need it?
  172. Should I use + in my .rhosts file?
  173. What versions of BProc does Open MPI work with?
  174. What pre-requisites are necessary for running an Open MPI job under BProc?
  175. How do I run jobs under Torque / PBS Pro?
  176. Does Open MPI support Open PBS?
  177. How does Open MPI get the list of hosts from Torque / PBS Pro?
  178. What happens if $PBS_NODEFILE is modified?
  179. Can I specify a hostfile or use the --host option to mpirun when running in a Torque / PBS environment?
  180. How do I run with the SGE launcher?
  181. Does the SGE tight integration support the -notify flag to qsub?
  182. Can I suspend and resume my job?
  183. How do I run jobs under SLURM?
  184. Does Open MPI support "srun -n X my_mpi_application"?
  185. I use SLURM on a cluster with the OpenFabrics network stack. Do I need to do anything special?
  186. Any issues with Slurm 2.6.3?
  187. How do I reduce startup time for jobs on large clusters?
  188. Where should I put my libraries: Network vs. local filesystems?
  189. Static vs shared libraries?
  190. How do I reduce the time to wireup OMPI's out-of-band communication system?
  191. Why is my job failing because of file descriptor limits?
  192. I know my cluster's configuration - how can I take advantage of that knowledge?
  193. What is the Modular Component Architecture (MCA)?
  194. What are MCA parameters?
  195. What frameworks are in Open MPI?
  196. What frameworks are in Open MPI v1.2 (and prior)?
  197. What frameworks are in Open MPI v1.3?
  198. How do I know what components are in my Open MPI installation?
  199. How do I install my own components into an Open MPI installation?
  200. How do I know what MCA parameters are available?
  201. How do I set the value of MCA parameters?
  202. What are Aggregate MCA (AMCA) parameter files?
  203. How do I select which components are used?
  204. What is processor affinity? Does Open MPI support it?
  205. What is memory affinity? Does Open MPI support it?
  206. How do I tell Open MPI to use processor and/or memory affinity?
  207. How do I tell Open MPI to use processor and/or memory affinity in Open MPI v1.2.x? (What is mpi_paffinity_alone?)
  208. How do I tell Open MPI to use processor and/or memory affinity in Open MPI v1.3.x? (What are rank files?)
  209. How do I tell Open MPI to use processor and/or memory affinity in Open MPI v1.4.x? (How do I use the --by* and --bind-to-* options?)
  210. How do I tell Open MPI to use processor and/or memory affinity in Open MPI v1.5.x?
  211. Does Open MPI support calling fork(), system(), or popen() in MPI processes?
  212. I want to run some performance benchmarks with Open MPI. How do I do that?
  213. I am getting a MPI_Win_free error from IMB-EXT -- what do I do?
  214. What is the sm BTL?
  215. How do I specify use of sm for MPI messages?
  216. How does the sm BTL work?
  217. Why does my MPI job no longer start when there are too many processes on one node?
  218. How do I know what MCA parameters are available for tuning MPI performance?
  219. How can I tune these parameters to improve performance?
  220. Where is the file that sm will mmap in?
  221. Why am I seeing incredibly poor performance with the sm BTL?
  222. Can I use SysV instead of mmap?
  223. How much shared memory will my job use?
  224. How much shared memory do I need?
  225. How can I decrease my shared-memory usage?
  226. How do I specify to use the TCP network for MPI messages?
  227. But wait -- I'm using a high-speed network. Do I have to disable the TCP BTL?
  228. How do I know what MCA parameters are available for tuning MPI performance?
  229. Does Open MPI use the TCP loopback interface?
  230. I have multiple TCP networks on some/all of my cluster nodes. Which ones will Open MPI use?
  231. I'm getting TCP-related errors. What do they mean?
  232. How do I tell Open MPI which TCP interfaces / networks to use?
  233. Does Open MPI open a bunch of sockets during MPI_INIT?
  234. Are there any Linux kernel TCP parameters that I should set?
  235. How does Open MPI know which TCP addresses are routable to each other in Open MPI 1.2?
  236. How does Open MPI know which TCP addresses are routable to each other in Open MPI 1.3 (and beyond)?
  237. Does Open MPI ever close TCP sockets?
  238. Does Open MPI support IP interfaces that have more than one IP address?
  239. Does Open MPI support virtual IP interfaces?
  240. What Myrinet-based components does Open MPI have?
  241. How do I specify to use the Myrinet GM network for MPI messages?
  242. How do I specify to use the Myrinet MX network for MPI messages?
  243. But wait -- I also have a TCP network. Do I need to explicitly disable the TCP BTL?
  244. How do I know what MCA parameters are available for tuning MPI performance?
  245. I'm experiencing a problem with Open MPI on my Myrinet-based network; how do I troubleshoot and get help?
  246. How do I adjust the MX first fragment size? Are there constraints?
  247. What versions of Open MPI contain support for uDAPL?
  248. What is different between Sun Microsystems ClusterTools 7 and Open MPI in regards to the uDAPL BTL?
  249. What values are expected to be used by the btl_udapl_if_include and btl_udapl_if_exclude mca parameter?
  250. Where is the static uDAPL Registry found?
  251. How come the value reported by "ifconfig" is not accepted by the btl_udapl_if_include/btl_udapl_if_exclude MCA parameter?
  252. I get a warning message about not being able to register memory and possibly out of privileged memory while running on Solaris, what can I do?
  253. What is special about MPI performance analysis?
  254. What are "profiling" and "tracing"?
  255. How do I sort out busy wait time from idle wait, user time from system time, and so on?
  256. What is PMPI?
  257. Should I use those switches --enable-mpi-profile and --enable-trace when I configure OMPI?
  258. What support does OMPI have for performance analysis?
  259. How do I view VampirTrace output?
  260. Are there MPI performance analysis tools for OMPI that I can download for free?
  261. Any other kinds of tools I should know about?
  262. How does Open MPI handle HFS+ / UFS filesystems?
  263. How do I use the Open MPI wrapper compilers in XCode?
  264. How do I run jobs under XGrid?
  265. Where do I get more information about running under XGrid?
  266. Is Open MPI included in OS X?
  267. How do I not use the OS X-bundled Open MPI?
  268. Is AIX a supported operating system for Open MPI?
  269. Does Open MPI work on AIX?
  270. What is VampirTrace?
  271. Where can I find the complete documentation of VampirTrace?
  272. How to instrument my MPI application with VampirTrace?
  273. Does VampirTrace cause overhead to my application?
  274. How can I change the underlying compiler of the mpi*-vt wrappers?
  275. How can I pass VampirTrace related configure options through the Open MPI configure?
  276. How to disable the integrated VampirTrace, completely?
  277. v1.7 Series


1. What is MPI? What is Open MPI?

MPI stands for the Message Passing Interface. Written by the MPI Forum (a large committee comprising a cross-section of industry and research representatives), MPI is a standardized API typically used for parallel and/or distributed computing. The MPI standard comprises two documents: MPI-1 (published in 1994) and MPI-2 (published in 1997). MPI-2 is, for the most part, a set of additions and extensions to the original MPI-1 specification.

The MPI-1 and MPI-2 documents can be downloaded from the official MPI Forum web site: http://www.mpi-forum.org/.

Open MPI is an open source, freely available implementation of both the MPI-1 and MPI-2 documents. The Open MPI software achieves high performance; the Open MPI project is quite receptive to community input.


2. Where can I learn about MPI? Are there tutorials available?

There are many resources available on the internet for learning MPI.

  • The definitive reference for MPI is the MPI Forum Web site. It has copies of the MPI standards documents and all of the errata. This is not recommended for beginners, but is an invaluable reference.
  • Several books on MPI are available (search your favorite book sellers for availability):
    • MPI: The Complete Reference, Marc Snir et al. (an annotated version of the MPI-1 and MPI-2 standard; a 2 volume set, also known as "The orange book" and "The yellow book")
    • Using MPI, William Gropp et al. (2nd edition, also known as "The purple book")
    • Parallel Programming With MPI, Peter Pacheco
    • ...and others. This is not a definitive list!
  • The "Introduction to MPI" and "Intermediate MPI" tutorials are excellent web-based MPI instruction offered by the NCSA. This is a great place for beginners.
  • The LAM/MPI web site has links to a few tutorials.
  • Last but not least, searching for "MPI tutorial" on Google turns up a wealth of information (some good, some bad)


3. What are the goals of the Open MPI Project?

We have several top-level goals:

  • Create a free, open source, peer-reviewed, production-quality complete MPI-2 implementation.
  • Provide extremely high, competitive performance (latency, bandwidth, ...pick your favorite metric).
  • Directly involve the HPC community with external development and feedback (vendors, 3rd party researchers, users, etc.).
  • Provide a stable platform for 3rd party research and commercial development.
  • Help prevent the "forking problem" common to other MPI projects.
  • Support a wide variety of HPC platforms and environments.

In short, we want to work with and for the HPC community to make a world-class MPI-2 implementation that can be used on a huge number and kind of systems.


4. Will you allow external involvement?

ABSOLUTELY.

Bringing together smart researchers and developers to work on a common product is not only a good idea, it's the open source model. Merging the multiple MPI implementation teams has worked extremely well for us over the past year -- extending this concept to the HPC open source community is the next logical step.

The component architecture that Open MPI is founded upon (see the "Publications" link for papers about this) is designed to foster 3rd party collaboration by enabling independent developers to use Open MPI as a production quality research platform. Although Open MPI is a relatively large code base, it is rarely necessary to learn much more than the interfaces for the component type which you are implementing. Specifically, the component architecture was designed to allow small, discrete implementations of major portions of MPI functionality (e.g., point-to-point messaging, collective communications, run-time environment support, etc.).

We envision at least the following forms of collaboration:

  • Peer review of the Open MPI code base
  • Discussion with Open MPI developers on public mailing lists
  • Direct involvement from HPC software and hardware vendors
  • 3rd parties writing and providing their own Open MPI components


5. How is this software licensed?

The Open MPI code base is licensed under the new BSD license.

That being said, although we are an open source project, we recognize that not everyone provides free, open source software. Our collaboration models allow (and encourage!) 3rd parties to write and distribute their own components -- perhaps with a different license, and perhaps even as closed source. This is all perfectly acceptable (and desirable!).


6. I want to redistribute Open MPI. Can I?

Absolutely.

NOTE: We are not lawyers and this is not legal advice.

Please read the Open MPI license (the BSD license). It contains extremely liberal provisions for redistribution.


7. Preventing forking is a goal; how will you enforce that?

By definition, we can't. If someone really wants to fork the Open MPI code base, they can. By virtue of our extremely liberal license, it is possible for anyone to fork at any time.

However, we hope that no one does.

We intend to distinguish ourselves from other projects by:

  • Working with the HPC community to accept best-in-breed improvements and functionality enhancements.
  • Providing a flexible framework and set of APIs that allow a wide variety of different goals to be pursued within the same code base through the combinatorial effect of mixing and matching different components.

Hence, we hope that no one ever has a reason to fork the main code base. We intend to work with the community to accept the best improvements back into the main code base. And if some developers want to do things to the main code base that are different than the goals of the main Open MPI Project, it is our hope that they can do what they need in components that can be distributed without forking the main Open MPI code base.

Only time will tell if this ambitious plan is feasible, but we're going to work hard to make it a reality!


8. How are 3rd party contributions handled?

Before accepting any code from 3rd parties, we require an original signed contribution agreement from the donator.

These agreements assert that the contributor has the right to donate the code and allow the Open MPI Project to perpetually distribute it under the project's licensing terms.

This prevents a situation where intellectual property gets into the Open MPI code base and then someone later claims that we owe them money for it. Open MPI is a free, open source code base. And we intend it to remain that way.

The Contributing to Open MPI FAQ topic contains more information on this issue.


9. Is this just YAMPI (yet another MPI implementation)?

No!

Open MPI initially represented the merger between three well-known MPI implementations (none of which are being developed any more):

  • FT-MPI from the University of Tennessee
  • LA-MPI from Los Alamos National Laboratory
  • LAM/MPI from Indiana University

with contributions from the PACX-MPI team at the University of Stuttgart.

Each of these MPI implementations excelled in one or more areas. The driving motivation behind Open MPI is to bring the best ideas and technologies from the individual projects and create one world-class open source MPI implementation that excels in all areas.

Open MPI was started with the best of the ideas from these four MPI implementations and ported them to an entirely new code base: Open MPI. This also had the simultaneous effect of enabling us to jettison old, crufty code that was only maintained for historical reasons from each project. We started with a clean slate and decided to "do it Right this time." As such, Open MPI also contains many new designs and methodologies based on (literally) years of MPI implementation experience.

After version 1.0 was released, the Open MPI Project grew to include many other members who have each brought their knowledge, expertise, and resources to Open MPI. Open MPI is now far more than just the best ideas of the four founding MPI implementation projects.


10. But I love [FT-MPI | LA-MPI | LAM/MPI | PACX-MPI]! Why should I use Open MPI?

Here's a few reasons:

  • Open MPI represents the next generation of each of these implementations.
  • Open MPI effectively contains the union of features from each of the previous MPI projects. If you find a feature in one of the prior projects that is not in Open MPI, chances are that it will be soon.
  • The vast majority of our future research and development work will be in Open MPI.
  • All the same developers from your favorite project are working on Open MPI.

Not to worry -- each of the respective teams has a vested interest in bringing over the "best" parts of their prior implementation to Open MPI. Indeed, we would love to migrate each of our current user bases to Open MPI as their time, resources, and constraints allow.

In short: we believe that Open MPI -- its code, methodology, and open source philosophy -- is the future.


11. What will happen to the prior projects?

Only time will tell (we cannot predict the future), but it is likely that each project will eventually either end when funding stops or be used exclusively as a research vehicle. Indeed, some of the projects must continue to exist at least until their existing funding expires.


12. What operating systems does Open MPI support?

We primarily develop Open MPI on Linux, OS X, Solaris (both 32 and 64 bit on all platforms), and Windows (Windows XP, Windows HPC Server 2003/2008, and Windows 7 RC).

Open MPI is fairly POSIX-neutral, so it will run without too many modifications on most POSIX-like systems. Hence, if we haven't listed your favorite operating system here, it should not be difficult to get Open MPI to compile and run properly. The biggest obstacle is typically the assembly language, but that's fairly modular and we're happy to provide information about how to port it to new platforms.

It should be noted that we are quite open to accepting patches for operating systems that we do not currently support. If we do not have systems to test these on, we probably will only claim to "unofficially" support those systems.

Microsoft Windows support was added in v1.3.3; please see the file README.WINDOWS.


13. What hardware platforms does Open MPI support?

Essentially all the common platforms that the operating systems listed in the previous question support.

For example, Linux runs on a wide variety of platforms, and we certainly can't claim to support all of them (e.g., Open MPI does not run in an embedded environment), but we include assembly support for Intel, AMD, and PowerPC chips, for example.


14. What network interconnects does Open MPI support?

Open MPI is based upon a component architecture; its MPI point-to-point functionality only utilizes a small number of components at run-time. Adding native support for a new network interconnect was specifically designed to be easy.

Here's the list of networks that we natively support for point-to-point communication:

  • TCP / ethernet
  • Shared memory
  • Loopback (send-to-self)
  • Myrinet / GM
  • Myrinet / MX
  • Infiniband / OpenIB
  • Infiniband / mVAPI
  • Portals

Is there a network that you'd like to see supported that is not shown above? Contributions are welcome!


15. What run-time environments does Open MPI support?

Open MPI is layered on top of the Open Run-Time Environment (ORTE), which originally started as a small portion of the Open MPI code base. However, ORTE has effectively spun off into its own sub-project.

ORTE is a modular system that was specifically architected to abstract away the back-end run-time environment (RTE) system, providing a neutral API to the upper-level Open MPI layer. Components can be written for ORTE that allow it to natively utilize a wide variety of back-end RTEs.

ORTE currently natively supports the following run-time environments:

  • Recent versions of BProc (e.g., Clustermatic)
  • Sun Grid Engine
  • PBS Pro, Torque, and Open PBS (the TM system)
  • LoadLeveler
  • LSF
  • POE
  • rsh / ssh
  • SLURM
  • XGrid
  • Yod (Red Storm)

Is there a run-time system that you'd like to use Open MPI with that is not listed above? Component contributions are welcome!


16. Does Open MPI support LSF?

Starting with Open MPI v1.3, yes!

Prior to Open MPI v1.3, Platform released a script-based integration in the LSF 6.1 and 6.2 maintenance packs around November of 2006. If you want this integration, please contact your normal Platform support channels.


17. How much MPI does Open MPI support?

Open MPI 1.2 supports all of MPI-2.0.

Open MPI 1.3 supports all of MPI-2.1.


18. Is Open MPI thread safe?

Support for MPI_THREAD_MULTIPLE (i.e., multiple threads executing within the MPI library) and asynchronous message passing progress (i.e., continuing message passing operations even while no user threads are in the MPI library) has been designed into Open MPI from its first planning meetings.

Support for MPI_THREAD_MULTIPLE is included in the first version of Open MPI, but it is only lightly tested and likely still has some bugs. Support for asynchronous progress is included in the TCP point-to-point device, but it, too, has only had light testing and likely still has bugs.

Completing the testing for full support of MPI_THREAD_MULTIPLE and asynchronous progress is planned in the near future.


19. Does Open MPI support 64 bit environments?

Yes, Open MPI is 64 bit clean. You should be able to use Open MPI on 64 bit architectures and operating systems with no difficulty.


20. Does Open MPI support execution in heterogeneous environments?

As of v1.1, Open MPI requires that the size of C, C++, and Fortran datatypes be the same on all platforms within a single parallel application with the exception of types represented by MPI_BOOL and MPI_LOGICAL -- size differences in these types between processes are properly handled. Endian differences between processes in a single MPI job are properly and automatically handled.

Prior to v1.1, Open MPI did not include any support for data size or endian heterogeneity.


21. Does Open MPI support parallel debuggers?

Yes. Open MPI supports the TotalView API for parallel process attaching, which several parallel debuggers support (e.g., DDT, fx2). As part of v1.2.4 (released in September 2007), Open MPI also supports the TotalView API for viewing message queues in running MPI processes.

See this FAQ entry for details on how to run Open MPI jobs under TotalView, and this FAQ entry for details on how to run Open MPI jobs under DDT.

NOTE: The integration of Open MPI message queue support is problematic with 64 bit versions of TotalView prior to v8.3:

  • The message queues views will be truncated
  • Both the communicators and requests list will be incomplete
  • Both the communicators and requests list may be filled with wrong values (such as an MPI_Send to the destination ANY_SOURCE)

There are two workarounds:

  • Use a 32 bit version of TotalView
  • Upgrade to TotalView v8.3


22. Can I contribute to Open MPI?

YES!

One of the main goals of the Open MPI project is to involve the greater HPC community.

There are many ways to contribute to Open MPI. Here are a few:

  • Subscribe to the mailing lists and become active in the discussions
  • Obtain a source code checkout of Open MPI's code base and start looking through the code (be sure to see the Developers category for technical details about the code base)
  • Write your own components and distribute them yourself (i.e., outside of the main Open MPI distribution)
  • Write your own components and contribute them back to the main code base
  • Contribute bug fixes and feature enhancements to the main code base


23. I found a bug! How do I report it?

First, check the FAQ and the mailing list archives to see whether your issue is already known. If you can't find your problem mentioned anywhere, it is most helpful if you can create a "recipe" to replicate the bug.

Please see the Getting Help page for more details on submitting bug reports.


24. What license is Open MPI distributed under?

Open MPI is distributed under the BSD license.


25. How do I contribute code to Open MPI?

Similar to the Apache projects, before you contribute any code to the Open MPI code base, you must first print out, sign, and submit an Open MPI Third Party Contribution Agreement.

NOTE: We are not lawyers and this is not legal advice.

We need to have an established intellectual property pedigree of the code in Open MPI. This means being able to ensure that all code included in Open MPI is free, open source, and able to be distributed under the BSD license. This prevents a situation where intellectual property gets into the Open MPI code base and then someone later claims that we owe them money for it. Open MPI is a free, open source code base. And we intend it to remain that way.

We enforce this policy by requiring all code contributors to submit a signed Open MPI Third Party Contribution Agreement before we can accept any code from them. These agreements assert that the contributor has the right to donate the code and allow the Open MPI Project to perpetually distribute it under the project's licensing terms.

There are two versions of this agreement: one for individuals, and one for organizations. Ensure that you use the correct form; for example, some companies own all the code produced by their employees, so even if you write code in your spare time, it may still be the intellectual property of your employer.

Send an original, signed copy to the address on the form.

We must have a copy of this agreement on file before we can accept code into the Open MPI code base.


26. I can't submit an Open MPI Third Party Contribution Agreement; how can I contribute to Open MPI?

Fear not.

Although we cannot accept code from you, there are still plenty of other ways to contribute to Open MPI. Here are some examples:

  • Become an active participant in the mailing lists
  • Write and distribute your own components (remember: Open MPI components can be distributed completely separately from the main Open MPI distribution -- they can be added to existing Open MPI installations, and don't even need to be open source)
  • Report bugs
  • Do a good deed daily


27. What if I don't want my contribution to be free / open source?

No problem.

While we are creating free / open-source software, and we would prefer if everyone's contributions to Open MPI were also free / open-source, we certainly recognize that other organizations have different goals than we do. Such is the reality of software development in today's global economy.

As such, it is perfectly acceptable to make non-free / non-open-source contributions to Open MPI.

We obviously cannot accept such contributions into the main code base, but you are free to distribute plugins, enhancements, etc. as you see fit. Indeed, the BSD license is extremely liberal in its redistribution provisions.

Please also see this FAQ entry about forking the Open MPI code base.


28. I want to fork the Open MPI code base. Can I?

Yes... but we'd prefer if you didn't.

Although Open MPI's license allows third parties to fork the code base, we would strongly prefer if you did not. Forking is not necessarily a Bad Thing, but history has shown that creating too many forks in MPI implementations leads to massive user and system administrator confusion. We have personally seen parallel environments loaded with tens of MPI implementations, each only slightly different from the others. The users then become responsible for figuring out which MPI they want / need to use, which can be a daunting and confusing task.

We do periodically have "short" forks. Specifically, sometimes an organization needs to release a version of Open MPI with a specific feature.

If you're thinking of forking the Open MPI code base, please let us know -- let's see if we can work something out so that it is not necessary.


29. Rats! My contribution was not accepted into the main Open MPI code base. What now?

If your contribution was not accepted into the main Open MPI code base, there are likely to be good reasons for it (perhaps technical, perhaps due to licensing restrictions, etc.).

If you wrote a standalone component, you can still distribute this component independent of the main Open MPI distribution. Open MPI components can be installed into existing Open MPI installations. As such, you can distribute your component -- even if it is closed source (e.g., distributed as binary-only) -- via any mechanism you choose, such as on a web site, FTP site, etc.


30. Open MPI terminology

Open MPI is a large project containing many different sub-systems and a relatively large code base. Let's first cover some fundamental terminology in order to make the rest of the discussion easier.

Open MPI has three sections of code:

  • OMPI: The MPI API and supporting logic
  • ORTE: The Open Run-Time Environment (support for different back-end run-time systems)
  • OPAL: The Open Portable Access Layer (utility and "glue" code used by OMPI and ORTE)

There are strict abstraction barriers in the code between these sections. That is, they are compiled into three separate libraries (libmpi, libopen-rte, and libopen-pal) with a strict dependency order: OMPI depends on ORTE and OPAL, and ORTE depends on OPAL. More specifically, OMPI executables are linked with:

shell$ mpicc myapp.c -o myapp
# This actually turns into:
shell$ cc myapp.c -o myapp -lmpi -lopen-rte -lopen-pal ...

More system-level libraries may be listed after -lopen-pal, but you get the idea.
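
If you want to see exactly what the wrapper compiler will invoke on your system, you can ask it directly with the --showme option (discussed further in the wrapper compiler questions later in this FAQ):

shell$ mpicc myapp.c -o myapp --showme

This prints the underlying compiler command line, including the library list shown above, without actually running it.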

Strictly speaking, these are not "layers" in the classic software engineering sense (even though it is convenient to refer to them as such). They are listed above in dependency order, but that does not mean that, for example, the OMPI code must go through the ORTE and OPAL code in order to reach the operating system or a network interface.

As such, this code organization reflects abstractions and software engineering more than a strict hierarchy of functions that must be traversed in order to reach a lower layer. For example, OMPI can call OPAL functions directly -- it does not have to go through ORTE. Indeed, OPAL has a different set of purposes than ORTE, so it wouldn't even make sense to channel all OPAL access through ORTE. OMPI can also directly call the operating system as necessary. For example, many top-level MPI API functions are quite performance sensitive; it would not make sense to force them to traverse an arbitrarily deep call stack just to move some bytes across a network.

Here are some terms that are frequently used in discussions about the Open MPI code base:

  • Framework: a construct created for a single, targeted purpose (e.g., MPI point-to-point transport). A framework defines the interfaces that its components must implement.
  • Component: an implementation of a framework's interface; this is what is colloquially called a "plugin."
  • Module: a run-time instance of a component (loosely analogous to a C++ object being an instance of a class).

Frameworks, components, and modules can be dynamic or static. That is, they can be available as plugins or they may be compiled statically into libraries (e.g., libmpi).


31. How do I get a copy of the most recent source code?

See the instructions here.
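
For reference, the development trunk has historically been available via anonymous Subversion read access; a checkout typically looks something like the following (verify the repository URL against the instructions linked above):

shell$ svn checkout http://svn.open-mpi.org/svn/ompi/trunk ompi-trunk
shell$ cd ompi-trunk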


32. Ok, I got a Subversion checkout. Now how do I build it?

See the instructions here.
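
As a rough sketch (the linked instructions are authoritative), a build from a Subversion checkout first generates the configure script, which requires recent versions of the GNU Autotools to be installed:

shell$ ./autogen.sh
shell$ ./configure --prefix=/path/to/install
shell$ make all install

The --prefix value shown here is just a placeholder; pick an installation location appropriate for your system.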


33. What is the main tree layout of the Open MPI source tree? Are there directory name conventions?

There are a few notable top-level directories in the source tree:

  • config/: M4 scripts supporting the top-level configure script
  • etc/: Some miscellaneous text files
  • include/: Top-level include files that will be installed (e.g., mpi.h)
  • ompi/: The Open MPI code base
  • orte/: The Open RTE code base
  • opal/: The OPAL code base

Each of the three main source directories (ompi/, orte/, and opal/) generates a top-level library, named libmpi, libopen-rte, and libopen-pal, respectively. They can be built as either static or shared libraries. Executables are also produced in subdirectories of some of the trees.

Each of the sub-project source directories has a similar (but not identical) directory structure under it:

  • class/: C++-like "classes" (using the OPAL class system) specific to this project
  • include/: Top-level include files specific to this project
  • mca/: MCA frameworks and components specific to this project
  • runtime/: Startup and shutdown of this project at runtime
  • tools/: Executables specific to this project (currently none in OPAL)
  • util/: Random utility code

There are other top-level directories in each of the three sub-projects, each having to do with specific logic and code for that project. For example, the MPI API implementations can be found under ompi/mpi/LANGUAGE, where LANGUAGE is c, cxx, f77, and f90.

The layout of the mca/ trees is strictly defined. They are of the form:

<project>/mca/<framework name>/<component name>/

To be explicit: it is forbidden to have a directory under the mca trees that does not meet this template (with the exception of base directories, explained below). Hence, only framework and component code can be in the mca/ trees.

Framework and component names must be valid directory names (and valid C identifiers; more on that later). For example, the TCP BTL component is located in the following directory:

ompi/mca/btl/tcp/

The name base is reserved; there cannot be a framework or component named "base." Directories named base are reserved for the implementation of the MCA and the frameworks. Here are a few examples:

# Main implementation of the MCA
opal/mca/base

# Implementation of the paffinity framework
opal/mca/paffinity/base

# Implementation of the pls framework
orte/mca/pls/base

# Implementation of the pml framework
ompi/mca/pml/base

Under these mandated directories, frameworks and/or components may have arbitrary directory structures, however.
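
For example, listing one framework's directory in a source checkout shows its base directory plus one subdirectory per component. The exact set of components varies by Open MPI version and by which support libraries were found at configure time, so the listing below is purely illustrative:

shell$ ls ompi/mca/btl
base/  gm/  mvapi/  mx/  openib/  self/  sm/  tcp/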


34. Is there more information available?

Yes. In early 2006, Cisco hosted an Open MPI workshop where the Open MPI Team provided several days of intensive dive-into-the-code tutorials. The slides from these tutorials are available here.

Additionally, Greenplum videoed several Open MPI developers discussing Open MPI internals in 2012. The videos are available here.


35. More coming...

There are more questions / answers coming... stay tuned...


36. I'm a sysadmin; what do I care about Open MPI?

Several members of the Open MPI team have strong system administrator backgrounds; we recognize the value of having software that is friendly to system administrators. Here are some of the reasons that Open MPI is attractive for system administrators:

  • Simple, standards-based installation
  • Help reduce the number of MPI installations
  • Ability to set system-level and user-level parameters
  • Scriptable information sources about the Open MPI installation

See the rest of the questions in the FAQ section for more details.


37. What hardware / software / run-time environments / networks does Open MPI support?

See this FAQ category for more information


38. Do I need multiple Open MPI installations?

Yes and no.

Open MPI can handle a variety of different run-time environments (e.g., rsh/ssh, SLURM, PBS, etc.) and a variety of different interconnection networks (e.g., ethernet, Myrinet, Infiniband, etc.) in a single installation. Specifically: because Open MPI is fundamentally powered by a component architecture, plug-ins for all these different run-time systems and interconnect networks can be installed in a single installation tree. The relevant plug-ins will only be used in the environments where they make sense.

Hence, there is no need to have one MPI installation for Myrinet, one MPI installation for Ethernet, one MPI installation for PBS, one MPI installation for rsh, etc. Open MPI can handle all of these in a single installation.

However, there are some issues that Open MPI cannot solve. Binary compatibility between different compilers is such an issue. Let's examine this on a per-language basis (be sure to see the big caveat at the end):

  • C: Most C compilers are fairly compatible, such that if you compile Open MPI with one C compiler and link it to an application that was compiled with a different C compiler, everything "should just work." As such, a single installation of Open MPI should work for most C MPI applications.

  • C++: The same is not necessarily true for C++. Most of Open MPI's C++ code is simply the MPI C++ bindings, and in the default build, they are inlined C++ code, meaning that they should compile on any C++ compiler. Hence, you should be able to have one Open MPI installation for multiple different C++ compilers (we'd like to hear feedback either way). That being said, some of the top-level Open MPI executables are written in C++ (e.g., mpicc, ompi_info, etc.). As such, these applications may require the C++ run-time support libraries of whatever compiler they were created with in order to run properly. Specifically, if you compile Open MPI with the XYZ C/C++ compiler, you may need to have the XYZ C++ run-time libraries installed everywhere you want to run mpicc or ompi_info.

  • Fortran 77: Fortran 77 compilers do something called "symbol mangling," meaning that they change the names of global variables, subroutines, and functions. There are 4 common name mangling schemes in use by Fortran 77 compilers. On many systems (e.g., Linux), Open MPI will automatically support all 4 schemes. As such, a single Open MPI installation should just work with multiple different Fortran compilers. However, on some systems, this is not possible (e.g., OS X), and Open MPI will only support the name mangling scheme of the Fortran 77 compiler that was identified during configure. (A quick way to check which scheme a given compiler uses is shown at the end of this answer.)

    Also, there are two notable exceptions that do not work across Fortran compilers that are "different enough":

    1. The C constants MPI_F_STATUS_IGNORE and MPI_F_STATUSES_IGNORE will only compare properly in Fortran applications that were created with Fortran compilers that use the same name-mangling scheme as the Fortran compiler that Open MPI was configured with.

    2. Fortran compilers may have different values for the logical .TRUE. constant. As such, any MPI function that uses the Fortran LOGICAL type may only get .TRUE. values back that correspond to the .TRUE. value of the Fortran compiler that Open MPI was configured with.

  • Fortran 90: Similar to C++, linking object files from different Fortran 90 compilers is not likely to work. The F90 MPI module that Open MPI creates will likely only work with the Fortran 90 compiler that was identified during configure.

The big caveat to all of this is that Open MPI will only work with different compilers if all the datatype sizes are the same. For example, even though Open MPI supports all 4 name mangling schemes, the size of the Fortran LOGICAL type may be 1 byte in some compilers and 4 bytes in others. This will likely cause Open MPI to perform unpredictably.

The bottom line is that Open MPI can support all manner of run-time systems and interconnects in a single installation, but supporting multiple compilers "sort of" works (i.e., is subject to trial and error) in some cases, and definitely does not work in other cases. There's unfortunately little that we can do about this -- it's a compiler compatibility issue, and one that compiler authors have little incentive to resolve.
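
As promised above, here is a quick (if low-level) way to check which name-mangling scheme a given Fortran compiler uses. The file and subroutine names are placeholders: my_sub.f is assumed to contain nothing but an empty subroutine named my_sub, and gfortran stands in for whatever Fortran compiler you are checking:

shell$ gfortran -c my_sub.f
shell$ nm my_sub.o | grep -i my_sub

Depending on the compiler, the symbol will show up as my_sub, my_sub_, my_sub__, or MY_SUB -- the four schemes referred to above.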


39. What are MCA Parameters? Why would I set them?

MCA parameters are a way to tweak Open MPI's behavior at run-time. For example, MCA parameters can specify:

  • Which interconnect networks to use
  • Which interconnect networks not to use
  • The size difference between eager sends and rendezvous protocol sends
  • How many registered buffers to pre-pin (e.g., for GM or mVAPI)
  • The size of the pre-pinned registered buffers
  • ...etc.

It can be quite valuable for a system administrator to play with such values a bit and find an "optimal" setting for a particular operating environment. These values can then be set in a global text file that all users will, by default, inherit when they run Open MPI jobs.

For example, say that you have a cluster with 2 ethernet networks -- one for NFS and other system-level operations, and one for MPI jobs. The system administrator can tell Open MPI to not use the NFS TCP network at a system level, such that when users invoke mpirun or mpiexec to launch their jobs, they will automatically only be using the network meant for MPI jobs.
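
As a concrete illustration of that scenario (assuming the NFS/administrative network is on the eth0 interface, and with my_mpi_app standing in for a real application), a user or administrator could keep MPI traffic off that network for a single job from the command line, or via an environment variable; the system-wide configuration-file approach is shown in the system administrator question below:

shell$ mpirun --mca btl_tcp_if_exclude lo,eth0 -np 4 ./my_mpi_app

shell$ export OMPI_MCA_btl_tcp_if_exclude=lo,eth0
shell$ mpirun -np 4 ./my_mpi_app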

See the run-time tuning FAQ category for information on how to set global MCA parameters.


40. Do my users need to have their own installation of Open MPI?

Usually not. It is typically sufficient for a single Open MPI installation (or perhaps a small number of Open MPI installations, depending on compiler interoperability) to serve an entire parallel operating environment.

Indeed, a system-wide Open MPI installation can be customized on a per-user basis in two important ways:

  • Per-user MCA parameters: Each user can set their own set of MCA parameters, potentially overriding system-wide defaults (see the example below).
  • Per-user plug-ins: Users can install their own Open MPI plug-ins under $HOME/.openmpi/components. Hence, developers can experiment with new components without de-stabilizing the rest of the users on the system. Or power users can download 3rd party components (perhaps even research-quality components) without affecting other users.
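
For instance, per-user MCA parameters are read from a plain text file named $HOME/.openmpi/mca-params.conf. A hypothetical user who wants verbose BTL selection output by default could put the following in that file (btl_base_verbose is used purely as an illustration; use ompi_info to see which parameters your installation actually supports):

shell$ cat $HOME/.openmpi/mca-params.conf
btl_base_verbose = 10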


41. I have power users who will want to override my global MCA parameters; is this possible?

Absolutely.

See the run-time tuning FAQ category for information on how to set MCA parameters, both at the system level and on a per-user (or per-MPI-job) basis.


42. What MCA parameters should I, the system administrator, set?

This is a difficult question and depends on both your specific parallel setup and the applications that typically run there.

The best thing to do is to use the ompi_info command to see what parameters are available and relevant to you. Specifically, ompi_info can be used to show all the parameters that are available for each plug-in. Two common places that system administrators like to tweak are:

  • Only allow specific networks: Say you have a cluster with a high-speed interconnect (such as Myrinet or Infiniband) and an ethernet network. The high-speed network is intended for MPI jobs; the ethernet network is intended for NFS and other administrative-level jobs. In this case, you can simply turn off Open MPI's TCP support. The "btl" framework contains Open MPI's network support; in this case, you want to disable the tcp plug-in. You can do this by adding the following line in the file $prefix/etc/openmpi-mca-params.conf:

    btl = ^tcp
    

    This tells Open MPI to load all BTL components except tcp.

    Consider another example: your cluster has two TCP networks, one for NFS and administration-level jobs, and another for MPI jobs. You can tell Open MPI to ignore the TCP network used by NFS by adding the following line in the file $prefix/etc/openmpi-mca-params.conf:

    btl_tcp_if_exclude = lo,eth0
    

    The value of this parameter is the device names to exclude. In this case, we're excluding lo (localhost, because Open MPI has its own internal loopback device) and eth0.

  • Tune the parameters for specific networks: Each network plug-in has a variety of different tunable parameters. Use the ompi_info command to see what is available. You can show all available parameters with:

    shell$ ompi_info --param all all
    

    Beware: there are many variables available. You can limit the output by showing all the parameters in a specific framework or in a specific plug-in with the command line parameters:

    shell$ ompi_info --param btl all
    

    This shows all the parameters of all BTL components, and:

    shell$ ompi_info --param btl mvapi
    

    This shows all the parameters of just the mvapi BTL component.


43. I just added a new plugin to my Open MPI installation; do I need to recompile all my MPI apps?

If your installation of Open MPI uses shared libraries and components are standalone plug-in files, then no. If you add a new component (such as support for a new network), Open MPI will simply open the new plugin at run-time -- your applications do not need to be recompiled or re-linked.


44. I just upgraded my Myrinet|Infiniband network; do I need to recompile all my MPI apps?

If your installation of Open MPI uses shared libraries and components are standalone plug-in files, then no. You simply need to recompile the Open MPI components that support that network and re-install them.

More specifically, Open MPI shifts the dependency on the underlying network away from the MPI applications and to the Open MPI plug-ins. This is a major advantage over many other MPI implementations.

MPI applications will simply open the new plugin when they run.


45. We just upgraded our version of Open MPI; do I need to recompile all my MPI apps?

It is unlikely. Most MPI applications solely interact with Open MPI through the standardized MPI API and the constant values it publishes in mpi.h. The MPI-2 API will not change until the MPI Forum changes it.

We will try hard to make Open MPI's mpi.h stable such that the values will not change from release-to-release. While we cannot guarantee that they will stay the same forever, we'll try hard to make it so.


46. I have an MPI application compiled for another MPI; will it work with Open MPI?

It is highly unlikely. Open MPI does not attempt to interface with other MPI implementations, nor with executables that were compiled for them. Sorry!

MPI applications need to be compiled and linked with Open MPI in order to run under Open MPI.


47. What is "fault tolerance"?

The phrase "fault tolerance" means many things to many people. Typical definitions range from user processes dumping vital state to disk periodically to checkpoint/restart of running processes to elaborate recreate-process-state-from-incremental-pieces schemes to ... (you get the idea).

In the scope of Open MPI, we typically define "fault tolerance" to mean the ability to recover from one or more component failures in a well defined manner with either a transparent or application-directed mechanism. Component failures may exhibit themselves as a corrupted transmission over a faulty network interface or the failure of one or more serial or parallel processes due to a processor or node failure. Open MPI strives to provide the application with a consistent system view while still providing a production quality, high performance implementation.

Yes, that's pretty much as all-inclusive as possible -- intentionally so! Remember that in addition to being a production-quality MPI implementation, Open MPI is also a vehicle for research. So while some forms of "fault tolerance" are more widely accepted and used, others are certainly of valid academic interest.


48. What fault tolerance techniques does Open MPI plan on supporting?

Open MPI plans on supporting the following fault tolerance techniques:

  • Coordinated and uncoordinated process checkpoint and restart. Similar to those implemented in LAM/MPI and MPICH-V, respectively.
  • Message logging techniques. Similar to those implemented in MPICH-V.
  • Data reliability and network fault tolerance. Similar to those implemented in LA-MPI.
  • User-directed and communicator-driven fault tolerance. Similar to those implemented in FT-MPI.

The Open MPI team does not intend to limit its fault tolerance techniques to those mentioned above, but rather to extend beyond them in the future.


49. Does Open MPI support checkpoint and restart of parallel jobs (similar to LAM/MPI)?

Yes. The v1.3 series was the first release series of Open MPI to include support for the transparent, coordinated checkpointing and restarting of MPI processes (similar to LAM/MPI).

Open MPI supports both the BLCR checkpoint/restart system and a "self" checkpointer that allows applications to perform their own checkpoint/restart functionality while taking advantage of the Open MPI checkpoint/restart infrastructure. For both of these, Open MPI provides a coordinated checkpoint/restart protocol and integration with a variety of network interconnects, including shared memory, Ethernet, InfiniBand, and Myrinet.

The implementation introduces a series of new frameworks and components designed to support a variety of checkpoint and restart techniques. This allows us to support the methods described above (application-directed, BLCR, etc.) as well as other kinds of checkpoint/restart systems (e.g., Condor, libckpt) and protocols (e.g., uncoordinated, message induced).

Note: The checkpoint/restart support was last released as part of the v1.6 series. The v1.7 series and the Open MPI trunk do not support this functionality (most of the code is present in the repository, but it is known to be non-functional in most cases). This feature is looking for a maintainer. Interested parties should inquire on the developers mailing list.


50. Where can I find the fault tolerance development work?

The end-to-end MPI message data reliability work is being actively developed on the subversion trunk (i.e., reliable message passing over unreliable networks). See this FAQ entry for more details.

The coordinated checkpoint and restart process fault tolerance work is currently available on the Open MPI development trunk and in the v1.3 release series. For more information about how to use this feature see the following websites:

For information on the Fault Tolerant MPI prototype in Open MPI see the links below:


51. Does Open MPI support end-to-end data reliability in MPI message passing?

The current release of Open MPI does not support end-to-end data reliability in message passing any more than the underlying network already guarantees. Future releases of Open MPI will include explicit data reliability support (i.e., more functionality than is provided by the underlying network).

Specifically, the data reliability ("dr") PML component (available on the trunk, but not yet in a stable release) assumes that the underlying network is unreliable. It can drop / restart connections, retransmit corrupted or lost data, etc. The end effect is that data sent through MPI API functions will be guaranteed to be reliable.
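
If you are running a build that includes the dr PML (e.g., a trunk build), you could request it explicitly at run time with an MCA parameter. This is only a sketch -- the application name and process count below are placeholders:

# Request the "dr" PML component for this run (hypothetical example)
shell$ mpirun --mca pml dr -np 4 ./my_mpi_app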

For example, if you're using TCP as a message transport, chances of data corruption are fairly low. However, other interconnects do not guarantee that data will be uncorrupted when traveling across the network. Additionally, there are nonzero possibilities that data can be corrupted while traversing PCI buses, etc. (some corruption errors at this level can be caught/fixed, others cannot). Such errors are not uncommon at high altitudes (!).

Note that such added reliability does incur a performance cost -- latency and bandwidth suffer when Open MPI performs the consistency checks that are necessary to provide such guarantees.

Many clusters/networks will not need data reliability. But some do (e.g., those operating at high altitudes). The dr PML is intended for environments where reliability is an issue; users are willing to tolerate slightly slower applications in order to guarantee that their job does not crash (or worse, produce wrong answers).


52. How do I build Open MPI?

If you have obtained a developer's checkout from Subversion, skip this FAQ question and consult these directions.

For everyone else, in general, all you need to do is expand the tarball, run the provided configure script, and then run "make all install". For example:

shell$ gunzip -c openmpi-1.8.1.tar.gz | tar xf -
shell$ cd openmpi-1.8.1
shell$ ./configure --prefix=/usr/local
<...lots of output...>
shell$ make all install

Note that the configure script supports a lot of different command line options. For example, the --prefix option in the above example tells Open MPI to install under the directory /usr/local/.

Other notable configure options are required to support specific network interconnects and back-end run-time environments. More generally, Open MPI supports a wide variety of hardware and environments, but it sometimes needs to be told where support libraries and header files are located.

Consult the README file in the Open MPI tarball and the output of "configure --help" for specific instructions regarding Open MPI's configure command line options.


53. Wow -- I see a lot of errors during configure. Is that normal?

If configure finishes successfully -- meaning that it generates a bunch of Makefiles at the end -- then yes, it is completely normal.

The Open MPI configure script tests for a lot of things, not all of which are expected to succeed. For example, if you do not have Myrinet's GM library installed, you'll see failures about trying to find the GM library. You'll also see errors and warnings about various operating-system-specific tests that are not aimed at the operating system you are running.

These are all normal, expected, and nothing to be concerned about. It just means, for example, that Open MPI will not build Myrinet GM support.


54. What are the default build options for Open MPI?

If you have obtained a developer's checkout from Subversion, you must consult these directions.

The default options for building an Open MPI tarball are:

  • Compile Open MPI with all optimizations enabled
  • Build shared libraries
  • Build components as standalone dynamic shared object (DSO) files (i.e., run-time plugins)
  • Try to find support for all hardware and environments by looking for support libraries and header files in standard locations; skip them if not found

Open MPI's configure script has a large number of options, several of which are of the form --with-<FOO>(=DIR), usually with a corresponding --with-<FOO>-libdir=DIR option. The (=DIR) part means that specifying the directory is optional. Here are some examples (explained in more detail below):

  • --with-openib(=DIR) and --with-openib-libdir=DIR
  • --with-mx(=DIR) and --with-mx-libdir=DIR
  • --with-psm(=DIR) and --with-psm-libdir=DIR
  • ...etc.

As mentioned above, by default, Open MPI will try to build support for every feature that it can find on your system. If support for a given feature is not found, Open MPI will simply skip building support for it (this usually means not building a specific plugin).

"Support" for a given feature usually means finding both the relevant header and library files for that feature. As such, the command-line switches listed above are used to override default behavior and allow specifying whether you want support for a given feature or not, and if you do want support, where the header files and/or library files are located (which is useful if they are not located in compiler/linker default search paths). Specifically:

  • If --without-<FOO> is specified, Open MPI will not even look for support for feature FOO. It will be treated as if support for that feature was not found (i.e., it will be skipped).
  • If --with-<FOO> is specified with no optional directory, Open MPI's configure script will abort if it cannot find support for the FOO feature. More specifically, only compiler/linker default search paths will be searched while looking for the relevant header and library files. This option essentially tells Open MPI, "Yes, I want support for FOO -- it is an error if you don't find support for it."
  • If --with-<FOO>=/some/path is specified, it is essentially the same as specifying --with-<FOO> but also tells Open MPI to add -I/some/path/include to compiler search paths, and try (in order) adding -L/some/path/lib and -L/some/path/lib64 to linker search paths when searching for FOO support. If found, the relevant compiler/linker paths are added to Open MPI's general build flags. This option is helpful when support for feature FOO is not found in default search paths.
  • If --with-<FOO>-libdir=/some/path/lib is specified, it only specifies that if Open MPI searches for FOO support, it should use /some/path/lib for the linker search path.
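
For example, a hypothetical invocation that insists on OpenFabrics support installed under a non-default prefix (the /opt/ofed paths are just placeholders) might look like:

shell$ ./configure --with-openib=/opt/ofed \
  --with-openib-libdir=/opt/ofed/lib64 ...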

In general, it is usually sufficient to run Open MPI's configure script with no --with-<FOO> options if all the features you need supported are in default compiler/linker search paths. If the features you need are not in default compiler/linker search paths, you'll likely need to specify --with-<FOO> kinds of flags. However, note that it is safest to add --with-<FOO> types of flags if you want to guarantee that Open MPI builds support for feature FOO, regardless of whether support for FOO can be found in default compiler/linker paths or not -- configure will abort if it cannot find the appropriate support for FOO. This may be preferable to unexpectedly discovering at run-time that Open MPI is missing support for a critical feature.

Be sure to note the difference in the directory specification between --with-<FOO> and --with-<FOO>-libdir. The former takes a top-level directory (such that "/include", "/lib", and "/lib64" are appended to it) while the latter takes a single directory where the library is assumed to exist (i.e., nothing is suffixed to it).

Finally, note that starting with Open MPI v1.3, configure will sanity check to ensure that any directory given to --with-<FOO> or --with-<FOO>-libdir actually exists and will error if it does not. This prevents typos and mistakes in directory names, and prevents Open MPI from accidentally using a compiler/linker-default path to satisfy FOO's header and library files.


55. Open MPI was pre-installed on my machine; should I overwrite it with a new version?

Probably not.

Many systems come with some version of Open MPI pre-installed (e.g., many Linux distributions, BSD variants, and OS X). If you download a newer version of Open MPI from this web site (or one of the Open MPI mirrors), you probably do not want to overwrite the system-installed Open MPI. This is because the system-installed Open MPI is typically under the control of some software package management system (rpm, yum, etc.).

Instead, you probably want to install your new version of Open MPI to another path, such as /opt/openmpi- (or whatever is appropriate for your system).
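
For example, reusing the version number from the build example earlier in this FAQ (the prefix is just an illustration):

shell$ ./configure --prefix=/opt/openmpi-1.8.1
shell$ make all install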

This FAQ entry also has much more information about strategies for where to install Open MPI.


56. Where should I install Open MPI?

A common environment to run Open MPI is in a "Beowulf"-class or similar cluster (e.g., a bunch of 1U servers in a bunch of racks). Simply stated, Open MPI can run on a group of servers or workstations connected by a network. As mentioned above, there are several prerequisites, however (for example, you typically must have an account on all the machines, and you must be able to ssh or rsh between the nodes without using a password, etc.).

This raises the question for Open MPI system administrators: where to install the Open MPI binaries, header files, etc.? This discussion mainly addresses this question for homogeneous clusters (i.e., where all nodes and operating systems are the same), although elements of this discussion apply to heterogeneous clusters as well. Heterogeneous admins are encouraged to read this discussion and then see the heterogeneous section of this FAQ.

There are two common approaches:

  1. Have a common filesystem, such as NFS, between all the machines to be used. Install Open MPI such that the installation directory is the same value on each node. This will greatly simplify users' shell startup scripts (e.g., .bashrc, .cshrc, .profile, etc.) -- the PATH can be set without checking which machine the user is on (see the example after this list). It also simplifies the system administrator's job; when the time comes to patch or otherwise upgrade OMPI, only one copy needs to be modified.

    For example, consider a cluster of four machines: inky, blinky, pinky, and clyde.

    • Install Open MPI on inky's local hard drive in the directory /opt/openmpi-1.8.1. The system administrator then mounts inky:/opt/openmpi-1.8.1 on the remaining three machines, such that /opt/openmpi-1.8.1 on all machines is effectively "the same". That is, the following directories all contain the Open MPI installation:

      inky:/opt/openmpi-1.8.1
      blinky:/opt/openmpi-1.8.1
      pinky:/opt/openmpi-1.8.1
      clyde:/opt/openmpi-1.8.1
      

    • Install Open MPI on inky's local hard drive in the directory /usr/local/openmpi-1.8.1. The system administrator then mounts inky:/usr/local/openmpi-1.8.1 on all four machines in some other common location, such as /opt/openmpi-1.8.1 (a symbolic link can be installed on inky instead of a mount point for efficiency). This strategy is typically used in environments where one tree is NFS-exported, but a different tree is used as the actual installation location. For example, the following directories all contain the Open MPI installation:

      inky:/opt/openmpi-1.8.1
      blinky:/opt/openmpi-1.8.1
      pinky:/opt/openmpi-1.8.1
      clyde:/opt/openmpi-1.8.1
      

      Notice that there are the same four directories as the previous example, but on inky, the directory is actually located in /usr/local/openmpi-1.8.1.

    There is a bit of a disadvantage in this approach; each of the remote nodes has to incur NFS (or whatever networked filesystem is used) delays to access the Open MPI directory tree. However, the ease of administration and the relatively low cost of using a networked filesystem usually greatly outweigh this penalty. Indeed, once an MPI application is past MPI_INIT, it doesn't use the Open MPI binaries very much.

    NOTE: Open MPI, by default, uses a plugin system for loading functionality at run-time. Most of Open MPI's plugins are opened during the call to MPI_INIT. This can cause a lot of filesystem traffic, which, if Open MPI is installed on a networked filesystem, may be noticeable. Two common options to avoid this extra filesystem traffic are to build Open MPI to not use plugins (see this FAQ entry for details) or to install Open MPI locally (see below).

  2. If you are concerned with networked filesystem costs of accessing the Open MPI binaries, you can install Open MPI on the local hard drive of each node in your system. Again, it is highly advisable to install Open MPI in the same directory on each node so that each user's PATH can be set to the same value, regardless of the node that a user has logged on to.

    This approach will save some network latency of accessing the Open MPI binaries, but is typically only used where users are very concerned about squeezing every spare cycle out of their machines, or are running at extreme scale where a networked filesystem may get overwhelmed by filesystem requests for Open MPI binaries when running very large parallel jobs.
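
Whichever approach you choose, each user's shell startup file can then contain something like the following (a sketch; bash syntax is assumed, and the prefix matches the example installation above):

# Example ~/.bashrc additions, assuming the common prefix /opt/openmpi-1.8.1
export PATH=/opt/openmpi-1.8.1/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi-1.8.1/lib:$LD_LIBRARY_PATH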


57. Should I install a new version of Open MPI over an old version?

We do not recommend this.

Before discussing specifics, here are some definitions that are necessary to understand:

  • Source tree: The tree where the Open MPI source code is located. It is typically the result of expanding an Open MPI distribution source code bundle, such as a tarball.
  • Build tree: The tree where Open MPI was built. It is always related to a specific source tree, but may actually be a different tree (since Open MPI supports VPATH builds). Specifically, this is the tree where you invoked configure, make, etc. to build and install Open MPI.
  • Installation tree: The tree where Open MPI was installed. It is typically the "prefix" argument given to Open MPI's configure script; it is the directory from which you run installed Open MPI executables.

In its default configuration, an Open MPI installation consists of several shared libraries, header files, executables, and plugins (dynamic shared objects -- DSOs). These installation files act together as a single entity. The specific filenames and contents of these files are subject to change between different versions of Open MPI.

KEY POINT: Installing one version of Open MPI does not uninstall another version.

If you install a new version of Open MPI over an older version, this may not remove or overwrite all the files from the older version. Hence, you may end up with an incompatible muddle of files from two different installations -- which can cause problems.

The Open MPI team recommends one of the following methods for upgrading your Open MPI installation:

  • Install newer versions of Open MPI into a different directory. For example, install into /opt/openmpi-a.b.c and /opt/openmpi-x.y.z for versions a.b.c and x.y.z, respectively.
  • Completely uninstall the old version of Open MPI before installing the new version. The make uninstall process from the Open MPI a.b.c build tree should completely uninstall that version from the installation tree, making it safe to install a new version (e.g., version x.y.z) into the same installation tree (a sketch of this flow is shown after this list).
  • Remove the old installation directory entirely and then install the new version. For example, "rm -rf /opt/openmpi" (assuming that there is nothing else of value in this tree!). The installation of Open MPI x.y.z will safely re-create the /opt/openmpi tree. This method is preferable if you no longer have the source and build trees to Open MPI a.b.c available from which to "make uninstall".
  • Go into the Open MPI a.b.c installation directory and manually remove all old Open MPI files. Then install Open MPI x.y.z into the same installation directory. This can be a somewhat painful, annoying, and error-prone process. We do not recommend it. Indeed, if you no longer have access to the original Open MPI a.b.c source and build trees, it may be far simpler to download Open MPI version a.b.c again from the Open MPI web site, configure it with the same installation prefix, and then run "make uninstall". Or use one of the other methods, above.
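
For example, a sketch of the "uninstall, then install" flow described above (all paths and version numbers below are placeholders):

# From the old version's build tree:
shell$ cd /path/to/openmpi-a.b.c/build
shell$ make uninstall
# Then build and install the new version into the same prefix:
shell$ cd /path/to/openmpi-x.y.z
shell$ ./configure --prefix=/opt/openmpi
shell$ make all install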


58. Can I disable Open MPI's use of plugins?

Yes.

Open MPI uses plugins for much of its functionality. Specifically, Open MPI looks for and loads plugins as dynamic shared objects (DSOs) during the call to MPI_INIT. However, these plugins can be compiled and installed in several different ways:

  1. As DSOs: In this mode (the default), each of Open MPI's plugins is compiled as a separate DSO that is dynamically loaded at run time.
    • Advantage: this approach is highly flexible -- it gives system developers and administrators a fine-grained way to add new plugins to an existing Open MPI installation, and also allows the removal of old plugins (i.e., forcibly disallowing the use of specific plugins) simply by removing the corresponding DSO(s).
    • Disadvantage: this approach causes additional filesystem traffic (mostly during MPI_INIT). If Open MPI is installed on a networked filesystem, this can cause noticeable network traffic when a large parallel job starts, for example.
  2. As part of a larger library: In this mode, Open MPI "slurps up" the plugins and includes them in libmpi (and other libraries). Hence, all plugins are included in the main Open MPI libraries that are loaded by the system linker before an MPI process even starts.
    • Advantage: Significantly less filesystem traffic than the DSO approach. This model can be much more performant on network installations of Open MPI.
    • Disadvantage: Much less flexible than the DSO approach; system administrators and developers have significantly less ability to add/remove plugins from the Open MPI installation at run-time. Note that you still have some ability to add/remove plugins (see below), but there are limitations to what can be done.

To be clear: Open MPI's plugins can be built either as standalone DSOs or included in Open MPI's main libraries (e.g., libmpi). Additionally, Open MPI's main libraries can be built either as static or shared libraries.

You can therefore choose to build Open MPI in one of several different ways:

  1. --disable-mca-dso: Using the --disable-mca-dso switch to Open MPI's configure script will cause all plugins to be built as part of Open MPI's main libraries -- they will not be built as standalone DSOs. However, Open MPI will still look for DSOs in the filesystem at run-time. Specifically: this option significantly decreases (but does not eliminate) filesystem traffic during MPI_INIT, but does allow the flexibility of adding new plugins to an existing Open MPI installation.

    Note that the --disable-mca-dso option does not affect whether Open MPI's main libraries are built as static or shared.

  2. --enable-static: Using this option to Open MPI's configure script will cause the building of static libraries (e.g., libmpi.a). This option automatically implies --disable-mca-dso.

    Note that --enable-shared is also the default; so if you use --enable-static, Open MPI will build both static and shared libraries that contain all of Open MPI's plugins (i.e., libmpi.so and libmpi.a). If you want only static libraries (that contain all of Open MPI's plugins), be sure to also use --disable-shared.

  3. --disable-dlopen: Using this option to Open MPI's configure script will do two things:
    1. Imply --disable-mca-dso, meaning that all plugins will be slurped into Open MPI's libraries.
    2. Cause Open MPI to not look for / open any DSOs at run time.

    Specifically: this option makes Open MPI not incur any additional filesystem traffic during MPI_INIT. Note that the --disable-dlopen option does not affect whether Open MPI's main libraries are built as static or shared.
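
To summarize, hypothetical configure invocations for the three approaches above might look like:

# Plugins rolled into the libraries, but DSOs still opened at run-time:
shell$ ./configure --disable-mca-dso ...
# Static-only libraries (implies --disable-mca-dso):
shell$ ./configure --enable-static --disable-shared ...
# Plugins rolled in AND no DSOs opened at run-time:
shell$ ./configure --disable-dlopen ...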


59. How do I build an optimized version of Open MPI?

If you have obtained a developer's checkout from Subversion (or Mercurial), you must consult these directions.

Building Open MPI from a tarball defaults to building an optimized version. There is no need to do anything special.


60. Are VPATH and/or parallel builds supported?

Yes, both VPATH and parallel builds are supported. This allows Open MPI to be built in a different directory than where its source code resides (helpful for multi-architecture builds). Open MPI uses Automake for its build system, which supports both.

For example:

shell$ gtar zxf openmpi-1.2.3.tar.gz
shell$ cd openmpi-1.2.3
shell$ mkdir build
shell$ cd build
shell$ ../configure ...
<... lots of output ...>
shell$ make -j 4

Running configure from a different directory from where it actually resides triggers the VPATH build (i.e., Open MPI will configure and build itself in the directory where configure was invoked, not in the directory where the configure script resides).

Some versions of make support parallel builds. The example above shows GNU make's "-j" option, which specifies how many compile processes may be executing at any given time. We, the Open MPI Team, have found that setting this to two to four times the number of processors in a machine can significantly speed up an Open MPI compile (since compiles tend to be much more IO-bound than CPU-bound).


61. Do I need any special tools to build Open MPI?

If you are building Open MPI from a tarball, you need a C compiler, a C++ compiler, and make. If you are building the Fortran 77 and/or Fortran 90 MPI bindings, you will need compilers for these languages as well. You do not need any special version of the GNU "Auto" tools (Autoconf, Automake, Libtool).

If you are building Open MPI from a Subversion checkout, you need some additional tools. See the Subversion access pages for more information.


62. How do I build Open MPI as a static library?

As noted above, Open MPI defaults to building shared libraries and building components as dynamic shared objects (DSOs, i.e., run-time plugins). Changing this build behavior is controlled via command line options to Open MPI's configure script.

Building static libraries: You can disable building shared libraries and enable building static libraries with the following options:

shell$ ./configure --enable-static --disable-shared ...

Similarly, you can build both static and shared libraries by simply specifying --enable-static (and not specifying --disable-shared), if desired.

Including components in libraries: Instead of building components as DSOs, they can also be "rolled up" and included in their respective libraries (e.g., libmpi). This is controlled with the --enable-mca-static option. Some examples:

shell$ ./configure --enable-mca-static=pml ...
shell$ ./configure --enable-mca-static=pml,btl-openib,btl-self ...

Specifically, entire frameworks and/or individual components can be specified to be rolled up into the library in a comma-separated list as an argument to --enable-mca-static.


63. When I run 'make', it looks very much like the build system is going into a loop.

Open MPI uses the GNU Automake software to build itself. Automake uses a tightly-woven set of file timestamp-based dependencies to compile and link software. This apparent looping behavior, frequently paired with messages similar to:

Warning: File `Makefile.am' has modification time 3.6e+04 s in the future

typically means that you are building on a networked filesystem where the local time of the client machine that you are building on does not match the time on the network filesystem server. This will result in files with incorrect timestamps, and Automake degenerates into undefined behavior.

Two solutions are possible:

  1. Ensure that the time between your network filesystem server and client(s) is the same. This can be accomplished in a variety of ways and is dependent upon your local setup; one method is to use an NTP daemon to synchronize all machines to a common time server.
  2. Build on a local disk filesystem where network timestamps are not a factor.

After implementing one of the two options, you will likely need to re-run configure. Then Open MPI should build successfully.


64. Configure issues warnings about sed and unterminated commands

Some users have reported seeing warnings like this in the final output from configure:

*** Final output 
configure: creating ./config.status 
config.status: creating ompi/include/ompi/version.h 
sed: file ./confstatA1BhUF/subs-3.sed line 33: unterminated `s' command 
sed: file ./confstatA1BhUF/subs-4.sed line 4: unterminated `s' command 
config.status: creating orte/include/orte/version.h 

These messages usually indicate a problem in the user's local shell configuration. Ensure that when you run a new shell, no output is sent to stdout. For example, if the output of this simple shell script is more than just the hostname of your computer, you need to go check your shell startup files to see where the extraneous output is coming from (and eliminate it):


#!/bin/sh
`hostname`
exit 0


65. Open MPI configured ok, but I get "Makefile:602: *** missing separator" kinds of errors when building

This is usually an indication that configure succeeded but really shouldn't have. See this FAQ entry for one possible cause.


66. Open MPI seems to default to building with the GNU compiler set. Can I use other compilers?

Yes.

Open MPI uses a standard Autoconf "configure" script to probe the current system and figure out how to build itself. One of the choices it makes is which compiler set to use. Since Autoconf is a GNU product, it defaults to the GNU compiler set. However, this is easily overridden on the configure command line. For example, to build Open MPI with the Intel compiler suite:

shell$ ./configure CC=icc CXX=icpc F77=ifort FC=ifort ...

Note that you can include additional parameters to configure, implied by the "..." clause in the example above.

In particular, 4 switches on the configure command line are used to specify the compiler suite:

  • CC: Specifies the C compiler
  • CXX: Specifies the C++ compiler
  • F77: Specifies the Fortran 77 compiler
  • FC: Specifies the Fortran 90 compiler

NOTE: The Open MPI team recommends using a single compiler suite whenever possible. Unexpected or undefined behavior can occur when you mix compiler suites in unsupported ways (e.g., mixing Fortran 77 and Fortran 90 compilers between different compiler suites is almost guaranteed not to work).

Here are some more examples for common compilers:

# Portland compilers
shell$ ./configure CC=pgcc CXX=pgCC F77=pgf77 FC=pgf90
# Pathscale compilers
shell$ ./configure CC=pathcc CXX=pathCC F77=pathf90 FC=pathf90
# Oracle Solaris Studio (Sun) compilers
shell$ ./configure CC=cc CXX=CC F77=f77 FC=f90

In all cases, the compilers must be found in your PATH and be able to successfully compile and link non-MPI applications before Open MPI will be able to be built properly.


67. Can I pass specific flags to the compilers / linker used to build Open MPI?

Yes.

Open MPI uses a standard Autoconf configure script to set itself up for building. As such, there are a number of command line options that can be passed to configure to customize flags that are passed to the underlying compiler to build Open MPI:

  • CFLAGS: Flags passed to the C compiler.
  • CXXFLAGS: Flags passed to the C++ compiler.
  • FFLAGS: Flags passed to the Fortran 77 compiler.
  • FCFLAGS: Flags passed to the Fortran 90 compiler.
  • LDFLAGS: Flags passed to the linker (not language-specific). This flag is rarely required; Open MPI will usually pick up all LDFLAGS that it needs by itself.
  • LIBS: Extra libraries to link to Open MPI (not language-specific). This flag is rarely required; Open MPI will usually pick up all LIBS that it needs by itself.
  • LD_LIBRARY_PATH: Note that we do not recommend setting LD_LIBRARY_PATH via configure, but it is worth noting that you should ensure that your LD_LIBRARY_PATH value is appropriate for your build. Some users have been tripped up, for example, by specifying a non-default Fortran compiler to FC and F77, but then having Open MPI's configure script fail because the LD_LIBRARY_PATH wasn't set properly to point to that Fortran compiler's support libraries.

Note that the flags you specify must be compatible across all the compilers. In particular, flags specified to one language compiler must generate code that can be compiled and linked against code that is generated by the other language compilers. For example, on a 64 bit system where the compiler default is to build 32 bit executables:

# Assuming the GNU compiler suite
shell$ ./configure CFLAGS=-m64 ...

will produce 64 bit C objects, but 32 bit objects for C++, Fortran 77, and Fortran 90. These objects will be incompatible with each other, and Open MPI will not build successfully. Instead, you must specify building 64 bit objects for all languages:

# Assuming the GNU compiler suite
shell$ ./configure CFLAGS=-m64 CXXFLAGS=-m64 FFLAGS=-m64 FCFLAGS=-m64 ...

The above command line will pass "-m64" to all four compilers, and therefore will produce 64 bit objects for all languages.


68. I'm trying to build with the Intel compilers, but Open MPI eventually fails to compile with really long error messages. What do I do?

A common mistake when building Open MPI with the Intel compiler suite is to accidentally specify the Intel C compiler as the C++ compiler. Specifically, recent versions of the Intel compiler renamed the C++ compiler "icpc" (it used to be "icc", the same as the C compiler). Users accustomed to the old name tend to specify "icc" as the C++ compiler, which will then cause a failure late in the Open MPI build process because a C++ code will be compiled with the C compiler. Bad Things then happen.

The solution is to be sure to specify that the C++ compiler is "icpc", not "icc". For example:

shell$ ./configure CC=icc CXX=icpc F77=ifort FC=ifort ...

For Googling purposes, here are some of the error messages that may be issued when Open MPI's C++ code is compiled with the Intel C compiler (icc), in no particular order:

IPO Error: unresolved : _ZNSsD1Ev
IPO Error: unresolved : _ZdlPv
IPO Error: unresolved : _ZNKSs4sizeEv
components.o(.text+0x17): In function `ompi_info::open_components()':
: undefined reference to `std::basic_string, std::allocator >::basic_string()'
components.o(.text+0x64): In function `ompi_info::open_components()':
: undefined reference to `std::basic_string, std::allocator >::basic_string()'
components.o(.text+0x70): In function `ompi_info::open_components()':
: undefined reference to `std::string::size() const'
components.o(.text+0x7d): In function `ompi_info::open_components()':
: undefined reference to `std::string::reserve(unsigned int)'
components.o(.text+0x8d): In function `ompi_info::open_components()':
: undefined reference to `std::string::append(char const*, unsigned int)'
components.o(.text+0x9a): In function `ompi_info::open_components()':
: undefined reference to `std::string::append(std::string const&)'
components.o(.text+0xaa): In function `ompi_info::open_components()':
: undefined reference to `std::string::operator=(std::string const&)'
components.o(.text+0xb3): In function `ompi_info::open_components()':
: undefined reference to `std::basic_string, std::allocator >::~basic_string()'

There are many more error messages, but the above should be sufficient for someone trying to find this FAQ entry via a web crawler.


69. When I build with the Intel compiler suite, linking user MPI applications with the wrapper compilers results in warning messages. What do I do?

When Open MPI was built with some versions of the Intel compilers on some platforms, you may see warnings similar to the following when compiling MPI applications with Open MPI's wrapper compilers:

shell$ mpicc hello.c -o hello
libimf.so: warning: warning: feupdateenv is not implemented and will always fail
shell$ 

This warning is generally harmless, but it can be alarming to some users. To remove this warning, pass either the -shared-intel or -i-dynamic options when linking your MPI application (the specific option depends on your version of the Intel compilers; consult your local documentation):

shell$ mpicc hello.c -o hello -shared-intel
shell$ 

You can also change the default behavior of Open MPI's wrapper compilers to automatically include this -shared-intel flag so that it is unnecessary to specify it on the command line when linking MPI applications.
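
One way to do this (a sketch; the exact flag depends on your Intel compiler version) is to bake the flag into the wrapper compilers when Open MPI itself is configured:

shell$ ./configure CC=icc CXX=icpc F77=ifort FC=ifort \
  --with-wrapper-ldflags=-shared-intel ...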


70. I'm trying to build with the IBM compilers, but Open MPI eventually fails to compile. What do I do?

Unfortunately there are some problems between Libtool (which Open MPI uses for library support) and the IBM compilers when creating shared libraries. Currently the only workaround is to disable shared libraries and build Open MPI statically. For example:

shell$ ./configure CC=xlc CXX=xlc++ F77=xlf FC=xlf90 --disable-shared --enable-static ...

For Googling purposes, here are some error messages that may be issued when the build fails:

xlc: 1501-216 command option --whole-archive is not recognized - passed to ld
xlc: 1501-216 command option --no-whole-archive is not recognized - passed to ld
xlc: 1501-218 file libopen-pal.so.0 contains an incorrect file suffix
xlc: 1501-228 input file libopen-pal.so.0 not found


71. I'm trying to build with the Oracle Solaris Studio (Sun) compilers on Linux, but Open MPI eventually fails to compile. What do I do?

Below are some known issues that impact Oracle Solaris Studio 12 Open MPI builds. The easiest way to work around them is simply to use the latest version of the Oracle Solaris Studio 12 compilers.


72. What configure options should I use when building with the Oracle Solaris Studio (Sun) compilers?

The below configure options are suggested for use with the Oracle Solaris Studio (Sun) compilers:

--enable-heterogeneous
--enable-cxx-exceptions
--enable-shared
--enable-orterun-prefix-by-default
--enable-mpi-f90
--with-mpi-f90-size=small
--disable-mpi-threads
--disable-progress-threads
--disable-debug

Linux only:

--with-openib
--without-udapl
--disable-openib-ibcm (only in v1.5.4 and earlier)

Solaris x86 only:

CFLAGS="-xtarget=generic -xarch=sse2 -xprefetch -xprefetch_level=2 -xvector=simd -xdepend=yes -xbuiltin=%all -xO5"
FFLAGS="-xtarget=generic -xarch=sse2 -xprefetch -xprefetch_level=2 -xvector=simd -stackvar -xO5"

Solaris SPARC only:

CFLAGS="-xtarget=ultra3 -m32 -xarch=sparcvis2 -xprefetch -xprefetch_level=2 -xvector=lib -xdepend=yes -xbuiltin=%all -xO5"
FFLAGS="-xtarget=ultra3 -m32 -xarch=sparcvis2 -xprefetch -xprefetch_level=2 -xvector=lib -stackvar -xO5"


73. When building with the Oracle Solaris Studio 12 Update 1 (Sun) compilers on x86 Linux, the compiler loops on btl_sm.c. Is there a workaround?

Apply Sun patch 141859-04.

You may also consider updating your Oracle Solaris Studio compilers to the latest Oracle Solaris Studio Express.


74. How do I build OpenMPI on IBM QS22 cell blade machines with GCC and XLC/XLF compilers?

You can use the following two scripts (contributed by IBM) to build Open MPI on QS22.

Script to build OpenMPI using the GCC compiler

#!/bin/bash
export PREFIX=/usr/local/openmpi-1.2.7_gcc

./configure \
        CC=ppu-gcc CPP=ppu-cpp CXX=ppu-c++ CFLAGS=-m64 \
        CXXFLAGS=-m64 FC=ppu-gfortran  FCFLAGS=-m64 \
        FFLAGS=-m64 CCASFLAGS=-m64 LDFLAGS=-m64 \
        --prefix=$PREFIX \
        --with-platform=optimized \
        --disable-mpi-profile \
        --with-openib=/usr \
        --enable-ltdl-convenience \
        --with-wrapper-cflags=-m64 \
        --with-wrapper-ldflags=-m64 \
        --with-wrapper-fflags=-m64 \
        --with-wrapper-fcflags=-m64

make
make install

cat <<EOF >> $PREFIX/etc/openmpi-mca-params.conf
mpi_paffinity_alone = 1
mpi_leave_pinned = 1
btl_openib_want_fork_support = 0
EOF

cp config.status $PREFIX/config.status


Script to build OpenMPI using XLC and XLF compilers

#!/bin/bash
#
export PREFIX=/usr/local/openmpi-1.2.7_xl

./configure --prefix=$PREFIX \
            --with-platform=optimized \
            --disable-shared --enable-static \
            CC=ppuxlc CXX=ppuxlc++ F77=ppuxlf FC=ppuxlf90 LD=ppuld \
            --disable-mpi-profile \
            --disable-heterogeneous \
            --with-openib=/usr \
            CFLAGS="-q64 -O3 -qarch=cellppu -qtune=cellppu" \
            CXXFLAGS="-q64 -O3 -qarch=cellppu -qtune=cellppu" \
            FFLAGS="-q64 -O3 -qarch=cellppu -qtune=cellppu" \
            FCFLAGS="-q64 -O3 -qarch=cellppu -qtune=cellppu" \
            CCASFLAGS="-q64 -O3 -qarch=cellppu -qtune=cellppu" \
            LDFLAGS="-q64 -O3 -qarch=cellppu -qtune=cellppu" \
            --enable-ltdl-convenience \
            --with-wrapper-cflags="-q64 -O3 -qarch=cellppu -qtune=cellppu" \
            --with-wrapper-ldflags="-q64 -O3 -qarch=cellppu -qtune=cellppu" \
            --with-wrapper-fflags="-q64 -O3 -qarch=cellppu -qtune=cellppu" \
            --with-wrapper-fcflags="-q64 -O3 -qarch=cellppu -qtune=cellppu" \
            --enable-contrib-no-build=libnbc,vt

make
make install

cat <<EOF >> $PREFIX/etc/openmpi-mca-params.conf
mpi_paffinity_alone = 1
mpi_leave_pinned = 1
btl_openib_want_fork_support = 0
EOF

cp config.status $PREFIX/config.status


75. I'm trying to build with the PathScale 3.0 and 3.1 compilers on Linux, but all Open MPI commands seg fault. What do I do?

The PathScale compiler authors have identified a bug in the v3.0 and v3.1 versions of their compiler; you must disable certain "builtin" functions when building Open MPI:

  1. With PathScale 3.0 and 3.1 compilers use the workaround options -O2 and -fno-builtin in CFLAGS across the Open MPI build. For example:

    shell$ ./configure CFLAGS="-O2 -fno-builtin" ...
    

  2. With PathScale 3.2 beta and later, no workaround options are required.


76. All MPI C++ API functions return errors (or otherwise fail) when Open MPI is compiled with the PathScale compilers. What do I do?

This is an old issue that seems to be a problem when Pathscale uses a back-end GCC 3.x compiler. Here's a proposed solution from the Pathscale support team (from July 2010):

The proposed work-around is to install gcc-4.x on the system and use the pathCC -gnu4 option. Newer versions of the compiler (4.x and beyond) should have this fixed, but we'll have to test to confirm it's actually fixed and working correctly.

We don't anticipate that this will be much of a problem for Open MPI users these days (our informal testing shows that not many users are still using GCC 3.x), but this information is provided so that it is Google-able for those still using older compilers.


77. How do I build Open MPI with support for Open IB (Infiniband), mVAPI (Infiniband), GM (Myrinet), and/or MX (Myrinet)?

To build support for high-speed interconnect networks, you generally only have to specify the directory where its support header files and libraries were installed to Open MPI's configure script. You can specify where multiple packages were installed if you have support for more than one kind of interconnect -- Open MPI will build support for as many as it can.

You tell configure where support libraries are with the appropriate --with command line switch. Here is the list of available switches:

  • --with-openib=<dir>: Build support for OpenFabrics networks (previously known as "Open IB"), i.e., InfiniBand and iWARP -- note that iWARP support was added in the v1.3 series.
  • --with-mvapi=<dir>: Build support for mVAPI (Infiniband -- note that mVAPI support has been removed in the v1.3 series).
  • --with-gm=<dir>: Build support for GM (Myrinet).
  • --with-mx=<dir>: Build support for MX (Myrinet).

For example:

shell$ ./configure --with-mvapi=/path/to/mvapi/installation \
  --with-gm=/path/to/gm/installation

These switches enable Open MPI's configure script to automatically find all the right header files and libraries to support the various networks that you specified.

You can verify that configure found everything properly by examining its output -- it will test for each network's header files and libraries and report whether it will build support (or not) for each of them. Examining configure's output is the first place you should look if you have a problem with Open MPI not correctly supporting a specific network type.

If configure indicates that support for your networks will be included, after you build and install Open MPI, you can run the "ompi_info" command and look for components for your networks. The v1.2 (and earlier) series has two openib components (your exact version numbers may be different):

shell$ ompi_info | grep openib
               MCA mpool: openib (MCA v1.0, API v1.0, Component v1.0)
                 MCA btl: openib (MCA v1.0, API v1.0, Component v1.0)

mVAPI components will be named "mvapi", GM components will be named "gm", and MX components will be named "mx".

Note that the v1.3 series removed the "openib" mpool component and also removed all support for mVAPI.


78. How do I build Open MPI with support for SLURM / XGrid?

SLURM support is built automatically; there is nothing that you need to do.

XGrid support is built automatically if the XGrid tools are installed.
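
As with the other run-time environments described below, you can sanity-check an installation by looking for the corresponding components in the ompi_info output (component names and versions vary across Open MPI releases):

shell$ ompi_info | grep slurm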


79. How do I build Open MPI with support for SGE?

Support for SGE first appeared in the Open MPI v1.2 series. The method for configuring it is slightly different between Open MPI v1.2 and v1.3.

For Open MPI v1.2, no extra configure arguments are needed as SGE support is built in automatically. After Open MPI is installed, you should see two components named gridengine.

shell$ ompi_info | grep gridengine
                 MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.5)
                 MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.5)

For Open MPI v1.3, you need to explicitly request the SGE support with the "--with-sge" command line switch to the Open MPI configure script. For example:

shell$ ./configure --with-sge

After Open MPI is installed, you should see one component named gridengine.

shell$ ompi_info | grep gridengine
                 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)

Open MPI v1.3 only has the one specific gridengine component as the other functionality was rolled into other components.

Component versions may vary depending on the version of Open MPI 1.2 or 1.3 you are using.


80. How do I build Open MPI with support for PBS Pro / Open PBS / Torque?

Support for PBS Pro, Open PBS, and Torque must be explicitly requested with the "--with-tm" command line switch to Open MPI's configure script. In general, the procedure is the same as building support for high-speed interconnect networks, except that you use --with-tm. For example:

shell$ ./configure --with-tm=/path/to/pbs_or_torque/installation

After Open MPI is installed, you should see two components named "tm":

shell$ ompi_info | grep tm
                 MCA pls: tm (MCA v1.0, API v1.0, Component v1.0)
                 MCA ras: tm (MCA v1.0, API v1.0, Component v1.0)

Specific frameworks and version numbers may vary, depending on your version of Open MPI.

NOTE: Update to the note below (May 2006): Torque 2.1.0p0 now includes support for shared libraries, and the workarounds listed below are no longer necessary. However, this version of Torque changed other things that require upgrading Open MPI to 1.0.3 or higher (as of this writing, v1.0.3 has not yet been released -- nightly snapshot tarballs of what will become 1.0.3 are available at http://www.open-mpi.org/nightly/v1.0/).

NOTE: As of this writing (October 2006), Open PBS and PBS Pro do not ship shared libraries (i.e., they only include static libraries). Because of this, you may run into linking errors when Open MPI tries to create dynamic plugin components for TM support on some platforms. Notably, on at least some 64 bit Linux platforms (e.g., AMD64), trying to create a dynamic plugin that links against a static library will result in error messages such as:

relocation R_X86_64_32S against `a local symbol' can not be used when
making a shared object; recompile with -fPIC

Note that recent versions of Torque (as of October 2006) have started shipping shared libraries and this issue does not occur.

There are two possible solutions in Open MPI 1.0.x:

  1. Recompile your PBS implementation with "-fPIC" (or whatever the relevant flag is for your compiler to generate position-independent code) and re-install. This will allow Open MPI to generate dynamic plugins with the PBS/Torque libraries properly.

    PRO: Open MPI enjoys the benefits of shared libraries and dynamic plugins.

    CON: Dynamic plugins can use more memory at run-time (e.g., operating systems tend to align each plugin on a page, rather than densely packing them all into a single library).

    CON: This is not possible for binary-only vendor distributions (such as PBS Pro).

  2. Configure Open MPI to build a static library that includes all of its components. Specifically, all of Open MPI's components will be included in its libraries -- none will be discovered and opened at run-time. This does not affect user MPI code at all (i.e., the location of Open MPI's plugins is transparent to MPI applications). Use the following options to Open MPI's configure script:

    shell$ ./configure --disable-shared --enable-static ...
    

    Note that this option only changes the location of Open MPI's default set of plugins (i.e., they are included in libmpi and friends rather than being standalone dynamic shared objects that are found/opened at run-time). This option does not change the fact that Open MPI will still try to open other dynamic plugins at run-time.

    PRO: This works with binary-only vendor distributions (e.g., PBS Pro).

    CON: User applications are statically linked to Open MPI; if Open MPI -- or any of its default set of components -- is updated, users will need to re-link their MPI applications.

Both methods work equally well, but there are tradeoffs; each site will likely need to make its own determination of which to use.


81. How do I build Open MPI with support for LoadLeveler?

Support for LoadLeveler will be automatically built if the LoadLeveler libraries and headers are in the default path. If not, support must be explicitly requested with the "--with-loadleveler" command line switch to Open MPI's configure script. In general, the procedure is the same as building support for high-speed interconnect networks, except that you use --with-loadleveler. For example:

shell$ ./configure --with-loadleveler=/path/to/LoadLeveler/installation

After Open MPI is installed, you should see one or more components named "loadleveler":

shell$ ompi_info | grep loadleveler
                 MCA ras: loadleveler (MCA v1.0, API v1.3, Component v1.3)

Specific frameworks and version numbers may vary, depending on your version of Open MPI.


82. How do I build Open MPI with support for Platform LSF?

Note that only Platform LSF 7.0.2 and later is supported.

Support for LSF will be automatically built if the LSF libraries and headers are in the default path. If not, support must be explicitly requested with the "--with-lsf" command line switch to Open MPI's configure script. In general, the procedure is the same as building support for high-speed interconnect networks, except that you use --with-lsf. For example:

shell$ ./configure --with-lsf=/path/to/lsf/installation

After Open MPI is installed, you should see a component named "lsf":

shell$ ompi_info | grep lsf
                 MCA ess: lsf (MCA v2.0, API v1.3, Component v1.3)
                 MCA ras: lsf (MCA v2.0, API v1.3, Component v1.3)
                 MCA plm: lsf (MCA v2.0, API v1.3, Component v1.3)

Specific frameworks and version numbers may vary, depending on your version of Open MPI.


83. How do I build Open MPI with processor affinity support?

Open MPI currently only supports processor affinity for some platforms. In general, processor affinity will automatically be built if it is supported -- no additional command line flags to configure should be necessary.
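
One way to confirm that affinity support was built (a hypothetical check; the relevant framework is named paffinity in older Open MPI versions and may differ in newer ones) is to look for the corresponding components:

shell$ ompi_info | grep paffinity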

See this FAQ entry for more details.


84. How do I build Open MPI with memory affinity / NUMA support (e.g., libnuma)?

Open MPI currently only supports libnuma memory affinity for Linux-based systems (please let us know if there are other NUMA libraries that you need supported!).

Support for libnuma must be explicitly requested with the "--with-libnuma" command line switch to Open MPI's configure script. In general, the procedure is the same as building support for high-speed interconnect networks, except that you use --with-libnuma. For example:

shell$ ./configure --with-libnuma=/path/to/libnuma/installation

After Open MPI is installed, you should see an maffinity component named "libnuma":

shell$ ompi_info | grep libnuma
           MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.0)

Specific frameworks and version numbers may vary, depending on your version of Open MPI.

See this FAQ entry for more details.


85. How do I build Open MPI with CUDA-aware support?

CUDA-aware support means that the MPI library can send and receive GPU buffers directly. This feature exists in the Open MPI 1.7 series and later. The support is being continuously updated so different levels of support exist in different versions.

Configuring Open MPI 1.7, MPI 1.7.1 and 1.7.2

  --with-cuda(=DIR)       Build cuda support, optionally adding DIR/include,
                          DIR/lib, and DIR/lib64
  --with-cuda-libdir=DIR  Search for cuda libraries in DIR

Here are some examples of configure commands that enable CUDA support.

1. Searches in default locations. Looks for cuda.h in /usr/local/cuda/include and libcuda.so in /usr/lib64.

 ./configure --with-cuda

2. Searches for cuda.h in /usr/local/cuda-v4.0/cuda/include and libcuda.so in default location of /usr/lib64.

 ./configure --with-cuda=/usr/local/cuda-v4.0/cuda

3. Searches for cuda.h in /usr/local/cuda-v4.0/cuda/include and libcuda.so in /usr/lib64. (same as previous one)

 ./configure --with-cuda=/usr/local/cuda-v4.0/cuda --with-cuda-libdir=/usr/lib64

If the cuda.h or libcuda.so files cannot be found, then the configure will abort.

Note: There is a bug in Open MPI 1.7.2 such that you will get an error if you configure the library with --enable-static. To get around this error, add the following to your configure line and reconfigure. This disables the build of the PML BFO, which is largely unused anyway. This bug is fixed in Open MPI 1.7.3.

 --enable-mca-no-build=pml-bfo

Configuring Open MPI 1.7.3 and later

With Open MPI 1.7.3 and later the libcuda.so library is loaded dynamically so there is no need to specify a path to it at configure time. Therefore, all you need is the path to the cuda.h header file.

1. Searches in default locations. Looks for cuda.h in /usr/local/cuda/include.

 ./configure --with-cuda

2. Searches for cuda.h in /usr/local/cuda-v5.0/cuda/include.

 ./configure --with-cuda=/usr/local/cuda-v5.0/cuda

Note that you cannot configure with --disable-dlopen as that will break the ability of the Open MPI library to dynamically load libcuda.so.

See this FAQ entry for details on how to use the CUDA support.


86. How do I not build a specific plugin / component for Open MPI?

The --enable-mca-no-build option to Open MPI's configure script enables you to specify a list of components that you want to skip building. This allows you to omit support for specific features from Open MPI if you do not want them.

It takes a single argument: a comma-delimited list of framework/component pairs indicating which specific components you do not want to build. For example:

shell$ ./configure --enable-mca-no-build=paffinity-linux,timer-solaris

Note that this option is really only useful for components that would otherwise be built. For example, if you are on a machine without Myrinet support, it is not necessary to specify:

shell$ ./configure --enable-mca-no-build=btl-gm

because the configure script will naturally see that you do not have support for GM and will automatically skip the gm BTL component.


87. What other options to configure exist?

There are many options to Open MPI's configure script. Please run the following to get a full list (including a short description of each option):

shell$ ./configure --help


88. Why does compiling the Fortran 90 bindings take soooo long?

NOTE: Starting with Open MPI v1.7, if you are not using gfortran, building the Fortran 90 and '08 bindings does not suffer the same performance penalty that previous versions incurred. The Open MPI developers encourage all users to upgrade to the new Fortran bindings implementation -- including the new MPI-3 Fortran '08 bindings -- when possible.

This is actually a design problem with the MPI F90 bindings themselves. The issue is that since F90 is a strongly typed language, we have to overload each function that takes a choice buffer with a typed buffer. For example, MPI_SEND has many different overloaded versions -- one for each type of the user buffer. Specifically, there is an MPI_SEND that has the following types for the first argument:

  • logical*1, logical*2, logical*4, logical*8, logical*16 (if supported)
  • integer*1, integer*2, integer*4, integer*8, integer*16 (if supported)
  • real*4, real*8, real*16 (if supported)
  • complex*8, complex*16, complex*32 (if supported)
  • character

On the surface, this is 17 bindings for MPI_SEND. Multiply this by every MPI function that takes a choice buffer (50) and you get 850 overloaded functions. However, the problem gets worse -- for each type, we also have to overload for each array dimension that needs to be supported. Fortran allows up to 7-dimensional arrays, so this becomes (17x7) = 119 versions of every MPI function that has a choice buffer argument. This makes (17x7x50) = 5,950 MPI interface functions.

To make matters even worse, consider the ~25 MPI functions that take 2 choice buffers. Functions have to be provided for all possible combinations of types. This then becomes exponential -- the total number of interface functions balloons up to 6.8M.

Additionally, F90 modules must all have their functions in a single source file. Hence, all 6.8M functions must be in one .f90 file and compiled as a single unit (currently, no F90 compiler that we are aware of can handle 6.8M interface functions in a single module).

To limit this problem, Open MPI, by default, does not generate interface functions for any of the 2-buffer MPI functions. Additionally, we limit the maximum number of supported dimensions to 4 (instead of 7). This means that we're generating (17x4x50) = 3,400 interface functions in a single F90 module. So it's far smaller than 6.8M functions, but it's still quite a lot.

This is what makes compiling the F90 module take so long.

Note, however, you can limit the maximum number of dimensions that Open MPI will generate for the F90 bindings with the configure switch --with-f90-max-array-dim=DIM, where DIM is an integer <= 7. The default value is 4. Decreasing this value makes the compilation go faster, but obviously supports fewer dimensions.
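
For example, a minimal sketch that trades away support for higher-dimension arrays in exchange for a faster build (the value 2 is just an illustration):

shell$ ./configure --with-f90-max-array-dim=2 [...your other configure arguments...]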

Other than this limit on dimension size, there is little else that we can do -- the MPI-2 F90 bindings were unfortunately not well thought out in this regard.

Note, however, that the Open MPI team has proposed Fortran '03 bindings for MPI in a paper that was presented at the Euro PVM/MPI'05 conference. These bindings avoid all the scalability problems that are described above and have some other nice properties.

This is something that is being worked on in Open MPI, but there is currently no estimated timeframe for when it will be available.


89. Does Open MPI support MPI_REAL16 and MPI_COMPLEX32?

It depends. Note that these datatypes are optional in the MPI standard.

Prior to v1.3, Open MPI supported MPI_REAL16 and MPI_COMPLEX32 if a portable C integer type could be found that was the same size (measured in bytes) as Fortran's REAL*16 type. It was later discovered that even though the sizes may be the same, the bit representations between C and Fortran may be different. Since Open MPI's reduction routines are implemented in C, calling MPI_REDUCE (and related functions) with MPI_REAL16 or MPI_COMPLEX32 would generate undefined results (although message passing with these types in homogeneous environments generally worked fine).

As such, Open MPI v1.3 made the test for supporting MPI_REAL16 and MPI_COMPLEX32 more stringent: Open MPI will support these types only if:

  • An integer C type can be found that has the same size (measured in bytes) as the Fortran REAL*16 type.
  • The bit representation is the same between the C type and the Fortran type.

Version 1.3.0 only checks for portable C types (e.g., long double). A future version of Open MPI may include support for compiler-specific / non-portable C types. For example, the Intel compiler has specific options for creating a C type that is the same as REAL*16, but we did not have time to include this support in Open MPI v1.3.0.


90. Can I re-locate my Open MPI installation without re-configuring/re-compiling/re-installing from source?

Starting with Open MPI v1.2.1, yes.

Background: Open MPI hard-codes some directory paths in its executables based on installation paths specified by the configure script. For example, if you configure with an installation prefix of /opt/openmpi/, Open MPI encodes in its executables that it should be able to find its help files in /opt/openmpi/share/openmpi.

The "installdirs" functionality in Open MPI lets you change any of these hard-coded directory paths at run time (assuming that you have already adjusted your PATH and/or LD_LIBRARY_PATH environment variables to the new location where Open MPI now resides). There are three methods:

  1. Move an existing Open MPI installation to a new prefix: Set the OPAL_PREFIX environment variable before launching Open MPI. For example, if Open MPI had initially been installed to /opt/openmpi and the entire openmpi tree was later moved to /home/openmpi, setting OPAL_PREFIX to /home/openmpi will enable Open MPI to function properly (see the sketch after this list).
  2. "Stage" an Open MPI installation in a temporary location: When creating self-contained installation packages, systems such as RPM install Open MPI into temporary locations. The package system then bundles up everything under the temporary location into a package that can be installed into its real location later. For example, when creating an RPM that will be installed to /opt/openmpi, the RPM system will transparently prepend a "destination directory" (or "destdir") to the installation directory. As such, Open MPI will think that it is installed in /opt/openmpi, but it is actually temporarily installed in (for example) /var/rpm/build.1234/opt/openmpi. If it is necessary to use Open MPI while it is installed in this staging area, the OPAL_DESTDIR environment variable can be used; setting OPAL_DESTDIR to /var/rpm/build.1234 will automatically prefix every directory such that Open MPI can function properly.
  3. Overriding individual directories: Open MPI uses the GNU-specified directories (per Autoconf/Automake), and these can be overridden by setting environment variables directly related to their common names. The list of environment variables that can be used is:

    • OPAL_PREFIX
    • OPAL_EXEC_PREFIX
    • OPAL_BINDIR
    • OPAL_SBINDIR
    • OPAL_LIBEXECDIR
    • OPAL_DATAROOTDIR
    • OPAL_DATADIR
    • OPAL_SYSCONFDIR
    • OPAL_SHAREDSTATEDIR
    • OPAL_LOCALSTATEDIR
    • OPAL_LIBDIR
    • OPAL_INCLUDEDIR
    • OPAL_INFODIR
    • OPAL_MANDIR
    • OPAL_PKGDATADIR
    • OPAL_PKGLIBDIR
    • OPAL_PKGINCLUDEDIR

    Note that not all of the directories listed above are used by Open MPI; they are listed here in their entirety for completeness.

    Also note that several directories listed above are defined in terms of other directories. For example, the $bindir is defined by default as $prefix/bin. Hence, overriding the $prefix (via OPAL_PREFIX) will automatically change the first part of the $bindir (which is how method 1 described above works). Alternatively, OPAL_BINDIR can be set to an absolute value that ignores $prefix altogether.
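
As a minimal sketch of method 1 above (all paths are hypothetical):

# The tree was originally installed with a prefix of /opt/openmpi,
# but has since been moved to /home/me/openmpi
shell$ export OPAL_PREFIX=/home/me/openmpi
shell$ export PATH=/home/me/openmpi/bin:$PATH
shell$ export LD_LIBRARY_PATH=/home/me/openmpi/lib:$LD_LIBRARY_PATH
shell$ mpirun -np 4 my_parallel_application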


91. I'm still having problems / my problem is not listed here. What do I do?

Please see this FAQ category for troubleshooting tips and the Getting Help page -- it details how to send a request to the Open MPI mailing lists.


92. In general, how do I build MPI applications with Open MPI?

The Open MPI team strongly recommends that you simply use Open MPI's "wrapper" compilers to compile your MPI applications. That is, instead of using (for example) gcc to compile your program, use mpicc. Open MPI provides wrapper compilers for C, C++, and Fortran:

Language  Wrapper compiler name
C         mpicc
C++       mpiCC, mpicxx, or mpic++ (note that mpiCC will not exist on case-insensitive filesystems)
Fortran   mpifort (for v1.7 and above); mpif77 and mpif90 (for older versions)

Hence, if you expect to compile your program as:

shell$ gcc my_mpi_application.c -o my_mpi_application

Simply use the following instead:

shell$ mpicc my_mpi_application.c -o my_mpi_application

Note that Open MPI's wrapper compilers do not do any actual compiling or linking; all they do is manipulate the command line and add in all the relevant compiler / linker flags and then invoke the underlying compiler / linker (hence, the name "wrapper" compiler). More specifically, if you run into a compiler or linker error, check your source code and/or back-end compiler -- it is usually not the fault of the Open MPI wrapper compiler.


93. Wait -- what is mpifort? Shouldn't I use mpif77 and mpif90?

mpifort is a new name for the Fortran wrapper compiler that debuted in Open MPI v1.7.

It supports compiling all versions of Fortran, and utilizing all MPI Fortran interfaces (mpif.h, use mpi, and use mpi_f08). There is no need to distinguish between "Fortran 77" (which hasn't existed for 30+ years) and "Fortran 90" -- just use mpifort to compile all your Fortran MPI applications and don't worry about which dialect it is, nor which MPI Fortran interface it uses.
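
For example, compiling a Fortran MPI application with mpifort (a minimal sketch; the file name is just an illustration):

shell$ mpifort my_mpi_application.f90 -o my_mpi_application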

Other MPI implementations will also soon support a wrapper compiler named mpifort, so hopefully we can move the whole world to this simpler wrapper compiler name, and eliminate the use of mpif77 and mpif90.

Specifically: mpif77 and mpif90 are deprecated as of Open MPI v1.7. Although mpif77 and mpif90 still exist in Open MPI v1.7 for legacy reasons, they will likely be removed in some (undetermined) future release. It is in your interest to convert to mpifort now.

Also note that these names are literally just symbolic links to mpifort under the covers. So you're using mpifort whether you realize it or not. :-)

Basically, the 1980's called; they want their mpif77 wrapper compiler back. Let's let them have it.


94. I can't / don't want to use Open MPI's wrapper compilers. What do I do?

We repeat the above statement: the Open MPI Team strongly recommends that you use the wrapper compilers to compile and link MPI applications.

If you find yourself saying, "But I don't want to use wrapper compilers!", please humor us and try them. See if they work for you. Be sure to let us know if they do not work for you.

Many people base their "wrapper compilers suck!" mentality on bad behavior from poorly-implemented wrapper compilers in the mid-1990's. Things are much better these days; wrapper compilers can handle almost any situation, and are far more reliable than attempting to hard-code the Open MPI-specific compiler and linker flags yourself.

That being said, there are some -- very, very few -- situations where using wrapper compilers can be problematic -- such as nesting multiple wrapper compilers of multiple projects. Hence, Open MPI provides a workaround to find out what command line flags you need to compile MPI applications. There are generally two sets of flags that you need: compile flags and link flags.

# Show the flags necessary to compile MPI C applications
shell$ mpicc --showme:compile

# Show the flags necessary to link MPI C applications
shell$ mpicc --showme:link

The --showme:* flags work with all Open MPI wrapper compilers (specifically: mpicc, mpiCC / mpicxx / mpic++, mpifort, and if you really must use them, mpif77, mpif90).

Hence, if you need to use a compiler other than Open MPI's wrapper compilers, we advise you to run the appropriate Open MPI wrapper compiler with the --showme flags to see what Open MPI needs to compile / link, and then use those flags with your compiler.
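
For example, a hedged sketch of feeding the wrapper's flags to a different compiler (here, othercc is a placeholder for whatever compiler you need to use):

shell$ othercc $(mpicc --showme:compile) my_mpi_application.c \
    $(mpicc --showme:link) -o my_mpi_application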

NOTE: It is absolutely not sufficient to simply add "-lmpi" to your link line and assume that you will obtain a valid Open MPI executable.

NOTE: It is almost never a good idea to hard-code these results in a Makefile (or other build system). It is almost always best to run (for example) "mpicc --showme:compile" in a dynamic fashion to find out what you need. For example, GNU Make allows running commands and assigning their results to variables:

MPI_COMPILE_FLAGS = $(shell mpicc --showme:compile)
MPI_LINK_FLAGS = $(shell mpicc --showme:link)

my_app: my_app.c
        $(CC) $(MPI_COMPILE_FLAGS) my_app.c $(MPI_LINK_FLAGS) -o my_app


95. How do I override the flags specified by Open MPI's wrapper compilers? (v1.0 series)

NOTE: This answer applies to the v1.0 series of Open MPI only. If you are using a later series, please see this FAQ entry.

The wrapper compilers each construct command lines in the following form:

<compiler> <xCPPFLAGS> <xFLAGS> user_arguments <xLDFLAGS> <xLIBS>

Where <compiler> is replaced by the default back-end compiler for each language, and "x" is customized for each language (i.e., C, C++, F77, and F90).

By setting appropriate environment variables, a user can override default values used by the wrapper compilers. The list below shows the variables for each of the wrapper compilers; the Generic set applies to any wrapper compiler if the corresponding wrapper-specific variable is not set. For example, the value of $OMPI_LDFLAGS will be used with mpicc only if $OMPI_MPICC_LDFLAGS is not set.

  • Generic (any wrapper): preprocessor flags OMPI_CPPFLAGS, OMPI_CXXPPFLAGS, OMPI_F77PPFLAGS, OMPI_F90PPFLAGS; compiler flags OMPI_CFLAGS, OMPI_CXXFLAGS, OMPI_F77FLAGS, OMPI_F90FLAGS; linker flags OMPI_LDFLAGS; linker library flags OMPI_LIBS
  • mpicc: compiler OMPI_MPICC; preprocessor flags OMPI_MPICC_CPPFLAGS; compiler flags OMPI_MPICC_CFLAGS; linker flags OMPI_MPICC_LDFLAGS; linker library flags OMPI_MPICC_LIBS
  • mpicxx: compiler OMPI_MPICXX; preprocessor flags OMPI_MPICXX_CXXPPFLAGS; compiler flags OMPI_MPICXX_CXXFLAGS; linker flags OMPI_MPICXX_LDFLAGS; linker library flags OMPI_MPICXX_LIBS
  • mpif77: compiler OMPI_MPIF77; preprocessor flags OMPI_MPIF77_F77PPFLAGS; compiler flags OMPI_MPIF77_F77FLAGS; linker flags OMPI_MPIF77_LDFLAGS; linker library flags OMPI_MPIF77_LIBS
  • mpif90: compiler OMPI_MPIF90; preprocessor flags OMPI_MPIF90_F90PPFLAGS; compiler flags OMPI_MPIF90_F90FLAGS; linker flags OMPI_MPIF90_LDFLAGS; linker library flags OMPI_MPIF90_LIBS

NOTE: If you set a variable listed above, Open MPI will entirely replace the default value that was originally there. Hence, it is advisable to only replace these values when absolutely necessary.


96. How do I override the flags specified by Open MPI's wrapper compilers? (v1.1 series and beyond)

NOTE: This answer applies to the v1.1 and later series of Open MPI only. If you are using the v1.0 series, please see this FAQ entry.

The Open MPI wrapper compilers are driven by text files that contain, among other things, the flags that are passed to the underlying compiler. These text files are generated automatically for Open MPI and are customized for the compiler set that was selected when Open MPI was configured; it is not recommended that users edit these files.

Note that changing the underlying compiler may not work at all. For example, C++ and Fortran compilers are notoriously binary incompatible with each other (sometimes even within multiple releases of the same compiler). If you compile/install Open MPI with C++ compiler XYZ and then use the OMPI_CXX environment variable to change the mpicxx wrapper compiler to use the ABC C++ compiler, your application code may not compile and/or link. The traditional method of using multiple different compilers with Open MPI is to install Open MPI multiple times; each installation should be built/installed with a different compiler. This is annoying, but it is beyond the scope of Open MPI to be able to fix.

However, there are cases where it may be necessary or desirable to edit these files and add to or subtract from the flags that Open MPI selected. These files are installed in $pkgdatadir (which defaults to $prefix/share/openmpi) and are named <wrapper_name>-wrapper-data.txt. A few environment variables are available for run-time replacement of the wrapper's default values (from the text files):

Wrapper Compiler Compiler Preprocessor Flags Compiler Flags Linker Flags Linker Library Flags Data File
Open MPI wrapper compilers
mpicc OMPI_CC OMPI_CPPFLAGS OMPI_CFLAGS OMPI_LDFLAGS OMPI_LIBS mpicc-wrapper-data.txt
mpic++ OMPI_CXX OMPI_CPPFLAGS OMPI_CXXFLAGS OMPI_LDFLAGS OMPI_LIBS mpic++-wrapper-data.txt
mpiCC OMPI_CXX OMPI_CPPFLAGS OMPI_CXXFLAGS OMPI_LDFLAGS OMPI_LIBS mpiCC-wrapper-data.txt
mpifort OMPI_FC OMPI_CPPFLAGS OMPI_FCFLAGS OMPI_LDFLAGS OMPI_LIBS mpifort-wrapper-data.txt
mpif77 (deprecated as of v1.7) OMPI_F77 OMPI_CPPFLAGS OMPI_FFLAGS OMPI_LDFLAGS OMPI_LIBS mpif77-wrapper-data.txt
mpif90 (deprecated as of v1.7) OMPI_FC OMPI_CPPFLAGS OMPI_FCFLAGS OMPI_LDFLAGS OMPI_LIBS mpif90-wrapper-data.txt
OpenRTE wrapper compilers
ortecc ORTE_CC ORTE_CPPFLAGS ORTE_CFLAGS ORTE_LDFLAGS ORTE_LIBS ortecc-wrapper-data.txt
ortec++ ORTE_CXX ORTE_CPPFLAGS ORTE_CXXFLAGS ORTE_LDFLAGS ORTE_LIBS ortec++-wrapper-data.txt
OPAL wrapper compilers
opalcc OPAL_CC OPAL_CPPFLAGS OPAL_CFLAGS OPAL_LDFLAGS OPAL_LIBS opalcc-wrapper-data.txt
opalc++ OPAL_CXX OPAL_CPPFLAGS OPAL_CXXFLAGS OPAL_LDFLAGS OPAL_LIBS opalc++-wrapper-data.txt
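
For example, a minimal sketch of overriding the underlying compiler at run time using the environment variables from the table above (the compiler name clang is just an illustration, and may not be binary compatible with your Open MPI build -- see the caveat above):

shell$ OMPI_CC=clang mpicc my_mpi_application.c -o my_mpi_application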

Note that the values of these fields can be directly influenced by passing flags to Open MPI's configure script. The following options are available to configure:

  • --with-wrapper-cflags: Extra flags to add to CFLAGS when using mpicc.

  • --with-wrapper-cxxflags: Extra flags to add to CXXFLAGS when using mpiCC.

  • --with-wrapper-fflags: Extra flags to add to FFLAGS when using mpif77 (this option has disappeared in Open MPI 1.7 and will not return; see this FAQ entry for more details).

  • --with-wrapper-fcflags: Extra flags to add to FCFLAGS when using mpif90 and mpifort.

  • --with-wrapper-ldflags: Extra flags to add to LDFLAGS when using any of the wrapper compilers.

  • --with-wrapper-libs: Extra flags to add to LIBS when using any of the wrapper compilers.

The files cited in the above table are fairly simplistic "key=value" data formats. The following are several fields that are likely to be interesting for end-users:

  • project_short: Prefix for all environment variables. See below.
  • compiler_env: Specifies the base name of the environment variable that can be used to override the wrapper's underlying compiler at run-time. The full name of the environment variable is of the form <project_short>_<compiler_env>; see table above.
  • compiler_flags_env: Specifies the base name of the environment variable that can be used to override the wrapper's compiler flags at run-time. The full name of the environment variable is of the form <project_short>_<compiler_flags_env>; see table above.
  • compiler: The executable name of the underlying compiler.
  • extra_includes: Relative to $installdir, a list of directories to also list in the preprocessor flags to find header files.
  • preprocessor_flags: A list of flags passed to the preprocessor.
  • compiler_flags: A list of flags passed to the compiler.
  • linker_flags: A list of flags passed to the linker.
  • libs: A list of libraries passed to the linker.
  • required_file: If non-empty, check for the presence of this file before continuing. If the file is not there, the wrapper will abort saying that the language is not supported.
  • includedir: Directory containing Open MPI's header files. The proper compiler "include" flag is prepended to this directory and added into the preprocessor flags.
  • libdir: Directory containing Open MPI's library files. The proper compiler library-path flag (e.g., -L) is prepended to this directory and added into the linker flags.
  • module_option: This field only appears in mpif90. It is the flag that the Fortran 90 compiler requires to declare where module files are located.
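
As a purely illustrative sketch, a wrapper data file might contain entries along these lines (every value below is hypothetical and will differ on your system; consult the actual installed file):

# hypothetical excerpt of a <wrapper_name>-wrapper-data.txt file
project_short=OMPI
compiler_env=CC
compiler_flags_env=CFLAGS
compiler=gcc
extra_includes=openmpi
preprocessor_flags=
compiler_flags=-pthread
linker_flags=
libs=-lmpi -lopen-rte -lopen-pal -lm -ldl
required_file=
includedir=/opt/openmpi/include
libdir=/opt/openmpi/lib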


97. How can I tell what the wrapper compiler default flags are?

If the corresponding environment variables are not set, the wrappers will add -I$includedir and -I$includedir/openmpi (which usually map to $prefix/include and $prefix/include/openmpi, respectively) to the xFLAGS area, and add -L$libdir (which usually maps to $prefix/lib) to the xLDFLAGS area.

To obtain the values of the other flags, there are two main methods:

  1. Use the --showme option to any wrapper compiler. For example (lines broken here for readability):

    shell$ mpicc prog.c -o prog --showme
    gcc -I/path/to/openmpi/include -I/path/to/openmpi/include/openmpi/ompi \
    prog.c -o prog -L/path/to/openmpi/lib -lmpi \
    -lopen-rte -lopen-pal -lutil -lnsl -ldl -Wl,--export-dynamic -lm
    

    This shows a coarse-grained method for getting the entire command line, but does not tell you what each set of flags are (xFLAGS, xCPPFLAGS, xLDFLAGS, and xLIBS).

  2. Use the ompi_info command. For example:

    shell$ ompi_info --all | grep wrapper
       Wrapper extra CFLAGS:
     Wrapper extra CXXFLAGS:
       Wrapper extra FFLAGS:
      Wrapper extra FCFLAGS:
      Wrapper extra LDFLAGS: 
         Wrapper extra LIBS: -lutil -lnsl -ldl -Wl,--export-dynamic -lm
    

    This installation is only adding options in the xLIBS areas of the wrapper compilers; all other values are blank (remember: the -I's and -L's are implicit).

    Note that the --parsable option can be used to obtain machine-parsable versions of this output. For example:

    shell$ ompi_info --all --parsable | grep wrapper:extra
    option:wrapper:extra_cflags:
    option:wrapper:extra_cxxflags:
    option:wrapper:extra_fflags:
    option:wrapper:extra_fcflags:
    option:wrapper:extra_ldflags:
    option:wrapper:extra_libs:-lutil -lnsl  -ldl  -Wl,--export-dynamic -lm
    


98. Why does "mpicc --showme <some flags>" not show any MPI-relevant flags?

The output of commands similar to the following may be somewhat surprising:

shell$ mpicc -g --showme
gcc -g
shell$

Where are all the MPI-related flags, such as the necessary -I, -L, and -l flags?

The short answer is that these flags are not included in the wrapper compiler's underlying command line unless the wrapper compiler sees a filename argument. Specifically (output artificially wrapped below for readability):

shell$ mpicc -g --showme
gcc -g
shell$ mpicc -g foo.c --showme
gcc -I/opt/openmpi/include/openmpi -I/opt/openmpi/include -g foo.c
-Wl,-u,_munmap -Wl,-multiply_defined,suppress -L/opt/openmpi/lib -lmpi
-lopen-rte -lopen-pal -ldl

The second command had the filename "foo.c" in it, so the wrapper added all the relevant flags. This is specifically designed to allow usage such as the following:

shell$ mpicc --version --showme
gcc --version
shell$ mpicc --version
i686-apple-darwin8-gcc-4.0.1 (GCC) 4.0.1 (Apple Computer, Inc. build 5363)
Copyright (C) 2005 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

shell$

That is, the wrapper compiler does not behave differently when constructing the underlying command line if "--showme" is used or not. The only difference is whether the resulting command line is displayed or executed.

Hence, this behavior allows users to pass arguments to the underlying compiler without intending to actually compile or link (such as passing --version to query the underlying compiler's version). If the wrapper compilers added more flags in these cases, some underlying compilers would emit warnings.


99. Are there ways to just add flags to the wrapper compilers?

Yes!

Open MPI's configure script allows you to add command line flags to the wrappers on a permanent basis. The following configure options are available (they are described in more detail in this FAQ entry):

  • --with-wrapper-cflags
  • --with-wrapper-cxxflags
  • --with-wrapper-fflags
  • --with-wrapper-fcflags
  • --with-wrapper-ldflags
  • --with-wrapper-libs

These configure options can be handy if you have some optional compiler/linker flags that you need both Open MPI and all MPI applications to be compiled with. Rather than trying to get all your users to remember to pass the extra flags to the compiler when compiling their applications, you can specify them with the configure options shown above, thereby silently including them in the Open MPI wrapper compilers -- your users will therefore be using the correct flags without ever knowing it.
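
For example, a hedged sketch that bakes a run-time library search path into every wrapper compiler (the path is purely illustrative; see the next FAQ entry for why this is not done by default):

shell$ ./configure --with-wrapper-ldflags="-Wl,-rpath,/opt/openmpi/lib" [...your other configure arguments...]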


100. Why don't the wrapper compilers add "-rpath" (or similar) flags by default?

The default installation of Open MPI tries very hard to not include any non-essential flags in the wrapper compilers. This is the most conservative setting and allows the greatest flexibility for end-users. If the wrapper compilers started adding flags to support specific features (such as run-time locations for finding the Open MPI libraries), such flags -- no matter how useful to some portion of users -- would almost certainly break assumptions and functionality for other users.

As a workaround, Open MPI provides several mechanisms for users to manually override the flags in the wrapper compilers:

  1. First and simplest, you can add your own flags to the wrapper compiler command line by simply listing them on the command line. For example:

    shell$ mpicc my_mpi_application.c -o my_mpi_application -rpath /path/to/openmpi/install/lib
    

  2. Use the --showme options to the wrapper compilers to dynamically see what flags the wrappers are adding, and modify them as appropriate. See this FAQ entry for more details.
  3. Use environment variables to override the arguments that the wrappers insert. If you are using Open MPI 1.0.x, see this FAQ entry; otherwise, see this FAQ entry.
  4. If you are using Open MPI 1.1 or later, you can modify the text files that provide the system-wide default flags for the wrapper compilers. See this FAQ entry for more details.
  5. If you are using Open MPI 1.1 or later, you can pass additional flags into the system-wide wrapper compiler default flags through Open MPI's configure script. See this FAQ entry for more details.

You can use one or more of these methods to insert your own flags (such as "-rpath" or similar).


101. Can I build 100% static MPI applications?

Fully static linking is not for the weak, and it is not recommended. But it is possible, with some caveats.

  1. You must have static libraries available for everything that your program links to. This includes Open MPI; you must have used the --enable-static option to Open MPI's configure or otherwise have available the static versions of the Open MPI libraries (note that Open MPI static builds default to including all of its plugins in its libraries -- as opposed to having each plugin in its own dynamic shared object file. So all of Open MPI's code will be contained in the static libraries -- even what are normally contained in Open MPI's plugins). Note that some popular Linux libraries do not have static versions by default (e.g., libnuma), or require additional RPMs to be installed to get the equivalent libraries.
  2. Open MPI must have been built without a memory manager. This means that Open MPI must have been configured with the --without-memory-manager flag. This is irrelevant on some platforms for which Open MPI does not have a memory manager, but on some platforms it is necessary (Linux). It is harmless to use this flag on platforms where Open MPI does not have a memory manager. Not having a memory manager means that Open MPI's mpi_leave_pinned behavior for OS-bypass networks such as InfiniBand will not work.
  3. On some systems (Linux), you may see linker warnings about some files requiring dynamic libraries for functions such as gethostname and dlopen. These are ok, but do mean that you need to have the shared libraries installed. You can disable all of Open MPI's dlopen behavior (i.e., prevent it from trying to open any plugins) by specifying the --disable-dlopen flag to Open MPI's configure script. This will eliminate the linker warnings about dlopen.

For example, this is how to configure Open MPI to build static libraries on Linux:

shell$ ./configure --without-memory-manager --without-libnuma \
  --enable-static [...your other configure arguments...]

Some systems may have additional constraints about their support libraries that require additional steps to produce working 100% static MPI applications. For example, the libibverbs support library from OpenIB / OFED has its own plugin system (which, by default, won't work with an otherwise-static application); MPI applications need additional compiler/linker flags to be specified to create a working 100% static MPI application. See this FAQ entry for the details.


102. Can I build 100% static OpenFabrics / OpenIB / OFED MPI applications on Linux?

Fully static linking is not for the weak, and it is not recommended. But it is possible. First, you must read this FAQ entry.

For an OpenFabrics / OpenIB / OFED application to be built statically, you must have libibverbs v1.0.4 or later (v1.0.4 was released after OFED 1.1, so if you have OFED 1.1, you will manually need to upgrade your libibverbs). Both libibverbs and your verbs hardware plugin must be available in static form.

Once all of that has been setup, run the following (artificially wrapped sample output shown below -- your output may be slightly different):

shell$ mpicc your_app.c -o your_app --showme
gcc -I/opt/openmpi/include/openmpi \
-I/opt/openmpi/include -pthread ring.c -o ring \
-L/usr/local/ofed/lib -L/usr/local/ofed/lib64/infiniband \
-L/usr/local/ofed/lib64 -L/opt/openmpi/lib -lmpi -lopen-rte \
-lopen-pal -libverbs -lrt -Wl,--export-dynamic -lnsl -lutil -lm -ldl

(or use whatever wrapper compiler is relevant -- the --showme flag is the important part here)

This example shows the steps for the GNU compiler suite, but other compilers will be similar. This example also assumes that the OpenFabrics / OpenIB / OFED install was rooted at /usr/local/ofed; some distributions install under /usr/ofed (or elsewhere). Finally, some installations use the library directory "lib64" while others use "lib". Adjust your directory names as appropriate.

Take the output from the above command and run it manually to compile and link your application, adding the following highlighted arguments:

shell$ gcc -static -I/opt/openmpi/include/openmpi \
  -I/opt/openmpi/include -pthread ring.c -o ring \
  -L/usr/local/ofed/lib -L/usr/local/ofed/lib64/infiniband \
  -L/usr/local/ofed/lib64 -L/opt/openmpi/lib -lmpi -lopen-rte \
  -lopen-pal -Wl,--whole-archive -libverbs /usr/local/ofed/lib64/infiniband/mthca.a \
  -Wl,--no-whole-archive -lrt -Wl,--export-dynamic -lnsl -lutil \
  -lm -ldl

Note that the mthca.a file is the verbs plugin for Mellanox HCAs. If you have an HCA from a different vendor (such as IBM or QLogic), use the appropriate filename (look in $ofed_libdir/infiniband for verbs plugin files for your hardware).

Specifically, these added arguments do the following:

  • -static: Tell the linker to generate a static executable.
  • -Wl,--whole-archive: Tell the linker to include the entire ibverbs library in the executable.
  • $ofed_root/lib64/infiniband/mthca.a: Include the Mellanox verbs plugin in the executable.
  • -Wl,--no-whole-archive: Tell the linker to return to the default of not including entire libraries in the executable.

You can either add these arguments in manually, or you can see this FAQ entry to modify the default behavior of the wrapper compilers to hide this complexity from end users (but be aware that if you modify the wrapper compilers' default behavior, all users will be creating static applications!).


103. Why does it take soooo long to compile F90 MPI applications?

NOTE: Starting with Open MPI v1.7, if you are not using gfortran, building the Fortran 90 and '08 bindings does not suffer the same performance penalty that previous versions incurred. The Open MPI developers encourage all users to upgrade to the new Fortran bindings implementation -- including the new MPI-3 Fortran'08 bindings -- when possible.

This is unfortunately due to a design flaw in the MPI F90 bindings themselves.

The answer to this question is exactly the same as it is for why it takes so long to compile the MPI F90 bindings in the Open MPI implementation; please see this FAQ entry for the details.


104. How do I build BLACS with Open MPI?

The blacs_install.ps file (available from that web site) describes how to build BLACS, so we won't repeat much of it here (especially since it might change in future versions). These instructions only pertain to making Open MPI work correctly with BLACS.

After selecting the appropriate starting Bmake.inc, make the following changes to Sections 1, 2, and 3. The example below is from the Bmake.MPI-SUN4SOL2; your Bmake.inc file may be different.

# Section 1:
# Ensure to use MPI for the communication layer

   COMMLIB = MPI

# The MPIINCdir macro is used to link in mpif.h and
# must contain the location of Open MPI's mpif.h.  
# The MPILIBdir and MPILIB macros are irrelevant 
# and should be left empty.

   MPIdir = /path/to/openmpi-1.8.1
   MPILIBdir =
   MPIINCdir = $(MPIdir)/include
   MPILIB =

# Section 2:
# Set these values:

   SYSINC =
   INTFACE = -Df77IsF2C
   SENDIS =
   BUFF =
   TRANSCOMM = -DUseMpi2
   WHATMPI =
   SYSERRORS =

# Section 3:
# You may need to specify the full path to
# mpif77 / mpicc if they aren't already in
# your path.
 
   F77            = mpif77
   F77LOADFLAGS   = 

   CC             = mpicc
   CCLOADFLAGS    = 

The remainder of the values are fairly obvious and irrelevant to Open MPI; you can set whatever optimization level you want, etc.

If you follow the rest of the instructions for building, BLACS will build correctly and use Open MPI as its MPI communication layer.


105. How do I build ScaLAPACK with Open MPI?

The scalapack_install.ps file (available from that web site) describes how to build ScaLAPACK, so we won't repeat much of it here (especially since it might change in future versions). These instructions only pertain to making Open MPI work correctly with ScaLAPACK. These instructions assume that you have built and installed BLACS with Open MPI.

# Make sure you follow the instructions to build BLACS with Open MPI,
# and put its location in the following.

   BLACSdir      = <path where you installed BLACS>

# The MPI section is commented out.  Uncomment it. The wrapper
# compiler will handle SMPLIB, so make it blank. The rest are correct
# as is.

   USEMPI        = -DUsingMpiBlacs
   SMPLIB        = 
   BLACSFINIT    = $(BLACSdir)/blacsF77init_MPI-$(PLAT)-$(BLACSDBGLVL).a
   BLACSCINIT    = $(BLACSdir)/blacsCinit_MPI-$(PLAT)-$(BLACSDBGLVL).a
   BLACSLIB      = $(BLACSdir)/blacs_MPI-$(PLAT)-$(BLACSDBGLVL).a
   TESTINGdir    = $(home)/TESTING

# The PVMBLACS setup needs to be commented out.

   #USEMPI        =
   #SMPLIB        = $(PVM_ROOT)/lib/$(PLAT)/libpvm3.a -lnsl -lsocket
   #BLACSFINIT    =
   #BLACSCINIT    =
   #BLACSLIB      = $(BLACSdir)/blacs_PVM-$(PLAT)-$(BLACSDBGLVL).a
   #TESTINGdir    = $(HOME)/pvm3/bin/$(PLAT)

# Make sure that the BLASLIB points to the right place.  We built this
# example on Solaris, hence the name below.  The Linux version of the
# library (as of this writing) is blas_LINUX.a.

   BLASLIB       = $(LAPACKdir)/blas_solaris.a

# You may need to specify the full path to mpif77 / mpicc if they
# aren't already in your path.
 
   F77            = mpif77
   F77LOADFLAGS   = 

   CC             = mpicc
   CCLOADFLAGS    = 

The remainder of the values are fairly obvious and irrelevant to Open MPI; you can set whatever optimization level you want, etc.

If you follow the rest of the instructions for building, ScaLAPACK will build correctly and use Open MPI as its MPI communication layer.


106. How do I build PETSc with Open MPI?

The only special configuration that you need to build PETSc is to ensure that Open MPI's wrapper compilers (i.e., mpicc and mpif77) are in your $PATH before running the PETSc configure.py script.

PETSc should then automatically find Open MPI's wrapper compilers and correctly build itself using Open MPI.


107. How do I build VASP with Open MPI?

The following was reported by an Open MPI user who was able to successfully build and run VASP with Open MPI:

I just compiled the latest VASP v4.6 using Open MPI v1.2.1, ifort v9.1, ACML v3.6.0, BLACS with patch-03 and Scalapack v1.7.5 built with ACML.

I configured Open MPI with --enable-static flag.

I used the VASP supplied makefile.linux_ifc_opt and only corrected the paths to the ACML, scalapack, and BLACS dirs (I didn't lower the optimization to -O0 for mpi.f like I suggested before). The -D's are standard except I get a little better performance with -DscaLAPACK (I tested it with out this option too):

CPP    = $(CPP_) -DMPI  -DHOST="LinuxIFC" -DIFC \
     -Dkind8 -DNGZhalf -DCACHE_SIZE=4000 -DPGF90 -Davoidalloc \
     -DMPI_BLOCK=2000  \
     -Duse_cray_ptr -DscaLAPACK

Also, Blacs and Scalapack used the -D's suggested in the Open MPI FAQ.


108. Are other language / application bindings available for Open MPI?

Other MPI language bindings and application-level programming interfaces have been written by third parties. Here are links to some of the available packages:

...we used to maintain a list of links here. But the list changes over time; projects come, and projects go. Your best bet these days is simply to use Google to find MPI bindings and application-level programming interfaces.


109. What pre-requisites are necessary for running an Open MPI job?

In general, Open MPI requires that its executables are in your PATH on every node on which you will run. Additionally, if Open MPI was compiled as dynamic libraries (which is the default), the directory where its libraries are located must be in your LD_LIBRARY_PATH on every node.

Specifically, if Open MPI was installed with a prefix of /opt/openmpi, then the following should be in your PATH and LD_LIBRARY_PATH:

PATH:            /opt/openmpi/bin
LD_LIBRARY_PATH: /opt/openmpi/lib

Depending on your environment, you may need to set these values in your shell startup files (e.g., .profile, .cshrc, etc.).

NOTE: there are exceptions to this rule -- notably the --prefix option to mpirun.

See this FAQ entry for more details on how to add Open MPI to your PATH and LD_LIBRARY_PATH.

Additionally, Open MPI requires that jobs can be started on remote nodes without any input from the keyboard. For example, if using rsh or ssh as the remote agent, you must have your environment setup to allow execution on remote nodes without entering a password or passphrase.


110. What ABI guarantees does Open MPI provide?

Open MPI's versioning and ABI scheme is described here; it is summarized in this FAQ entry for convenience.

Open MPI provided forward application binary interface (ABI) compatibility for MPI applications starting with v1.3.2. Prior to that version, no ABI guarantees were provided.

NOTE: Prior to v1.3.2, subtle and strange failures are almost guaranteed to occur if applications were compiled and linked against shared libraries from one version of Open MPI and then run with another. The Open MPI team strongly discourages making any ABI assumptions before v1.3.2.

NOTE: ABI for the "use mpi" Fortran interface was inadvertantly broken in the v1.6.3 release, and was restored in the v1.6.4 release. Any Fortran applications that utilize the "use mpi" MPI interface that were compiled and linked against the v1.6.3 release will not be link-time compatible with other releases in the 1.5.x / 1.6.x series. Such applications remain source compatible, however, and can be recompiled/re-linked with other Open MPI releases.

Starting with v1.3.2, Open MPI provides forward ABI compatibility -- with respect to the MPI API only -- in all versions of a given feature release series and its corresponding super stable series. For example, on a single platform, an MPI application linked against Open MPI v1.3.2 shared libraries can be updated to point to the shared libraries in any successive v1.3.x or v1.4 release and still work properly (e.g., via the LD_LIBRARY_PATH environment variable or other operating system mechanism).

For the v1.5 series, this means that all releases of v1.5.x and v1.6.x will be ABI compatible, per the above definition.

Open MPI reserves the right to break ABI compatibility at new feature release series. For example, the same MPI application from above (linked against Open MPI v1.3.2 shared libraries) will not work with Open MPI v1.5 shared libraries. Similarly, MPI applications compiled/linked against Open MPI 1.6.x will not be ABI compatible with Open MPI 1.7.x.


111. Do I need a common filesystem on all my nodes?

No, but it certainly makes life easier if you do.

A common environment in which to run Open MPI is a "Beowulf"-class or similar cluster (e.g., a bunch of 1U servers in a bunch of racks). Simply stated, Open MPI can run on a group of servers or workstations connected by a network. As mentioned above, there are several prerequisites, however (for example, you typically must have an account on all the machines, you must be able to rsh or ssh between the nodes without using a password, etc.).

Regardless of whether Open MPI is installed on a shared / networked filesystem or independently on each node, it is usually easiest if Open MPI is available in the same filesystem location on every node. For example, if you install Open MPI to /opt/openmpi-1.8.1 on one node, ensure that it is available in /opt/openmpi-1.8.1 on all nodes.

This FAQ entry has a bunch more information about installation locations for Open MPI.


112. How do I add Open MPI to my PATH and LD_LIBRARY_PATH?

Open MPI must be able to find its executables in your PATH on every node (if Open MPI was compiled as dynamic libraries, then its library path must appear in LD_LIBRARY_PATH as well). As such, your configuration/initialization files need to add Open MPI to your PATH / LD_LIBRARY_PATH properly.

How to do this may be highly dependent upon your local configuration, so you may need to consult with your local system administrator. Some system administrators take care of these details for you, some don't. YMMV. Some common examples are included below, however.

You must have at least a minimum understanding of how your shell works to get Open MPI in your PATH / LD_LIBRARY_PATH properly. Note that Open MPI must be added to your PATH and LD_LIBRARY_PATH in two situations: (1) when you log in to an interactive shell, and (2) when you log in to non-interactive shells on remote nodes.

  • If (1) is not configured properly, executables like mpicc will not be found, and it is typically obvious what is wrong. The Open MPI executable directory can manually be added to the PATH, or the user's startup files can be modified such that the Open MPI executables are added to the PATH at every login. This latter approach is preferred.

    All shells have some kind of script file that is executed at login time to set things like PATH and LD_LIBRARY_PATH and perform other environmental setup tasks. This startup file is the one that needs to be edited to add Open MPI to the PATH and LD_LIBRARY_PATH. Consult the manual page for your shell for specific details (some shells are picky about the permissions of the startup file, for example). The table below lists some common shells and the startup files that they read/execute upon login:

    Shell Interactive login startup file
    sh (Bourne shell, or bash named "sh") .profile
    csh .cshrc followed by .login
    tcsh .tcshrc if it exists, .cshrc if it does not, followed by .login
    bash .bash_profile if it exists, or .bash_login if it exists, or .profile if it exists (in that order). Note that some Linux distributions automatically come with .bash_profile scripts for users that automatically execute .bashrc as well. Consult the bash man page for more information.

  • If (2) is not configured properly, executables like mpirun will not function properly, and it can be somewhat confusing to figure out (particularly for bash users).

    The startup files in question here are the ones that are automatically executed for a non-interactive login on a remote node (e.g., "rsh othernode ps"). Note that not all shells support this, and that some shells use different files for this than listed in (1). Some shells will supersede (2) with (1). That is, fulfilling (2) may automatically fulfill (1). The following table lists some common shells and the startup file that is automatically executed, either by Open MPI or by the shell itself:

    Shell Non-interactive login startup file
    sh (Bourne or bash named "sh") This shell does not execute any file automatically, so Open MPI will execute the .profile script before invoking Open MPI executables on remote nodes
    csh .cshrc
    tcsh .tcshrc if it exists, or .cshrc if it does not
    bash .bashrc if it exists
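
For example, here is a minimal sketch of the lines that a bash user might add to their startup file (the installation prefix /opt/openmpi is just an illustration; use your actual installation location):

# Add Open MPI's executables and libraries to the search paths
export PATH=/opt/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH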


113. What if I can't modify my PATH and/or LD_LIBRARY_PATH?

There are some situations where you cannot modify the PATH or LD_LIBRARY_PATH -- e.g., some ISV applications prefer to hide all parallelism from the user, and therefore do not want to make the user modify their shell startup files. Another case is where you want a single user to be able to launch multiple MPI jobs simultaneously, each with a different MPI implementation. Hence, setting shell startup files to point to one MPI implementation would be problematic.

In such cases, you have two options:

  1. Use mpirun's --prefix command line option (described below).
  2. Modify the wrapper compilers to include directives to include run-time search locations for the Open MPI libraries (see this FAQ entry)

mpirun's --prefix command line option takes as an argument the top-level directory where Open MPI was installed. While relative directory names are possible, they can become ambiguous depending on the job launcher used; using absolute directory names is strongly recommended.

For example, say that Open MPI was installed into /opt/openmpi-1.8.1. You would use the --prefix option like this:

shell$ mpirun --prefix /opt/openmpi-1.8.1 -np 4 a.out

This will prefix the PATH and LD_LIBRARY_PATH on both the local and remote hosts with /opt/openmpi-1.8.1/bin and /opt/openmpi-1.8.1/lib, respectively. This is usually unnecessary when using resource managers to launch jobs (e.g., SLURM, Torque, etc.) because they tend to copy the entire local environment -- to include the PATH and LD_LIBRARY_PATH -- to remote nodes before execution. As such, if PATH and LD_LIBRARY_PATH are set properly on the local node, the resource manager will automatically propagate those values out to remote nodes. The --prefix option is therefore usually most useful in rsh or ssh-based environments (or similar).

Beginning with the 1.2 series, it is possible to make this the default behavior by passing to configure the flag --enable-mpirun-prefix-by-default. This will make mpirun behave exactly the same as "mpirun --prefix $prefix ...", where $prefix is the value given to --prefix in configure.
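
For example, a minimal sketch of configuring with this behavior enabled (the prefix is just an illustration):

shell$ ./configure --prefix=/opt/openmpi-1.8.1 --enable-mpirun-prefix-by-default [...your other configure arguments...]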

Finally, note that specifying the absolute pathname to mpirun is equivalent to using the --prefix argument. For example, the following is equivalent to the above command line that uses --prefix:

shell$ /opt/openmpi-1.8.1/bin/mpirun -np 4 a.out


114. How do I launch Open MPI parallel jobs?

Similar to many MPI implementations, Open MPI provides the commands mpirun and mpiexec to launch MPI jobs. Several of the questions in this FAQ category deal with using these commands.

Note, however, that these commands are exactly identical. Specifically, they are symbolic links to a common back-end launcher command named orterun (Open MPI's run-time environment interaction layer is named the Open Run-Time Environment, or ORTE -- hence orterun).

As such, the rest of this FAQ usually refers only to mpirun, even though the same discussions also apply to mpiexec and orterun (because they are all, in fact, the same command).


115. How do I run a simple SPMD MPI job?

Open MPI provides both mpirun and mpiexec commands. A simple way to start a single program, multiple data (SPMD) application in parallel is:

shell$ mpirun -np 4 my_parallel_application

This starts a four-process parallel application, running four copies of the executable named my_parallel_application.

The rsh starter component accepts the --hostfile (also known as --machinefile) option to indicate which hosts to start the processes on:

shell$ cat my_hostfile
host01.example.com
host02.example.com
shell$ mpirun --hostfile my_hostfile -np 4 my_parallel_application

This command will launch one copy of my_parallel_application on each of host01.example.com and host02.example.com.

More information about the --hostfile option, and hostfiles in general, is available in this FAQ entry.

Note, however, that not all environments require a hostfile. For example, Open MPI will automatically detect when it is running in batch / scheduled environments (such as SGE, PBS/Torque, SLURM, and LoadLeveler), and will use host information provided by those systems.

Also note that if using a launcher that requires a hostfile and no hostfile is specified, all processes are launched on the local host.


116. How do I run an MPMD MPI job?

Both the mpirun and mpiexec commands support multiple program, multiple data (MPMD) style launches, either from the command line or from a file. For example:

shell$ mpirun -np 2 a.out : -np 2 b.out

This will launch a single parallel application, but the first two processes will be instances of the a.out executable, and the second two processes will be instances of the b.out executable. In MPI terms, this will be a single MPI_COMM_WORLD, but the a.out processes will be ranks 0 and 1 in MPI_COMM_WORLD, while the b.out processes will be ranks 2 and 3 in MPI_COMM_WORLD.

mpirun (and mpiexec) can also accept a parallel application specified in a file instead of on the command line. For example:

shell$ mpirun --app my_appfile

where the file my_appfile contains the following:

# Comments are supported; comments begin with #
# Application context files specify each sub-application in the
# parallel job, one per line.  The first sub-application is the 2
# a.out processes:
-np 2 a.out
# The second sub-application is the 2 b.out processes:
-np 2 b.out

This will result in the same behavior as running a.out and b.out from the command line.

Note that mpirun and mpiexec are identical in command-line options and behavior; using the above command lines with mpiexec instead of mpirun will result in the same behavior.


117. How do I specify the hosts on which my MPI job runs?

There are three general mechanisms:

  1. The --hostfile option to mpirun. Use this option to specify a list of hosts on which to run. Note that for compatibility with other MPI implementations, --machinefile is a synonym for --hostfile. See this FAQ entry for more information about the --hostfile option.
  2. The --host option to mpirun can be used on the command line to specify a list of hosts on which to run (a brief example appears below). See this FAQ entry for more information about the --host option.
  3. If you are running in a scheduled environment (e.g., in a SLURM, Torque, or LSF job), Open MPI will automatically get the lists of hosts from the scheduler.

NOTE: The specification of hosts using any of the above methods has nothing to do with the network interfaces that are used for MPI traffic. The list of hosts is only used to specify which hosts to launch MPI processes on.
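
For example, a minimal sketch using the --host option (the host names are placeholders):

shell$ mpirun --host host01.example.com,host02.example.com -np 4 my_parallel_application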


118. I can run ompi_info and launch MPI jobs on a single host, but not across multiple hosts. Why?

(you should probably see this FAQ entry, too)

If you can run ompi_info and possibly even launch MPI processes locally, but fail to launch MPI processes on remote hosts, it is likely that you do not have your PATH and/or LD_LIBRARY_PATH setup properly on the remote nodes.

Specifically, the Open MPI commands usually run properly even if LD_LIBRARY_PATH is not set properly because they encode the Open MPI library location in their executables and search there by default. Hence, running ompi_info (and friends) usually works, even in some improperly setup environments.

However, Open MPI's wrapper compilers do not encode the Open MPI library locations in MPI executables by default (the wrappers only specify a bare minimum of flags necessary to create MPI executables; we consider any flags beyond this bare minimum set a local policy decision). Hence, attempting to launch MPI executables in environments where LD_LIBRARY_PATH is either not set or was set improperly may result in messages about libmpi.so not being found.

You can change Open MPI's wrapper compiler behavior to specify the run-time location of Open MPI's libraries, if you wish.

Depending on how Open MPI was configured and/or invoked, it may even be possible to run MPI applications in environments where PATH and/or LD_LIBRARY_PATH is not set, or is set improperly. This can be desirable for environments where multiple MPI implementations are installed, such as multiple versions of Open MPI.


119. How can I diagnose problems when running across multiple hosts?

In addition to what is mentioned in this FAQ entry, when you are able to run MPI jobs on a single host, but fail to run them across multiple hosts, try the following:

  1. Ensure that your launcher is able to launch across multiple hosts. For example, if you are using ssh, try to ssh to each remote host and ensure that you are not prompted for a password. For example:

    shell$ ssh remotehost hostname
    remotehost
    

    If you are unable to launch across multiple hosts, check that your SSH keys are setup properly. Or, if you are running in a managed environment, such as in a SLURM, Torque, or other job launcher, check that you have reserved enough hosts, are running in an allocated job, etc.

  2. Ensure that your PATH and LD_LIBRARY_PATH are set correctly on each remote host on which you are trying to run. For example, with ssh:

    shell$ ssh remotehost env | grep -i path
    PATH=...path on the remote host...
    LD_LIBRARY_PATH=...LD library path on the remote host...
    

    If your PATH or LD_LIBRARY_PATH are not set properly, see this FAQ entry for the correct values. Keep in mind that it is fine to have multiple Open MPI installations installed on a machine; the first Open MPI installation found by PATH and LD_LIBRARY_PATH is the one that matters.

  3. Run a simple, non-MPI job across multiple hosts. This verifies that the Open MPI run-time system is functioning properly across multiple hosts. For example, try running the hostname command:

    shell$ mpirun --host remotehost hostname
    remotehost
    shell$ mpirun --host remotehost,otherhost hostname
    remotehost
    otherhost
    

    If you are unable to run non-MPI jobs across multiple hosts, check for common problems such as:

    1. Check your non-interactive shell setup on each remote host to ensure that it is setting the PATH and LD_LIBRARY_PATH properly.
    2. Check that Open MPI is finding and launching the correct version of Open MPI on the remote hosts.
    3. Ensure that you have firewalling disabled between hosts (Open MPI opens random TCP and sometimes random UDP ports between hosts in a single MPI job).
    4. Try running with the plm_base_verbose MCA parameter at level 10, which will enable extra debugging output to see how Open MPI launches on remote hosts. For example: mpirun --mca plm_base_verbose 10 --host remotehost hostname

  4. Now run a simple MPI job across multiple hosts that does not involve MPI communications. The "hello_c" program in the examples directory in the Open MPI distribution is a good choice. This verifies that the MPI subsystem is able to initialize and terminate properly. For example:

    shell$ mpirun --host remotehost,otherhost hello_c
    Hello, world, I am 0 of 1, (Open MPI v1.7.5, package: Open MPI jsquyres@builder.cisco.com Distribution, ident: 1.7.5, Mar 20, 2014, 99)
    Hello, world, I am 1 of 1, (Open MPI v1.7.5, package: Open MPI jsquyres@builder.cisco.com Distribution, ident: 1.7.5, Mar 20, 2014, 99)
    

    If you are unable to run simple, non-communication MPI jobs, this can indicate that your Open MPI installation is unable to initialize properly on remote hosts. Double check your non-interactive login setup on remote hosts.

  5. Now run a simple MPI job across multiple hosts that does some simple MPI communications. The "ring_c" program in the examples directory in the Open MPI distribution is a good choice. This verifies that the MPI subsystem is able to pass MPI traffic across your network. For example:

    shell$ mpirun --host remotehost,otherhost ring_c
    Process 0 sending 10 to 0, tag 201 (1 processes in ring)
    Process 0 sent to 0
    Process 0 decremented value: 9
    Process 0 decremented value: 8
    Process 0 decremented value: 7
    Process 0 decremented value: 6
    Process 0 decremented value: 5
    Process 0 decremented value: 4
    Process 0 decremented value: 3
    Process 0 decremented value: 2
    Process 0 decremented value: 1
    Process 0 decremented value: 0
    Process 0 exiting
    

    If you are unable to run simple MPI jobs across multiple hosts, this may indicate a problem with the network(s) that Open MPI is trying to use for MPI communications. Try limiting the networks that it uses, and/or exploring levels 1 through 3 MCA parameters for the communications module that you are using. For example, if you're using the TCP BTL, see the output of ompi_info --level 3 --param btl tcp .


120. When I build Open MPI with the Intel compilers, I get warnings about "orted" or my MPI application not finding libimf.so. What do I do?

The problem is usually because the Intel libraries cannot be found on the node where Open MPI is attempting to launch an MPI executable. For example:

shell$ mpirun -np 1 --host node1.example.com mpi_hello
orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
A daemon (pid 11893) died unexpectedly with status 127 while
attempting to launch so we are aborting.
[...more error messages...]

Open MPI first attempts to launch a "helper" daemon (orted) on node1.example.com, but it failed because one of orted's dependent libraries could not be found. This particular library, libimf.so, is an Intel compiler library. As such, it is likely that the user did not set up the Intel compiler library in their environment properly on this node.

Double check that you have setup the Intel compiler environment on the target node, for both interactive and non-interactive logins. It is a common error to ensure that the Intel compiler environment is setup properly for interactive logins, but not for non-interactive logins. For example:

shell$ cd $HOME
shell$ mpicc mpi_hello.c -o mpi_hello
shell$ ./mpi_hello
Hello world, I am 0 of 1.
shell$ ssh node1.example.com
Welcome to node1.
node1 shell$ ./mpi_hello
Hello world, I am 0 of 1.
node1 shell$ exit
shell$ ssh node1.example.com $HOME/mpi_hello
mpi_hello: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory

The above example shows that running a trivial C program compiled by the Intel compilers works fine on both the head node and node1 when logging in interactively, but fails when run on node1 non-interactively. Check your shell script startup files and verify that the Intel compiler environment is set up properly for non-interactive logins.
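
If the interactive case works but the non-interactive case does not, one quick way to narrow it down is to ask the dynamic linker what it can resolve over a non-interactive ssh session, and then add the Intel environment setup to a startup file that is read by non-interactive shells. The sketch below assumes a Bourne-style shell and a hypothetical Intel installation path; adjust both for your site:

shell$ ssh node1.example.com ldd $HOME/mpi_hello | grep "not found"
        libimf.so => not found
shell$ echo 'source /opt/intel/bin/compilervars.sh intel64' >> $HOME/.bashrc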


121. When I build Open MPI with the PGI compilers, I get warnings about "orted" or my MPI application not finding libpgc.so. What do I do?

The problem is usually because the PGI libraries cannot be found on the node where Open MPI is attempting to launch an MPI executable. For example:

shell$ mpirun -np 1 --host node1.example.com mpi_hello
orted: error while loading shared libraries: libpgc.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
A daemon (pid 11893) died unexpectedly with status 127 while
attempting to launch so we are aborting.
[...more error messages...]

Open MPI first attempts to launch a "helper" daemon (orted) on node1.example.com, but the launch failed because one of orted's dependent libraries could not be found. This particular library, libpgc.so, is a PGI compiler library. As such, it is likely that the PGI compiler environment was not set up properly on this node.

Double check that you have set up the PGI compiler environment on the target node, for both interactive and non-interactive logins. A common mistake is to have the PGI compiler environment set up properly for interactive logins, but not for non-interactive logins. For example:

shell$ cd $HOME
shell$ mpicc mpi_hello.c -o mpi_hello
shell$ ./mpi_hello
Hello world, I am 0 of 1.
shell$ ssh node1.example.com
Welcome to node1.
node1 shell$ ./mpi_hello
Hello world, I am 0 of 1.
node1 shell$ exit
shell$ ssh node1.example.com $HOME/mpi_hello
mpi_hello: error while loading shared libraries: libpgc.so: cannot open shared object file: No such file or directory

The above example shows that running a trivial C program compiled by the PGI compilers works fine on both the head node and node1 when logging in interactively, but fails when run on node1 non-interactively. Check your shell script startup files and verify that the PGI compiler environment is set up properly for non-interactive logins.


122. When I build Open MPI with the Pathscale compilers, I get warnings about "orted" or my MPI application not finding libmv.so. What do I do?

The problem is usually because the Pathscale libraries cannot be found on the node where Open MPI is attempting to launch an MPI executable. For example:

shell$ mpirun -np 1 --host node1.example.com mpi_hello
orted: error while loading shared libraries: libmv.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
A daemon (pid 11893) died unexpectedly with status 127 while
attempting to launch so we are aborting.
[...more error messages...]

Open MPI first attempts to launch a "helper" daemon (orted) on node1.example.com, but the launch failed because one of orted's dependent libraries could not be found. This particular library, libmv.so, is a Pathscale compiler library. As such, it is likely that the Pathscale compiler environment was not set up properly on this node.

Double check that you have set up the Pathscale compiler environment on the target node, for both interactive and non-interactive logins. A common mistake is to have the Pathscale compiler environment set up properly for interactive logins, but not for non-interactive logins. For example:

shell$ cd $HOME
shell$ mpicc mpi_hello.c -o mpi_hello
shell$ ./mpi_hello
Hello world, I am 0 of 1.
shell$ ssh node1.example.com
Welcome to node1.
node1 shell$ ./mpi_hello
Hello world, I am 0 of 1.
node1 shell$ exit
shell$ ssh node1.example.com $HOME/mpi_hello
mpi_hello: error while loading shared libraries: libmv.so: cannot open shared object file: No such file or directory

The above example shows that running a trivial C program compiled by the Pathscale compilers works fine on both the head node and node1 when logging in interactively, but fails when run on node1 non-interactively. Check your shell script startup files and verify that the Pathscale compiler environment is set up properly for non-interactive logins.


123. Can I run non-MPI programs with mpirun / mpiexec?

Yes.

Indeed, Open MPI's mpirun and mpiexec are actually synonyms for our underlying launcher named orterun (i.e., the Open Run-Time Environment layer in Open MPI, or ORTE). So you can use mpirun and mpiexec to launch any application. For example:

shell$ mpirun -np 2 --host a,b uptime

This will launch a copy of the unix command uptime on the hosts a and b.

Other questions in the FAQ section deal with the specifics of the mpirun command line interface; suffice it to say that it works equally well for MPI and non-MPI applications.


124. Can I run GUI applications with Open MPI?

Yes, but it will depend on your local setup and may require additional setup.

In short: you will need to have X forwarding enabled from the remote processes to the display where you want output to appear. In a secure environment, you can simply allow all X requests to be shown on the target display and set the DISPLAY environment variable in all MPI processes' environments to the target display, perhaps something like this:

shell$ hostname
my_desktop.secure-cluster.example.com
shell$ xhost +
shell$ mpirun -np 4 -x DISPLAY=my_desktop.secure-cluster.example.com a.out

However, this technique is not generally suitable for insecure environments (because it allows anyone to read and write to your display). A slightly more secure way is to only allow X connections from the nodes where your application will be running:

shell$ hostname
my_desktop.secure-cluster.example.com
shell$ xhost +compute1 +compute2 +compute3 +compute4
compute1 being added to access control list
compute2 being added to access control list
compute3 being added to access control list
compute4 being added to access control list
shell$ mpirun -np 4 -x DISPLAY=my_desktop.secure-cluster.example.com a.out

(assuming that the four nodes you are running on are compute1 through compute4).

Other methods are available, but they involve sophisticated X forwarding through mpirun and are generally more complicated than desirable.


125. Can I run ncurses-based / curses-based / applications with funky input schemes with Open MPI?

Maybe. But probably not.

Open MPI provides fairly sophisticated stdin / stdout / stderr forwarding. However, it does not work well with curses, ncurses, readline, or other sophisticated I/O packages that generally require direct control of the terminal.

Every application and I/O library is different -- you should try to see if yours is supported. But chances are that it won't work.

Sorry. :-(


126. What other options are available to mpirun?

mpirun supports the "--help" option which provides a usage message and a summary of the options that it supports. It should be considered the definitive list of what options are provided.

Several notable options (e.g., --hostfile, --host, and the scheduling controls) are discussed in the FAQ entries that follow.


127. How do I use the --hostfile option to mpirun?

The --hostfile option to mpirun takes a filename that lists hosts on which to launch MPI processes.

NOTE: The hosts listed in a hostfile have nothing to do with which network interfaces are used for MPI communication. They are only used to specify on which hosts to launch MPI processes.

Hostfiles are simple text files with hosts specified, one per line. Each host can also specify a default and a maximum number of slots to be used on that host (i.e., the number of available processors on that host). Comments are also supported, and blank lines are ignored. For example:

# This is an example hostfile.  Comments begin with #
#
# The following node is a single processor machine:
foo.example.com

# The following node is a dual-processor machine:
bar.example.com slots=2

# The following node is a quad-processor machine, and we absolutely
# want to disallow over-subscribing it:
yow.example.com slots=4 max-slots=4

slots and max-slots are discussed more in this FAQ entry.

Hostfiles work in two different ways:

  • Exclusionary: If a list of hosts to run on has been provided by another source (e.g., by a batch scheduler such as SLURM, PBS/Torque, SGE, etc.), the hosts provided by the hostfile must be in the already-provided host list. If the hostfile-specified nodes are not in the already-provided host list, mpirun will abort without launching anything.

    In this case, hostfiles act like an exclusionary filter -- they limit the scope of where processes will be scheduled from the original list of hosts to produce a final list of hosts.

    For example, say that a scheduler job contains hosts node01 through node04. If you run:

    shell$ cat my_hosts
    node03
    shell$ mpirun -np 1 --hostfile my_hosts hostname
    

    This will run a single copy of hostname on the host node03. However, if you run:

    shell$ cat my_hosts
    node17
    shell$ mpirun -np 1 --hostfile my_hosts hostname
    

    This is an error (because node17 is not listed in my_hosts); mpirun will abort.

    Finally, note that in exclusionary mode, processes will only be executed on the hostfile-specified hosts, even if it causes oversubscription. For example:

    shell$ cat my_hosts
    node03
    shell$ mpirun -np 4 --hostfile my_hosts hostname
    

    This will launch 4 copies of hostname on host node03.

  • Inclusionary: If a list of hosts has not been provided by another source, then the hosts provided by the --hostfile option will be used as the original and final host list.

    In this case, --hostfile acts as an inclusionary agent; all --hostfile-supplied hosts become available for scheduling processes. For example (assume that you are not in a scheduling environment where a list of nodes is being transparently supplied):

    shell$ cat my_hosts
    node01.example.com
    node02.example.com
    node03.example.com
    shell$ mpirun -np 3 --hostfile my_hosts hostname
    

    This will launch a single copy of hostname on the hosts node01.example.com, node02.example.com, and node03.example.com.

Note, too, that --hostfile is essentially a per-application switch. Hence, if you specify multiple applications (as in an MPMD job), --hostfile can be specified multiple times:

shell$ cat hostfile_1
node01.example.com
shell$ cat hostfile_2
node02.example.com
shell$ mpirun -np 1 --hostfile hostfile_1 hostname : -np 1 --hostfile hostfile_2 uptime
node01.example.com
 06:11:45 up 1 day,  2:32,  0 users,  load average: 21.65, 20.85, 19.84

Notice that hostname was launched on node01.example.com and uptime was launched on node02.example.com.


128. How do I use the --host option to mpirun?

The --host option to mpirun takes a comma-delimited list of hosts on which to run. For example:

shell$ mpirun -np 3 --host a,b,c hostname

Will launch one copy of hostname on hosts a, b, and c.

NOTE: The hosts specified by the --host option have nothing to do with which network interfaces are used for MPI communication. They are only used to specify on which hosts to launch MPI processes.

--host works in two different ways:

  • Exclusionary: If a list of hosts to run on has been provided by another source (e.g., by a hostfile or a batch scheduler such as SLURM, PBS/Torque, SGE, etc.), the hosts provided by the --host option must be in the already-provided host list. If the --host-specified nodes are not in the already-provided host list, mpirun will abort without launching anything.

    In this case, the --host option acts like an exclusionary filter -- it limits the scope of where processes will be scheduled from the original list of hosts to produce a final list of hosts.

    For example, say that the hostfile my_hosts contains the hosts node1 through node4. If you run:

    shell$ mpirun -np 1 --hostfile my_hosts --host node3 hostname
    

    This will run a single copy of hostname on the host node3. However, if you run:

    shell$ mpirun -np 1 --hostfile my_hosts --host node17 hostname
    

    This is an error (because node17 is not listed in my_hosts); mpirun will abort.

    Finally, note that in exclusionary mode, processes will only be executed on the --host-specified hosts, even if it causes oversubscription. For example:

    shell$ mpirun -np 4 --host a uptime
    

    This will launch 4 copies of uptime on host a.

  • Inclusionary: If a list of hosts has not been provided by another source, then the hosts provided by the --host option will be used as the original and final host list.

    In this case, --host acts as an inclusionary agent; all --host-supplied hosts become available for scheduling processes. For example (assume that you are not in a scheduling environment where a list of nodes is being transparently supplied):

    shell$ mpirun -np 3 --host a,b,c hostname
    

    This will launch a single copy of hostname on the hosts a, b, and c.

Note, too, that --host is essentially a per-application switch. Hence, if you specify multiple applications (as in an MPMD job), --host can be specified multiple times:

shell$ mpirun -np 1 --host a hostname : -np 1 --host b uptime

This will launch hostname on host a and uptime on host b.


129. How do I control how my processes are scheduled across nodes?

The short version is that if you are not oversubscribing your nodes (i.e., trying to run more processes than you have told Open MPI are available on that node), scheduling is pretty simple and occurs either on a by-slot or by-node round robin schedule. If you're oversubscribing, the issue gets much more complicated -- keep reading.

The more complete answer is: Open MPI schedules processes to nodes by asking two questions from each application on the mpirun command line:

  • How many processes should be launched?
  • Where should those processes be launched?

The "how many" question is directly answered with the -np switch to mpirun. The "where" question is a little more complicated, and depends on three factors:

  • The final node list (e.g., after --host exclusionary or inclusionary processing)
  • The scheduling policy (which applies to all applications in a single job)
  • The default and maximum number of slots on each host

As briefly mentioned in this FAQ entry, slots are Open MPI's representation of how many processors are available on a given host.

The default number of slots on any machine, if not explicitly specified, is 1 (e.g., if a host is listed in a hostfile but has no corresponding "slots" keyword). Schedulers (such as SLURM, PBS/Torque, SGE, etc.) automatically provide an accurate default slot count.

Max slot counts, however, are rarely specified by schedulers. The max slot count for each node will default to "infinite" if it is not provided (meaning that Open MPI will oversubscribe the node if you ask it to -- see more on oversubscribing in this FAQ entry).

Open MPI currently supports two scheduling policies: by slot and by node:

  • By slot: This is the default scheduling policy, but can also be explicitly requested by using either the --byslot option to mpirun or by setting the MCA parameter rmaps_base_schedule_policy to the string "slot".

    In this mode, Open MPI will schedule processes on a node until all of its default slots are exhausted before proceeding to the next node. In MPI terms, this means that Open MPI tries to maximize the number of adjacent ranks in MPI_COMM_WORLD on the same host without oversubscribing that host.

    For example:

    shell$ cat my-hosts
    node0 slots=2 max_slots=20
    node1 slots=2 max_slots=20
    shell$ mpirun --hostfile my-hosts -np 8 --byslot hello | sort
    Hello World I am rank 0 of 8 running on node0
    Hello World I am rank 1 of 8 running on node0
    Hello World I am rank 2 of 8 running on node1
    Hello World I am rank 3 of 8 running on node1
    Hello World I am rank 4 of 8 running on node0
    Hello World I am rank 5 of 8 running on node0
    Hello World I am rank 6 of 8 running on node1
    Hello World I am rank 7 of 8 running on node1
    

  • By node: This policy can be requested either by using the --bynode option to mpirun or by setting the MCA parameter rmaps_base_schedule_policy to the string "node".

    In this mode, Open MPI will schedule a single process on each node in a round-robin fashion (looping back to the beginning of the node list as necessary) until all processes have been scheduled. Nodes are skipped once their default slot counts are exhausted.

    For example:

    shell$ cat my-hosts
    node0 slots=2 max_slots=20
    node1 slots=2 max_slots=20
    shell$ mpirun --hostfile my-hosts -np 8 --bynode hello | sort
    Hello World I am rank 0 of 8 running on node0
    Hello World I am rank 1 of 8 running on node1
    Hello World I am rank 2 of 8 running on node0
    Hello World I am rank 3 of 8 running on node1
    Hello World I am rank 4 of 8 running on node0
    Hello World I am rank 5 of 8 running on node1
    Hello World I am rank 6 of 8 running on node0
    Hello World I am rank 7 of 8 running on node1
    

In both policies, if the default slot count is exhausted on all nodes while there are still processes to be scheduled, Open MPI will loop through the list of nodes again and try to schedule one more process to each node until all processes are scheduled. Nodes are skipped in this process if their maximum slot count is exhausted. If the maximum slot count is exhausted on all nodes while there are still processes to be scheduled, Open MPI will abort without launching any processes.
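
For reference, either policy can also be selected with the rmaps_base_schedule_policy MCA parameter mentioned above instead of the --byslot / --bynode command-line flags. A minimal sketch, reusing the my-hosts file and hello program from the examples above:

shell$ mpirun --mca rmaps_base_schedule_policy node --hostfile my-hosts -np 8 hello | sort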

NOTE: This is the scheduling policy in Open MPI because of a long historical precedent in LAM/MPI. However, the scheduling of processes to processors is a component in the RMAPS framework in Open MPI; it can be changed. If you don't like how this scheduling occurs, please let us know.


130. I'm not using a hostfile. How are slots calculated?

If you are using a supported resource manager, Open MPI will get the slot information directly from that entity. If you are using the --host parameter to mpirun, be aware that each instance of a hostname bumps up the internal slot count by one. For example:

shell$ mpirun --host node0,node0,node0,node0 ....

This tells Open MPI that host "node0" has a slot count of 4. This is very different than, for example:

shell$ mpirun -np 4 --host node0 a.out

This tells Open MPI that host "node0" has a slot count of 1 but you are running 4 processes on it. Specifically, Open MPI assumes that you are oversubscribing the node.


131. Can I run multiple parallel processes on a uniprocessor machine?

Yes.

But be very careful to ensure that Open MPI knows that you are oversubscribing your node! If Open MPI is unaware that you are oversubscribing a node, severe performance degradation can result.

See this FAQ entry for more details on oversubscription.


132. Can I oversubscribe nodes (run more processes than processors)?

Yes.

However, it is critical that Open MPI knows that you are oversubscribing the node, or severe performance degradation can result.

The short explanation is: never specify a number of slots that is greater than the number of available processors. For example, if you want to run 4 processes on a uniprocessor, then indicate that you have only 1 slot but want to run 4 processes:

shell$ cat my-hostfile
localhost
shell$ mpirun -np 4 --hostfile my-hostfile a.out

Specifically: do NOT have a hostfile that contains "slots = 4" (because there is only one available processor).

Here's the full explanation:

Open MPI basically runs its message passing progression engine in two modes: aggressive and degraded. In degraded mode (used when Open MPI thinks a node is oversubscribed), each MPI process frequently yields the processor to its peers while waiting for message passing progress, allowing all processes on the node to make progress. In aggressive mode (used when Open MPI thinks there are enough processors for all of its processes), MPI processes never voluntarily give up the processor while waiting, which gives the best latency but performs very badly if the node is actually oversubscribed.

For example, on a uniprocessor node:

shell$ cat my-hostfile
localhost slots=4
shell$ mpirun -np 4 --hostfile my-hostfile a.out

This would cause all 4 MPI processes to run in aggressive mode because Open MPI thinks that there are 4 available processors to use. This is actually a lie (there is only 1 processor -- not 4), and can cause extremely bad performance.


133. Can I force Aggressive or Degraded performance modes?

Yes.

The MCA parameter mpi_yield_when_idle controls whether an MPI process runs in Aggressive or Degraded performance mode. Setting it to zero forces Aggressive mode; any other value forces Degraded mode (see this FAQ entry to see how to set MCA parameters).

Note that this value only affects the behavior of MPI processes when they are blocking in MPI library calls. It does not affect behavior of non-MPI processes, nor does it affect the behavior of a process that is not inside an MPI library call.

Open MPI normally sets this parameter automatically (see this FAQ entry for details). Users are cautioned against setting this parameter unless they are really, absolutely, positively sure of what they are doing.
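
For example, to explicitly force Degraded mode for a single run (bearing in mind the caution above), a minimal sketch is:

shell$ mpirun --mca mpi_yield_when_idle 1 -np 4 a.out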


134. How do I run with the TotalView parallel debugger?

Generally, you can run Open MPI processes with TotalView as follows:

shell$ mpirun --debug ...mpirun arguments...

Assuming that TotalView is the first supported parallel debugger in your path, Open MPI will automatically invoke the correct underlying command to run your MPI process in the TotalView debugger. Be sure to see this FAQ entry for details about what versions of Open MPI and TotalView are compatible.

For reference, this underlying command form is the following:

shell$ totalview mpirun -a ...mpirun arguments...

So if you wanted to run a 4-process MPI job of your a.out executable, it would look like this:

shell$ totalview mpirun -a -np 4 a.out

Alternatively, Open MPI's mpirun offers the "-tv" convenience option which does the same thing as TotalView's "-a" syntax. For example:

shell$ mpirun -tv -np 4 a.out

Note that by default, TotalView will stop deep in the machine code of mpirun itself, which is not what most users want. It is possible to get TotalView to recognize that mpirun is simply a "starter" program and should be (effectively) ignored. Specifically, TotalView can be configured to skip mpirun (and mpiexec and orterun) and jump right into your MPI application. This can be accomplished by placing some startup instructions in a TotalView-specific file named $HOME/.tvdrc.

Open MPI includes a sample TotalView startup file that performs this function (see etc/openmpi-totalview.tcl in Open MPI distribution tarballs; it is also installed, by default, to $prefix/etc/openmpi-totalview.tcl in the Open MPI installation). This file can be either copied to $HOME/.tvdrc or sourced from the $HOME/.tvdrc file. For example, placing the following line in your $HOME/.tvdrc (replacing /path/to/openmpi/installation with the proper directory name, of course) will use the Open MPI-provided startup file:

shell$ source /path/to/openmpi/installation/etc/openmpi-totalview.tcl


135. How do I run with the DDT parallel debugger?

If you've used DDT at least once before (to use the configuration wizard to setup support for Open MPI), you can start it on the command line with:

shell$ mpirun --debug ...mpirun arguments...

Assuming that you are using Open MPI v1.2.4 or later, and assuming that DDT is the first supported parallel debugger in your path, Open MPI will automatically invoke the correct underlying command to run your MPI process in the DDT debugger. For reference (or if you are using an earlier version of Open MPI), this underlying command form is the following:

shell$ ddt -n {nprocs} -start {exe-name}

Note that passing arbitrary arguments to Open MPI's mpirun is not supported with the DDT debugger.

You can also attach to already-running processes with either of the following two syntaxes:

shell$ ddt -attach {hostname1:pid} [{hostname2:pid} ...] {exec-name}
# Or
shell$ ddt -attach-file {filename of newline separated hostname:pid pairs} {exec-name}

DDT can even be configured to operate with cluster/resource schedulers such that it can run on a local workstation, submit your MPI job via the scheduler, and then attach to the MPI job when it starts.

See the official DDT documentation for more details.


136. What launchers are available?

The documentation contained in the Open MPI tarball will have the most up-to-date information, but as of v1.0, Open MPI supports:

  • BProc versions 3 and 4 (discontinued starting with OMPI v1.3)
  • Sun Grid Engine (SGE), and the open source Grid Engine (support first introduced in Open MPI v1.2)
  • PBS Pro, Torque, and Open PBS
  • LoadLeveler scheduler (full support since 1.1.1)
  • rsh / ssh
  • SLURM
  • LSF
  • XGrid (discontinued starting with OMPI 1.4)
  • Yod (Cray XT-3 and XT-4)


137. How do I specify to the rsh launcher to use rsh or ssh?

See this FAQ entry.


138. How do I run with the SLURM and PBS/Torque launchers?

If support for these systems is included in your Open MPI installation (which you can check with the ompi_info command -- look for components named "slurm" and/or "tm"), Open MPI will automatically detect when it is running inside such jobs and will just "do the Right Thing."
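
For example, one quick way to check is to grep ompi_info's component listing for the relevant component names (a sketch only; the exact lines shown depend on how your copy of Open MPI was built):

shell$ ompi_info | grep slurm
shell$ ompi_info | grep " tm "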

See this FAQ entry for a description of how to run jobs in SLURM; see this FAQ entry for a description of how to run jobs in PBS/Torque.


139. Can I suspend and resume my job?

A new feature was added into Open MPI 1.3.1 that supports suspend/resume of an MPI job. To suspend the job, you send a SIGTSTP (not SIGSTOP) signal to mpirun. mpirun will catch this signal and forward it to the a.outs as a SIGSTOP signal. To resume the job, you send a SIGCONT signal to mpirun which will be caught and forwarded to the a.outs.

By default, this feature is not enabled. This means that both the SIGTSTP and SIGCONT signals will simply be consumed by the mpirun process. To have them forwarded, you have to run the job with --mca orte_forward_job_control 1. Here is an example on Solaris.

shell$ mpirun -mca orte_forward_job_control 1 -np 2 a.out

In another window, we suspend and continue the job.

shell$ prstat -p 15301,15303,15305
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 15305 rolfv     158M   22M cpu1     0    0   0:00:21 5.9% a.out/1
 15303 rolfv     158M   22M cpu2     0    0   0:00:21 5.9% a.out/1
 15301 rolfv    8128K 5144K sleep   59    0   0:00:00 0.0% orterun/1
shell$ kill -TSTP 15301
shell$ prstat -p 15301,15303,15305
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 15303 rolfv     158M   22M stop    30    0   0:01:44  21% a.out/1
 15305 rolfv     158M   22M stop    20    0   0:01:44  21% a.out/1
 15301 rolfv    8128K 5144K sleep   59    0   0:00:00 0.0% orterun/1
shell$ prstat -p 15301,15303,15305
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 15305 rolfv     158M   22M cpu1     0    0   0:02:06  17% a.out/1
 15303 rolfv     158M   22M cpu3     0    0   0:02:06  17% a.out/1
 15301 rolfv    8128K 5144K sleep   59    0   0:00:00 0.0% orterun/1
shell$ kill -CONT 15301
shell$ prstat -p 15301,15303,15305

Note that all this does is stop the a.outs. It does not, for example, free any pinned memory when the job is in the suspended state.

To get this to work under the SGE environment, you have to change the suspend_method entry in the queue. It has to be set to SIGTSTP. Here is an example of what a queue should look like.

shell$ qconf -sq all.q
qname                 all.q
[...snip...]
starter_method        NONE
suspend_method        SIGTSTP
resume_method         NONE 

Note that if you need to suspend other types of jobs with SIGSTOP (instead of SIGTSTP) in this queue then you need to provide a script that can implement the correct signals for each job type.


140. How do I run with LoadLeveler?

If support for LoadLeveler is included in your Open MPI installation (which you can check with the ompi_info command -- look for components named "loadleveler"), Open MPI will automatically detect when it is running inside such jobs and will just "do the Right Thing."

Specifically, if you execute an mpirun command in a LoadLeveler job, it will automatically determine what nodes and how many slots on each node have been allocated to the current job. There is no need to specify what nodes to run on. Open MPI will then attempt to launch the job using whatever resource is available (on Linux rsh/ssh is used).

For example:

shell$ cat job
#@ output  = job.out
#@ error   = job.err
#@ job_type = parallel
#@ node = 3
#@ tasks_per_node = 4
mpirun a.out
shell$ llsubmit job

This will run 4 MPI processes per node on the 3 nodes which were allocated by LoadLeveler for this job.

For users of the Open MPI 1.1 series: version 1.1.0 has a problem that prevents Open MPI from determining what nodes are available to it if the job has more than 128 tasks. In the 1.1.x series, starting with version 1.1.1, this can be worked around by passing "-mca ras_loadleveler_priority 110" to mpirun. Version 1.2 and above work without any additional flags.
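
For example, with an affected 1.1.x installation, the workaround described above would be applied on the mpirun command line like this (sketch only):

shell$ mpirun -mca ras_loadleveler_priority 110 a.out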


141. How do I load libmpi at runtime?

If you want to load the shared library libmpi explicitly at runtime, either by using dlopen() from C/C++ or something like the ctypes package from Python, some extra care is required. The default configuration of Open MPI uses dlopen() internally to load its support components. These components rely on symbols available in libmpi. In order to make the symbols in libmpi available to the components loaded by Open MPI at runtime, libmpi must be loaded with the RTLD_GLOBAL option.

In C/C++, this option is specified as the second parameter to dlopen(). When using ctypes with Python, this can be done with the second (optional) parameter to CDLL(). For example (shown below on Mac OS X, where Open MPI's shared library name ends in ".dylib"; other operating systems use other suffixes, such as ".so"):

  from ctypes import *
  mpi = CDLL('libmpi.0.dylib', RTLD_GLOBAL)
  f = pythonapi.Py_GetArgcArgv
  argc = c_int()
  argv = POINTER(c_char_p)()
  f(byref(argc), byref(argv))
  mpi.MPI_Init(byref(argc), byref(argv))
  mpi.MPI_Finalize()

Other scripting languages should have similar options when dynamically loading shared libraries.


142. What MPI environmental variables exist?

Beginning with the 1.3 release, Open MPI provides the following environmental variables that will be defined on every MPI process:

  • OMPI_COMM_WORLD_SIZE - the number of processes in this process' MPI_COMM_WORLD
  • OMPI_COMM_WORLD_RANK - the MPI rank of this process
  • OMPI_COMM_WORLD_LOCAL_RANK - the relative rank of this process on this node within its job. For example, if four processes in a job share a node, they will each be given a local rank ranging from 0 to 3.
  • OMPI_UNIVERSE_SIZE - the number of process slots allocated to this job. Note that this may be different than the number of processes in the job.
  • OMPI_COMM_WORLD_LOCAL_SIZE - the number of ranks from this job that are running on this node.
  • OMPI_COMM_WORLD_NODE_RANK - the relative rank of this process on this node looking across ALL jobs.

Open MPI guarantees that these variables will remain stable throughout future releases.
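
As a quick illustration, a small wrapper script can print a few of these variables for each process. This is only a sketch; the script name and the hosts "a" and "b" are placeholders:

shell$ cat print_rank.sh
#!/bin/sh
# Report where this process landed and which ranks it was assigned
echo "host=$(hostname) rank=$OMPI_COMM_WORLD_RANK of $OMPI_COMM_WORLD_SIZE (local rank $OMPI_COMM_WORLD_LOCAL_RANK)"
shell$ chmod +x print_rank.sh
shell$ mpirun -np 2 --host a,b ./print_rank.sh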


143. How do I get my MPI job to wireup its MPI connections right away?

By default, Open MPI opens MPI connections between processes in a "lazy" fashion - i.e., the connections are only opened when the MPI process actually attempts to send a message to another process for the first time. This is done since (a) Open MPI has no idea what connections an application process will really use, and (b) creating the connections takes time. Once the connection is established, it remains "connected" until one of the two connected processes terminates, so the creation time cost is paid only once.

Applications that require a fully connected topology, however, can see improved startup time if they automatically "pre-connect" all their processes during MPI_Init. Accordingly, Open MPI provides the MCA parameter "mpi_preconnect_mpi" which directs Open MPI to establish a "mostly" connected topology during MPI_Init (note that this MCA parameter used to be named "mpi_preconnect_all" prior to Open MPI v1.5; in v1.5, it was deprecated and replaced with "mpi_preconnect_mpi"). This is accomplished in a somewhat scalable fashion to help minimize startup time.

Users can set this parameter in two ways:

  • in the environment as OMPI_MCA_mpi_preconnect_mpi=1
  • on the cmd line as mpirun -mca mpi_preconnect_mpi 1

See this FAQ entry for more details on how to set MCA parameters.


144. What kind of CUDA support exists in Open MPI?

Since Open MPI 1.7.0, there is support for sending and receiving CUDA device memory directly. Prior to this support, the programmer had to stage the data in host memory before making the MPI calls. Now, the Open MPI library will automatically detect that the pointer being passed in is a CUDA device memory pointer and do the right thing. This is referred to as CUDA-aware support.

The use of device pointers is supported in all of the send and receive APIs as well as most of the collective APIs. Neither the collective reduction APIs nor the one-sided APIs are supported. Here is the list of APIs that currently support sending and receiving CUDA device memory.

MPI_Send, MPI_Bsend, MPI_Ssend, MPI_Rsend, MPI_Isend, MPI_Ibsend, MPI_Issend, MPI_Irsend, MPI_Send_init, MPI_Bsend_init, MPI_Ssend_init, MPI_Rsend_init, MPI_Recv, MPI_Irecv, MPI_Recv_init, MPI_Sendrecv, MPI_Bcast, MPI_Gather, MPI_Gatherv, MPI_Allgather, MPI_Allgatherv, MPI_Alltoall, MPI_Alltoallv, MPI_Scatter, MPI_Scatterv

Open MPI depends on various features of CUDA 4.0, so one needs at least the CUDA 4.0 driver and toolkit. The key new feature is Unified Virtual Addressing (UVA), which ensures that all pointers within a program have unique addresses. There is also a new API that allows one to determine whether a pointer is a CUDA device pointer or a host memory pointer; this API is used by the library to decide what needs to be done with each buffer. In addition, CUDA 4.1 provides the ability to register host memory with the CUDA driver, which can improve performance, and adds CUDA IPC support for fast communication between GPUs on the same node.

Note that derived datatypes, both contiguous and non-contiguous, are supported. However, the non-contiguous datatypes currently have high overhead because of the many calls to cuMemcpy to copy all the pieces of the buffer into the intermediate buffer.

CUDA-aware support is only available in the sm, smcuda, tcp, and openib BTLs. The smcuda BTL is an optimized version of the sm BTL that takes advantage of the CUDA IPC support for fast GPU transfers. Most of the other optimizations are built into the openib BTL.

There is no CUDA-aware support in any of the MTLs.
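
If you want to be explicit about which transports a CUDA-aware run may use, you can restrict the BTL list to the CUDA-aware components named above plus the self BTL. A minimal sketch (adjust the list to the BTLs that actually exist on your system):

shell$ mpirun -np 2 --mca btl smcuda,openib,self a.out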

Open MPI 1.7.0, Open MPI 1.7.1, Open MPI 1.7.2

  • Basic GPU direct support.
  • Support for CUDA IPC between GPUs on a node, but an error would occur if the GPUs did not support CUDA IPC.

Open MPI 1.7.3 New Features

  • Support for asynchronous copies of larger GPU buffers over the openib BTL.
  • Dynamically loads the libcuda.so library so you can configure with CUDA-aware support, but run on machines that do not have CUDA installed.

Open MPI 1.7.4 New Features

  • Removed a synchronization point in CUDA IPC when running with CUDA 6.0 or later.
  • Utilizes GPU Direct RDMA if it is available. Requires CUDA 6.0 or later.
  • Dynamically enables CUDA IPC support between GPUs and falls back to copying through host memory if it is not available.

For best results, it is recommended that you use Open MPI 1.7.3 or later.

Additional Information about CUDA-aware support

Here are some relevant MCA parameters to extract extra information if you are having issues. For Open MPI 1.7.3 and later, you can see whether the library was built with CUDA-aware support.

 > ./ompi_info --parsable -l 9 -all | grep mpi_built_with_cuda_support:value
 mca:mpi:base:param:mpi_built_with_cuda_support:value:true

To get some extra information, there are some verbose flags. The opal_cuda_verbose parameter has only one level of verbosity. (Works on all versions)

--mca opal_cuda_verbose 10

This mpi_common_cuda_verbose flag provides additional information about CUDA-aware related activities. This can be set to a variety of different values. There is really no need to use these unless you have strange problems. (Works on all versions)

--mca mpi_common_cuda_verbose 10
--mca mpi_common_cuda_verbose 20
--mca mpi_common_cuda_verbose 100

There are three new MCA parameters introduced with Open MPI 1.7.4 related to the use of CUDA IPC. By default, CUDA IPC is used where possible. But the user can now turn it off if they want.

--mca btl_smcuda_use_cuda_ipc 0

In addition, it is assumed that CUDA IPC is possible when running on the same GPU and this is typically true. However, there is the ability to turn it off.

--mca btl_smcuda_use_cuda_ipc_same_gpu 0

Lastly, to get some insight into whether CUDA IPC is being used, you can turn on some verbosity that shows whether CUDA IPC gets enabled between two GPUs.

--mca btl_smcuda_cuda_ipc_verbose 100

GPU Direct RDMA Information

Open MPI 1.7.4 has added some support to take advantage of GPU Direct RDMA on Mellanox cards. However, the supporting driver has not been released yet, so these features cannot be used yet. Note that to get GPU Direct RDMA support, you also need to configure your Open MPI library with CUDA 6.0.

To see if you have GPU Direct RDMA compiled into your library, you can check like this:

> ompi_info --all | grep btl_openib_have_cuda_gdr
   MCA btl: informational "btl_openib_have_cuda_gdr" (current value: "true", data source: default, level: 4 tuner/basic, type: bool)

To see if your OFED stack has GPU Direct RDMA support, you can check like this.

> ompi_info -all | grep btl_openib_have_driver_gdr
   MCA btl: informational "btl_openib_have_driver_gdr" (current value: "true", data source: default, level: 4 tuner/basic, type: bool)

To run with GPU Direct RDMA support, you have to enable it as it is off by default.

--mca btl_openib_want_cuda_gdr 1

GPU Direct RDMA Implementation Details

With GPU Direct RDMA support selected, the eager protocol is unused. This is done to avoid the penalty of copying unexpected GPU messages into host memory. Instead, a rendezvous protocol is used where the sender and receiver both register their GPU buffers and make use of GPU Direct RDMA support to transfer the data. This is done for all messages that are less than 30,000 bytes in size. For larger messages, the openib BTL switches to using pipelined buffers as that has better performance at larger messages. So, by default, with GPU Direct RDMA enabled, the underlying protocol usage is like this:

0      < message size < 30,000      GPU Direct RDMA
30,000 < message size < infinity    Asynchronous copies through host memory

You can adjust the point where we switch to asynchronous copies with the --mca btl_openib_cuda_rdma_limit value. For example, if you want to increase the switchover point to 100,000 bytes, then set it like this.

--mca btl_openib_cuda_rdma_limit 100000

By default, if GPU Direct RDMA is available, it is used for messages from 1 byte up to the btl_openib_cuda_rdma_limit value. However, you can use the eager protocol for the smallest messages by setting the --mca btl_openib_cuda_eager_limit value. Note: the btl_openib_cuda_eager_limit value includes some overhead, so you cannot just set it to the payload size; it has to be set to the payload plus the extra upper-layer bytes. Currently, in Open MPI 1.7.4, this overhead is 44 bytes, so that is the minimum value. In the list below we are referring only to the size of the payload.
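
For example, if you wanted the eager protocol to be used for payloads of up to 1,000 bytes, you would add the 44 bytes of overhead described above and set the parameter like this (sketch only):

--mca btl_openib_cuda_eager_limit 1044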

The following shows how the various runtime parameters affect which protocol is used with GPU Direct RDMA:

  • 0 < message size < btl_openib_cuda_eager_limit (default=0): eager protocol (not used by default)
  • btl_openib_cuda_eager_limit (default=0) < message size < btl_openib_cuda_rdma_limit (default=30,000): rendezvous protocol utilizing GPU Direct RDMA
  • btl_openib_cuda_rdma_limit (default=30,000) < message size < infinity: pipelined transfers of size 128K through host memory

Performance Note: Registering GPU memory with the Mellanox driver is expensive, so it is best to reuse the same GPU buffer for communication.

NUMA Node Issues: When running on a node that has multiple GPUs, you may want to select the GPU that is closest to the process you are running on. One way to do this is to make use of the hwloc library. Following is a code snippet that can be used in your application to select a nearby GPU: it determines which CPU the process is running on and then looks for the closest GPU. Note that there could be multiple GPUs that are the same distance away. This depends on having hwloc available somewhere on your system.

/**
 * Test program to show the use of hwloc to select the GPU closest to the CPU
 * that the MPI program is running on.  Note that this works even without
 * any libpciaccess or libpci support, as it keys off the NVIDIA vendor ID.
 * There may be other ways to implement this but this is one way.
 * January 10, 2014
 */
#include <assert.h>
#include <stdio.h>
#include "cuda.h"
#include "mpi.h"
#include "hwloc.h"

#define ABORT_ON_ERROR(func)                          \
  { CUresult res;                                     \
    res = func;                                       \
    if (CUDA_SUCCESS != res) {                        \
        printf("%s returned error=%d\n", #func, res); \
        abort();                                      \
    }                                                 \
  }                             
static hwloc_topology_t topology = NULL;
static int gpuIndex = 0;
static hwloc_obj_t gpus[16] = {0};

/**
 * This function searches for all the GPUs that are hanging off a NUMA
 * node.  It walks through each of the PCI devices and looks for ones
 * with the NVIDIA vendor ID.  It then stores them into an array.
 * Note that there can be more than one GPU on the NUMA node.
 */

static void find_gpus(hwloc_topology_t topology, hwloc_obj_t parent, hwloc_obj_t child) {
    hwloc_obj_t pcidev;
    pcidev = hwloc_get_next_child(topology, parent, child);
    if (NULL == pcidev) {
        return;
    } else if (0 != pcidev->arity) {
        /* This device has children so need to look recursively at them */
        find_gpus(topology, pcidev, NULL);
        find_gpus(topology, parent, pcidev);
    } else {
        if (pcidev->attr->pcidev.vendor_id == 0x10de) {
            gpus[gpuIndex++] = pcidev;
        }
        find_gpus(topology, parent, pcidev);
    }
}
int main(int argc, char *argv[])
{
    int rank, retval, length;
    char procname[MPI_MAX_PROCESSOR_NAME+1];
    const unsigned long flags = HWLOC_TOPOLOGY_FLAG_IO_DEVICES | HWLOC_TOPOLOGY_FLAG_IO_BRIDGES;
    hwloc_cpuset_t newset;
    hwloc_obj_t node, bridge;
    char pciBusId[16];
    CUdevice dev;
    char devName[256];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (MPI_SUCCESS != MPI_Get_processor_name(procname, &length)) {
        strcpy(procname, "unknown");
    }

    /* Now decide which GPU to pick.  This requires hwloc to work properly.
     * We first see which CPU we are bound to, then try and find a GPU nearby.
     */
    retval = hwloc_topology_init(&topology);
    assert(retval == 0);
    retval = hwloc_topology_set_flags(topology, flags);
    assert(retval == 0);
    retval = hwloc_topology_load(topology);
    assert(retval == 0);
    newset = hwloc_bitmap_alloc();
    retval = hwloc_get_last_cpu_location(topology, newset, 0);
    assert(retval == 0);

    /* Get the object that contains the cpuset */
    node = hwloc_get_first_largest_obj_inside_cpuset(topology, newset);

    /* Climb up from that object until we find the HWLOC_OBJ_NODE */
    while (node->type != HWLOC_OBJ_NODE) {
        node = node->parent;
    }

    /* Now look for the HWLOC_OBJ_BRIDGE.  All PCI busses hanging off the
     * node will have one of these */
    bridge = hwloc_get_next_child(topology, node, NULL);
    while (bridge->type != HWLOC_OBJ_BRIDGE) {
        bridge = hwloc_get_next_child(topology, node, bridge);
    }

    /* Now find all the GPUs on this NUMA node and put them into an array */
    find_gpus(topology, bridge, NULL);

    ABORT_ON_ERROR(cuInit(0));
    /* Now select the first GPU that we find */
    if (gpus[0] == 0) {
        printf("No GPU found\n");
        exit(1);
    } else {
        sprintf(pciBusId, "%.2x:%.2x:%.2x.%x", gpus[0]->attr->pcidev.domain, gpus[0]->attr->pcidev.bus,
        gpus[0]->attr->pcidev.dev, gpus[0]->attr->pcidev.func);
        ABORT_ON_ERROR(cuDeviceGetByPCIBusId(&dev, pciBusId));
        ABORT_ON_ERROR(cuDeviceGetName(devName, 256, dev));
        printf("rank=%d (%s): Selected GPU=%s, name=%s\n", rank, procname, pciBusId, devName);
    }

    MPI_Finalize();
    return 0;
}

See this FAQ entry for details on how to configure the CUDA support into the library.
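
Until you get to that entry, note that a typical configure invocation for CUDA-aware support looks roughly like the following; the CUDA installation path is an assumption and should be adjusted for your system:

shell$ ./configure --with-cuda=/usr/local/cuda ...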


145. Open MPI tells me that it fails to load components with a "file not found" error -- but the file is there! Why does it say this?

Open MPI loads a lot of plugins at run time. It opens its plugins via the excellent GNU Libtool libltdl portability library. If a plugin fails to load, Open MPI queries libltdl to get a printable string indicating why the plugin failed to load.

Unfortunately, there is a well-known bug in libltdl that may cause a "file not found" error message to be displayed, even when the file is found. The "file not found" error usually masks the real, underlying cause of the problem. For example:

mca: base: component_find: unable to open /opt/openmpi/mca_ras_dash_host: file not found (ignored)

Note that Open MPI put in a libltdl workaround starting with version 1.5. This workaround should print the real reason the plugin failed to load instead of the erroneous "file not found" message.

There are two common underlying causes why a plugin fails to load:

  1. The plugin is for a different version of Open MPI. This FAQ entry has more information about this case.
  2. The plugin cannot find shared libraries that it requires. For example, if the openib plugin fails to load, ensure that libibverbs.so can be found by the linker at run time (e.g., check the value of your LD_LIBRARY_PATH environment variable). The same is true for any other plugin that has shared-library dependencies (e.g., the mx BTL and MTL plugins need to be able to find the libmyriexpress.so shared library at run time); one way to check is sketched below.
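
A quick way to perform that check is to run ldd on the plugin file itself and look for unresolved dependencies. A minimal sketch (the installation prefix below is a placeholder for wherever Open MPI is installed):

shell$ ldd /opt/openmpi/lib/openmpi/mca_btl_openib.so | grep "not found"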


146. I see strange messages about missing symbols in my application; what do these mean?

Open MPI loads a lot of plugins at run time. It opens its plugins via the excellent GNU Libtool libltdl portability library. Sometimes a plugin can fail to load because it can't resolve all the symbols that it needs. There are a few reasons why this can happen.

  • The plugin is for a different version of Open MPI. See this FAQ entry for an explanation of how Open MPI might try to open the "wrong" plugins.
  • An application is trying to manually dynamically open libmpi in a private symbol space. For example, if an application is not linked against libmpi, but rather calls something like this:

    /* This is a Linux example -- the issue is similar/the same on other
       operating systems */
    handle = dlopen("libmpi.so", RTLD_NOW | RTLD_LOCAL);
    

    This is due to some deep run time linker voodoo -- it is discussed towards the end of this post to the Open MPI developer's list. Briefly, the issue is this:

    1. The dynamic library libmpi is opened in a "local" symbol space.
    2. MPI_INIT is invoked, which tries to open Open MPI's plugins.
    3. Open MPI's plugins rely on symbols in libmpi (and other Open MPI support libraries); these symbols must be resolved when the plugin is loaded.
    4. However, since libmpi was opened in a "local" symbol space, its symbols are not available to the plugins that it opens.
    5. Hence, the plugin fails to load because it can't resolve all of its symbols, and displays a warning message to that effect.

    The ultimate fix for this issue is a bit bigger than Open MPI, unfortunately -- it's a POSIX issue (as briefly described in the devel posting, above).

    However, there are several common workarounds:

    • Dynamically open libmpi in a public / global symbol scope -- not a private / local scope. This will enable libmpi's symbols to be available for resolution when Open MPI dynamically opens its plugins.
    • If libmpi is opened as part of some underlying framework where it is not possible to change the private / local scope to a public / global scope, then dynamically open libmpi in a public / global scope before invoking the underlying framework. This sounds a little gross (and it is), but at least the run-time linker is smart enough to not load libmpi twice -- and it does keep libmpi in a public scope.
    • Use the --disable-dlopen or --disable-mca-dso options to Open MPI's configure script (see this FAQ entry for more details on these options). These options slurp all of Open MPI's plugins up into libmpi -- meaning that the plugins physically reside in libmpi and will not be dynamically opened at run time.
    • Build Open MPI as a static library by configuring Open MPI with --disable-shared and --enable-static. This has the same effect as --disable-dlopen, but it also makes libmpi.a (as opposed to a shared library).


147. What is mca_pml_teg.so? Why am I getting warnings about not finding the mca_ptl_base_modules_initialized symbol from it?

You may wonder why you see this warning message (put here verbatim so that it becomes web-searchable):

mca_pml_teg.so:undefined symbol:mca_ptl_base_modules_initialized

This happens when you upgrade to Open MPI v1.1 (or later) over an old installation of Open MPI v1.0.x without previously uninstalling v1.0.x. There are fairly uninteresting reasons why this problem occurs; the simplest, safest solution is to uninstall version 1.0.x and then re-install your newer version. For example:

shell# cd /path/to/openmpi-1.0
shell# make uninstall
[... lots of output ...]
shell# cd /path/to/openmpi-1.1
shell# make install

The above example shows changing into the Open MPI 1.1 directory to re-install, but the same concept applies to any version after Open MPI version 1.0.x.

Note that this problem is fairly specific to installing / upgrading Open MPI from the source tarball. Pre-packaged installers (e.g., RPM) typically do not incur this problem.


148. Can I build shared libraries on AIX with the IBM XL compilers?

Short answer: in older versions of Open MPI, maybe.

Add "LDFLAGS=-Wl,-brtl" to your configure command line:

shell$ ./configure LDFLAGS=-Wl,-brtl ...

This enables "runtimelinking", which will make GNU Libtool name the libraries properly (i.e., *.so). More importantly, runtimelinking will cause the runtime linker to behave more or less like an ELF linker would (with respect to symbol resolution).

Future versions of OMPI may not require this flag (and "runtimelinking" on AIX).

NOTE: As of OMPI v1.2, AIX is no longer supported.


149. Why am I getting a seg fault in libopal?

It is likely that you did not actually get a segv in libopal; rather, you are probably seeing a message like this (with OMPI v1.0 and v1.1):

[0] func:/opt/ompi/lib/libopal.so.0 [0x2a958de8a7]

or something like this (OMPI v1.2 and beyond; Linux output shown below -- looks slightly different on other OS's):

[0] func:/opt/ompi/lib/libopal.so.0(opal_backtrace_print+0x2b) [0x2a958de8a7]

This is actually the function that is printing out the stack trace message; it is not the function that caused the segv itself. The function that caused the problem will be a few below this. Future versions of OMPI will simply not display this libopal function in the segv reporting to avoid confusion.

Let's provide a concrete example. Take the following trivial MPI program that is guaranteed to cause a seg fault in MPI_COMM_WORLD rank 1:

shell$ cat segv.c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 1) {
        char *d = 0;
        /* This will cause a seg fault */
        *d = 3;
    }

    MPI_Finalize();
    return 0;
}

Running this code, you'll see something similar to the following:

shell$ mpicc segv.c -o segv -g
shell$ mpirun -np 2 --mca btl tcp,self segv
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:(nil)
[0] func:/opt/ompi/lib/libopal.so.0(opal_backtrace_print+0x2b) [0x2a958de8a7]
[1] func:/opt/ompi/lib/libopal.so.0 [0x2a958dd2b7]
[2] func:/lib64/tls/libpthread.so.0 [0x3be410c320]
[3] func:segv(main+0x3c) [0x400894]
[4] func:/lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x3be361c4bb]
[5] func:segv [0x4007ca]
*** End of error message ***

The real error was back up in main, which is #3 on the stack trace. But Open MPI's stack-tracing function (opal_backtrace_print, in this case) is what is displayed as #0, so it's an easy mistake to assume that libopal is the culprit.


150. Why am I getting seg faults / MPI parameter errors when compiling C++ applications with the Intel 9.1 C++ compiler?

Early versions of the Intel 9.1 C++ compiler series had problems with the Open MPI C++ bindings. Even trivial MPI applications that used the C++ MPI bindings could incur process failures (such as segmentation violations) or generate MPI-level errors complaining about invalid parameters.

Intel released a new version of their 9.1 series C++ compiler on October 5, 2006 (build 44) that seems to solve all of these issues. The Open MPI team recommends that all users needing the C++ MPI API upgrade to this version (or later) if possible. Since the problems are with the compiler, there is little that Open MPI can do to work around the issue; upgrading the compiler seems to be the only solution.


151. All my MPI applications segv! Why? (Intel Linux 12.1 compiler)

Users have reported on the Open MPI users mailing list multiple times that when they compile Open MPI with the Intel 12.1 compiler suite, Open MPI tools (such as the wrapper compilers, including mpicc) and MPI applications will seg fault immediately.

As far as we know, this affects both Open MPI v1.4.4 (and later) and v1.5.4 (and later).

Here's one example of a user reporting this to the Open MPI User's list.

The cause of the problem has turned out to be a bug in early versions of the Intel Linux 12.1 compiler series itself. If you upgrade your Intel compiler to the latest version of the Intel 12.1 compiler suite and rebuild Open MPI, the problem will go away.


152. Why can't I attach my parallel debugger (TotalView, DDT, fx2, etc.) to parallel jobs?

As noted in this FAQ entry, Open MPI supports parallel debuggers that utilize the TotalView API for parallel process attaching. However, attaching can sometimes fail if Open MPI is not installed correctly. Symptoms of this failure typically involve having the debugger hang (or crash) when attempting to attach to a parallel MPI application.

Parallel debuggers may rely on Open MPI's mpirun program being compiled without optimization. Open MPI's configure and build process therefore attempts to identify optimization flags and remove them when compiling mpirun, but it does not have knowledge of all optimization flags for all compilers. Hence, if you specify some esoteric optimization flags to Open MPI's configure script, some optimization flags may slip through the process and create an mpirun that cannot be read by TotalView and other parallel debuggers.

If you run into this problem, you can manually build mpirun without optimization flags. Go into the tree where you built Open MPI:

shell$ cd /path/to/openmpi/build/tree
shell$ cd orte/tools/orterun
shell$ make clean
[...output not shown...]
shell$ make all CFLAGS=-g
[...output not shown...]
shell$

This will build mpirun (also known as orterun) with just the "-g" flag. Once this completes, run make install, also from within the orte/tools/orterun directory, and possibly as root, depending on where you installed Open MPI. Using this new orterun (mpirun), your parallel debugger should be able to attach to MPI jobs.

Additionally, a user reported to us that setting some TotalView flags may be helpful with attaching. The user specifically cited the Open MPI v1.3 series compiled with the Intel 11 compilers and TotalView 8.6, but it may also be helpful for other versions, too:

shell$ export with_tv_debug_flags="-O0 -g -fno-inline-functions"


153. When launching large MPI jobs, I see messages like: mca_oob_tcp_peer_complete_connect: connection failed: Connection timed out (110) - retrying

This is a known issue in the Open MPI v1.2 series. Try the following:

  1. If you are using Linux-based systems, increase some of the limits on the node where mpirun is invoked (you must have administrator/root privileges to increase these limits):

    # The default is 128; increase it to 10,000
    shell# echo 10000 > /proc/sys/net/core/somaxconn
    
    # The default is 1,000; increase it to 100,000
    shell# echo 100000 > /proc/sys/net/core/netdev_max_backlog
    

  2. Set the oob_tcp_listen_mode MCA parameter to the string value listen_thread. This enables Open MPI's mpirun to respond much more quickly to incoming TCP connections during job launch, for example:

    shell$ mpirun --mca oob_tcp_listen_mode listen_thread -np 1024 my_mpi_program
    

    See this FAQ entry for more details on how to set MCA parameters.


154. How do I find out what MCA parameters are being seen/used by my job?

As described elsewhere, MCA parameters are the "life's blood" of Open MPI. MCA parameters are used to control both detailed and large-scale behavior of Open MPI and are present throughout the code base.

This raises an important question: since MCA parameters can be set from a file, the environment, the command line, and even internally within Open MPI, how do I actually know what MCA params my job is seeing, and their values?

One way, of course, is to use the ompi_info command, which is documented elsewhere (you can use "man ompi_info", or "ompi_info --help" to get more info on this command). However, this still doesn't fully answer the question since ompi_info isn't an MPI process.

To help relieve this problem, Open MPI (starting with the 1.3 release) provides the MCA parameter mpi_show_mca_params that directs the rank=0 MPI process to report the name of MCA parameters, their current value as seen by that process, and the source that set that value. The parameter can take several values that define which MCA parameters to report:

  • all: report all MCA params. Note that this typically generates a rather long list of parameters since it includes all of the default parameters defined inside Open MPI
  • default: MCA params that are at their default settings - i.e., all MCA params that are at the values set as default within Open MPI
  • file: MCA params that had their value set by a file
  • api: MCA params set using Open MPI's internal APIs, perhaps to override an incompatible set of conditions specified by the user
  • enviro: MCA params that obtained their value either from the local environment or the command line. Open MPI treats environmental and command line parameters as equivalent, so there currently is no way to separate these two sources

These options can be combined in any order by separating them with commas.

Here is an example of the output generated by this parameter:

$ mpirun -mca grpcomm basic -mca mpi_show_mca_params enviro ./hello
ess=env (environment or cmdline)
orte_ess_jobid=1016725505 (environment or cmdline)
orte_ess_vpid=0 (environment or cmdline)
grpcomm=basic (environment or cmdline)
mpi_yield_when_idle=0 (environment or cmdline)
mpi_show_mca_params=enviro (environment or cmdline)
Hello, World, I am 0 of 1

Note that several MCA parameters set by Open MPI itself for internal uses are displayed in addition to the ones actually set by the user.

Since the output from this option can be long, and since it can be helpful to have a more permanent record of the MCA parameters used for a job, a companion MCA parameter mpi_show_mca_params_file is provided. If mpi_show_mca_params is also set, the output listing of MCA parameters will be directed into the specified file instead of being printed to stdout.
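For example, the following hypothetical invocation (the output filename is only an illustration) records the environment-set MCA parameters of a job into a file:

shell$ mpirun --mca mpi_show_mca_params enviro \
    --mca mpi_show_mca_params_file /tmp/my-job-mca-params.txt -np 4 ./my_mpi_program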


155. How do I debug Open MPI processes in parallel?

This is a difficult question. Debugging in serial can be tricky: errors, uninitialized variables, stack smashing, ... etc. Debugging in parallel adds multiple different dimensions to this problem: a greater propensity for race conditions, asynchronous events, and the general difficulty of trying to understand N processes simultaneously executing -- the problem becomes quite formidable.

This FAQ section does not provide any definitive solutions to debugging in parallel. At best, it shows some general techniques and a few specific examples that may be helpful to your situation.

But there are various controls within Open MPI that can help with debugging. These are probably the most valuable entries in this FAQ section.


156. What tools are available for debugging in parallel?

There are two main categories of tools that can aid in parallel debugging:

  • Debuggers: Both serial and parallel debuggers are useful. Serial debuggers are what most programmers are used to (e.g., gdb), while parallel debuggers can attach to all the individual processes in an MPI job simultaneously, treating the MPI application as a single entity. This can be an extremely powerful abstraction, allowing the user to control every aspect of the MPI job, manually replicate race conditions, etc.
  • Profilers: Tools that analyze your usage of MPI and display statistics and meta information about your application's run. Some tools present the information "live" (as it occurs), while others collect the information and display it in a post mortem analysis.

Both freeware and commercial solutions are available for each kind of tool.


157. How do I run with parallel debuggers?

See these FAQ entries:


158. What controls does Open MPI have that aid in debugging?

Open MPI has a series of MCA parameters for the MPI layer itself that are designed to help with debugging. These parameters can be set in the usual ways. MPI-level MCA parameters can be displayed by invoking the following command:

shell$ ompi_info --param mpi all

Here is a summary of the debugging parameters for the MPI layer:

  • mpi_param_check: If set to true (any positive value), and when Open MPI is compiled with parameter checking enabled (the default), the parameters to each MPI function can be passed through a series of correctness checks. Problems such as passing illegal values (e.g., NULL or MPI_DATATYPE_NULL or other "bad" values) will be discovered at run time and an MPI exception will be invoked (the default of which is to print a short message and abort the entire MPI job). If set to 0, these checks are disabled, slightly increasing performance.
  • mpi_show_handle_leaks: If set to true (any positive value), OMPI will display lists of any MPI handles that were not freed before MPI_FINALIZE (e.g., communicators, datatypes, requests, etc.).
  • mpi_no_free_handles: If set to true (any positive value), do not actually free MPI objects when their corresponding MPI "free" function is invoked (e.g., do not free communicators when MPI_COMM_FREE is invoked). This can be helpful in tracking down applications that accidentally continue to use MPI handles after they have been freed.
  • mpi_show_mca_params: If set to true (any positive value), show a list of all MCA parameters and their values during MPI_INIT. This can be quite helpful for reproducibility of MPI applications.
  • mpi_show_mca_params_file: If set to a non-empty value, and if the value of mpi_show_mca_params is true, then output the list of MCA parameters to the filename specified by this value. If this parameter is empty, the list is sent to stderr.
  • mpi_keep_peer_hostnames: If set to a true value (any positive value), send the list of all hostnames involved in the MPI job to every process in the job. This can improve the specificity of error messages that Open MPI emits if a problem occurs (i.e., Open MPI can display the name of the peer host that it was trying to communicate with), but it can somewhat slow down the startup of large-scale MPI jobs.
  • mpi_abort_delay: If nonzero, print out an identifying message when MPI_ABORT is invoked showing the hostname and PID of the process that invoked MPI_ABORT, and then delay that many seconds before exiting. A negative value means to delay indefinitely. This allows a user to manually come in and attach a debugger when an error occurs. Remember that the default MPI error handler -- MPI_ERRORS_ARE_FATAL -- effectively invokes MPI_ABORT, so this parameter can be useful to discover problems identified by mpi_param_check.
  • mpi_abort_print_stack: If nonzero, print out a stack trace (on supported systems) when MPI_ABORT is invoked.
  • mpi_ddt_<foo>_debug, where <foo> can be one of pack, unpack, position, or copy: These are internal debugging features that are not intended for end users (but ompi_info will report that they exist).
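As an illustration (the application name is hypothetical), several of these parameters can be combined on a single mpirun command line:

shell$ mpirun --mca mpi_param_check 1 --mca mpi_show_handle_leaks 1 \
    --mca mpi_abort_delay 120 -np 4 ./my_mpi_program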


159. Do I need to build Open MPI with compiler/linker debugging flags (such as -g) to be able to debug MPI applications?

No.

If you build Open MPI without compiler/linker debugging flags (such as -g), you will not be able to step inside MPI functions when you debug your MPI applications. However, this is likely what you want -- the internals of Open MPI are quite complex and you probably don't want to start poking around in there.

You'll need to compile your own applications with -g (or whatever your compiler's equivalent is), but unless you have a need/desire to be able to step into MPI functions to see the internals of Open MPI, you do not need to build Open MPI with -g.


160. Can I use serial debuggers (such as gdb) to debug MPI applications?

Yes; the Open MPI developers do this all the time.

There are two common ways to use serial debuggers:

  1. Attach to individual MPI processes after they are running.

    For example, launch your MPI application as normal with mpirun. Then login to the node(s) where your application is running and use the --pid option to gdb to attach to your application.

    An inelegant-but-functional technique commonly used with this method is to insert the following code in your application where you want to attach:

    {
        int i = 0;
        char hostname[256];
        gethostname(hostname, sizeof(hostname));
        printf("PID %d on %s ready for attach\n", getpid(), hostname);
        fflush(stdout);
        while (0 == i)
            sleep(5);
    }
    

    This code will print a line to stdout showing the hostname where the process is running and the PID to attach to. It will then spin on the sleep() function forever waiting for you to attach with a debugger. Using sleep() as the inside of the loop means that the processor won't be pegged at 100% while waiting for you to attach.

    Once you attach with a debugger, go up the function stack until you are in this block of code (you'll likely attach during the sleep()) then set the variable i to a nonzero value. With GDB, the syntax is:

    (gdb) set var i = 7
    

    Then set a breakpoint after your block of code and continue execution until the breakpoint is hit. Now you have control of your live MPI application and can use the full functionality of the debugger.

    You can even add conditionals to only allow this "pause" in the application for specific MPI processes (e.g., MPI_COMM_WORLD rank 0, or whatever process is misbehaving); a hypothetical sketch of this appears after this list.

  2. Use mpirun to launch xterms (or equivalent) with serial debuggers.

    This technique launches a separate window for each MPI process in MPI_COMM_WORLD, each one running a serial debugger (such as gdb) that will launch and run your MPI application. Having a separate window for each MPI process can be quite handy for low process-count MPI jobs, but requires a bit of setup and configuration that is outside of Open MPI to work properly. A naive approach would be to assume that the following would immediately work:

    shell$ mpirun -np 4 xterm -e gdb my_mpi_application
    

    Unfortunately, it likely won't work. Several factors must be considered:

    1. What launcher is Open MPI using? In an rsh/ssh environment, Open MPI will default to using ssh when it is available, falling back to rsh when ssh cannot be found in the $PATH. But note that Open MPI closes the ssh (or rsh) sessions when the MPI job starts for scalability reasons. This means that the built-in SSH X forwarding tunnels will be shut down before the xterms can be launched. Although it is possible to force Open MPI to keep its SSH connections active (to keep the X tunneling available), we recommend using non-SSH-tunneled X connections, if possible (see below).
    2. In non-rsh/ssh environments (such as when using resource managers), the environment of the process invoking mpirun may be copied to all nodes. In this case, the DISPLAY environment variable may not be suitable.
    3. Some operating systems default to disabling the X11 server from listening for remote/network traffic. For example, see this post on the user's mailing list, describing how to enable network access to the X11 server on Fedora Linux.
    4. There may be intermediate firewalls or other network blocks that prevent X traffic from flowing between the hosts where the MPI processes (and xterms) are running and the host connected to the output display.

    The easiest way to get remote X applications (such as xterm) to display on your local screen is to forego the security of SSH-tunneled X forwarding. In a closed environment such as an HPC cluster, this may be an acceptable practice (indeed, you may not even have the option of using SSH X forwarding if SSH logins to cluster nodes are disabled), but check with your security administrator to be sure.

    If using non-encrypted X11 forwarding is permissible, we recommend the following:

    1. For each non-local host where you will be running an MPI process, add it to your X server's permission list with the xhost command. For example:

      shell$ cat my_hostfile
      inky
      blinky
      stinky
      clyde
      shell$ for host in `cat my_hostfile` ; do xhost +$host ; done
      

    2. Use the -x option to mpirun to export an appropriate DISPLAY variable so that the launched X applications know where to send their output. An appropriate value is usually (but not always) the hostname containing the display where you want the output and the :0 (or :0.0) suffix. For example:

      shell$ hostname
      arcade.example.com
      shell$ mpirun -np 4 --hostfile my_hostfile \
          -x DISPLAY=arcade.example.com:0 xterm -e gdb my_mpi_application
      

      Note that X traffic is fairly "heavy" -- if you are operating over a slow network connection, it may take some time before the xterm windows appear on your screen.

    3. If your xterm supports it, the -hold option may be useful. -hold tells xterm to stay open even when the application has completed. This means that if something goes wrong (e.g., gdb fails to execute, or unexpectedly dies, or ...), the xterm window will stay open allowing you to see what happened, instead of closing immediately and losing whatever error message may have been output.
    4. When you have finished, you may wish to disable X11 network permissions from the hosts that you were using. Use xhost again to disable these permissions:

      shell$ for host in `cat my_hostfile` ; do xhost -$host ; done
      

    Note that mpirun will not complete until all the xterms complete.
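Returning to the first technique (attaching to running processes), here is a minimal, hypothetical sketch of how the "pause for attach" block shown above can be restricted to MPI_COMM_WORLD rank 0. The function name and message format are illustrative only:

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: only rank 0 pauses and waits for a debugger */
static void wait_for_debugger_attach(void)
{
    volatile int i = 0;
    int rank;
    char hostname[256];

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (0 != rank) {
        return;   /* all other ranks continue immediately */
    }
    gethostname(hostname, sizeof(hostname));
    printf("PID %d on %s (rank %d) ready for attach\n",
           (int) getpid(), hostname, rank);
    fflush(stdout);
    while (0 == i) {
        sleep(5); /* avoid pegging the CPU while waiting */
    }
}

Call such a function (after MPI_INIT) at the point in your application where you want to attach.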


161. My process dies without any output. Why?

There may be many reasons for this; the Open MPI Team strongly encourages the use of tools (such as debuggers) whenever possible.

One of the reasons, however, may come from inside Open MPI itself. If your application fails due to memory corruption, Open MPI may subsequently fail to output an error message before dying. Specifically, starting with v1.3, Open MPI attempts to aggregate error messages from multiple processes in an attempt to show unique error messages only once (vs. one for each MPI process -- which can be unwieldy, especially when running large MPI jobs).

However, this aggregation process requires allocating memory in the MPI process when it displays the error message. If the process' memory is already corrupted, Open MPI's attempt to allocate memory may fail and the process will simply die, possibly silently. When Open MPI does not attempt to aggregate error messages, most of its setup work is done during MPI_INIT and no memory is allocated during the "print the error" routine. It therefore almost always successfully outputs error messages in real time -- but at the expense that you'll potentially see the same error message for each MPI process that encountered the error.

Hence, the error message aggregation is usually a good thing, but sometimes it can mask a real error. You can disable Open MPI's error message aggregation with the orte_base_help_aggregate MCA parameter. For example:

shell$ mpirun --mca orte_base_help_aggregate 0 ...


162. What is Memchecker?

The Memchecker MCA component allows MPI-semantic checking of your application (as well as of Open MPI's internals) with the help of memory-checking tools such as the Memcheck tool of the Valgrind suite (http://www.valgrind.org/).

The Memchecker component is included in Open MPI v1.3 and later.


163. What kind of errors can Memchecker find?

Memchecker is implemented on top of Valgrind's Memcheck tool, so it inherits all of Memcheck's capabilities: it checks all reads and writes of memory, and intercepts calls to malloc/new/free/delete. Most importantly, Memchecker is able to detect user buffer errors in both non-blocking and one-sided communications, e.g., reading or writing to buffers of active non-blocking receive operations and writing to buffers of active non-blocking send operations.

Here are some examples of errors that Memchecker can detect:

Accessing buffer under control of non-blocking communication:

int buf;
MPI_Request req;
MPI_Status status;
MPI_Irecv(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
/* The following line will produce a memchecker warning */
buf = 4711;
MPI_Wait(&req, &status);

Wrong input parameters, e.g. wrongly sized send buffers:

char *send_buffer;
send_buffer = malloc(5);
memset(send_buffer, 0, 5);
/* The following line will produce a memchecker warning */
MPI_Send(send_buffer, 10, MPI_CHAR, 1, 0, MPI_COMM_WORLD);

Accessing window under control of one-sided communication:

MPI_Get(A, 10, MPI_INT, 1, 0, 1, MPI_INT, win);
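/* The following line will produce a memchecker warning */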
A[0] = 4711;
MPI_Win_fence(0, win); 

Uninitialized input buffers:

char *buffer;
buffer = malloc(10);
/* The following line will produce a memchecker warning */
MPI_Send(buffer, 10, MPI_CHAR, 1, 0, MPI_COMM_WORLD);

Usage of the uninitialized MPI_ERROR field of the MPI_Status structure (the MPI-1 standard defines the MPI_ERROR field to be undefined for single-completion calls such as MPI_Wait or MPI_Test; see MPI-1 p. 22):

MPI_Wait(&request, &status);
/* The following line will produce a memchecker warning */
if (status.MPI_ERROR != MPI_SUCCESS)
    return ERROR;


164. How can I use Memchecker?

To use Memchecker, you need Open MPI 1.3 or later, and Valgrind 3.2.0 or later.

As this functionality is off by default, you need to turn it on with the configure flag --enable-memchecker. configure will then check for a recent Valgrind distribution and include the compilation of ompi/opal/mca/memchecker. You can verify that the component is being built by using the ompi_info application. Please note that all of this only makes sense together with --enable-debug, which Valgrind requires in order to output messages pointing directly to the relevant source code lines; without debugging info, the messages from Valgrind are nearly useless.

Here is a configuration example to enable Memchecker:

shell$ ./configure --prefix=/path/to/openmpi --enable-debug \
    --enable-memchecker --with-valgrind=/path/to/valgrind

To check if Memchecker is successfully enabled after installation, simply run this command:

shell$ ompi_info | grep memchecker

You should see output like this:

    MCA memchecker: valgrind (MCA v1.0, API v1.0, Component v1.3)

Otherwise, you probably didn't configure and install Open MPI correctly.


165. How do I run my MPI application with Memchecker?

First of all, you have to make sure that Valgrind 3.2.0 or later is installed, and that Open MPI is compiled with Memchecker enabled. Then simply run your application with Valgrind, e.g.:

shell$ mpirun -np 2 valgrind ./my_app

Or if you enabled Memchecker, but you don't want to check the application at this time, then just run your application as usual. E.g.:

shell$ mpirun -np 2 ./my_app


166. Does Memchecker cause performance degradation to my application?

The configure option --enable-memchecker (together with --enable-debug) does cause performance degradation, even if not running under Valgrind. The following explains the mechanism and may help in making the decision whether to provide a cluster-wide installation with --enable-memchecker.

There are two cases:

Further information and performance data with the NAS Parallel Benchmarks may be found in the paper Memory Debugging of MPI-Parallel Applications in Open MPI.


167. Is Open MPI 'Valgrind-clean' or how can I identify real errors?

This issue has been raised many times on the Open MPI mailing lists.

There are many situations where Open MPI purposefully does not initialize memory that it subsequently communicates, e.g., by calling writev. Furthermore, several cases are known where memory is not properly freed upon MPI_Finalize.

This certainly does not help in distinguishing real errors from false positives. Valgrind, however, provides functionality to suppress errors and warnings from certain function contexts.

In an attempt to ease debugging using Valgrind, starting with v1.5, Open MPI provides a so-called Valgrind suppression file that can be passed on the command line:

shell$ mpirun -np 2 valgrind --suppressions=$PREFIX/share/openmpi/openmpi-valgrind.supp ./my_app

More information on suppression-files and how to generate them can be found in Valgrind's Documentation.


168. Can I make Open MPI use rsh instead of ssh?

Yes. The method to do this has changed over the different versions of Open MPI.

  1. v1.3 series: The orte_rsh_agent MCA parameter accepts a colon-delimited list of programs to search for in your path to use as the remote startup agent (the MCA parameter name plm_rsh_agent also works, but it is deprecated). The default value is "ssh : rsh", meaning that it will look for ssh first, and if it doesn't find it, use rsh. You can change the value of this parameter as relevant to your environment, such as simply changing it to rsh or rsh : ssh if you have a mixture.
  2. v1.1 and v1.2 series: The v1.1 and v1.2 method is exactly the same as the v1.3 method, but the MCA parameter name is slightly different: pls_rsh_agent ("pls" vs. "plm"). Using the old "pls" name will continue to work in the v1.3 series, but it is now officially deprecated -- you'll receive a warning if you use it.
  3. v1.0 series: In the 1.0.x series, Open MPI defaults to using ssh for remote startup of processes in unscheduled environments. You can change this to rsh by setting the MCA parameter pls_rsh_agent to rsh.

See this FAQ entry for details on how to set MCA parameters -- particularly with multi-word values.
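For example, to force the use of rsh on the command line with the v1.3-style parameter name (the application name is hypothetical):

shell$ mpirun --mca orte_rsh_agent rsh -np 4 ./my_mpi_program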


169. What pre-requisites are necessary for running an Open MPI job under rsh/ssh?

In general, they are the same for running Open MPI jobs in other environments (see this FAQ category for more general information).


170. How can I make ssh not ask me for a password?

There are multiple ways. (If you are using rsh rather than ssh to launch processes on remote nodes, see the FAQ entry on .rhosts files instead.)

Note that there are multiple versions of ssh available. References to ssh in this text refer to OpenSSH.

This documentation provides an overview for using user keys and the OpenSSH 2.x key management agent (if your OpenSSH only supports 1.x key management, you should upgrade). See the OpenSSH documentation for more details and a more thorough description. The process is essentially the same for other versions of SSH, but the command names and filenames may be slightly different. Consult your SSH documentation for more details.

Normally, when you use ssh to connect to a remote host, it will prompt you for your password. However, for mpirun (and mpiexec, which, in Open MPI, is identical to mpirun) to work properly, you need to be able to execute jobs on remote nodes without typing in a password. In order to do this, you will need to set up an SSH key pair with a passphrase. We recommend using RSA keys, as they are generally "better" (i.e., more secure) than DSA keys. As such, this text describes the process for RSA setup.

NOTE: This text briefly shows the steps involved, but the ssh documentation is authoritative on these matters and should be consulted for more information.

The first thing that you need to do is generate an RSA key pair to use with ssh-keygen:

shell$ ssh-keygen -t rsa

Accept the default value for the file in which to store the key ([$HOME/.ssh/id_rsa]) and enter a passphrase for your key pair. You may choose not to enter a passphrase and thereby obviate the need for using the ssh-agent. However, this greatly weakens the authentication that is possible, because your secret key is unencrypted and therefore potentially vulnerable to compromise. It has been compared to the moral equivalent of leaving a plain text copy of your password in your $HOME directory. See the ssh documentation for more details.

Next, copy the $HOME/.ssh/id_rsa.pub file generated by ssh-keygen to $HOME/.ssh/authorized_keys (or add it to the end of authorized_keys if that file already exists):

shell$ cd $HOME/.ssh
shell$ cp id_rsa.pub authorized_keys

In order for RSA authentication to work, you need to have the $HOME/.ssh directory in your home directory on all the machines you are running Open MPI. If your home directory is on a common filesystem, this may be already taken care of. If not, you will need to copy the $HOME/.ssh directory to your home directory on all Open MPI nodes (be sure to do this in a secure manner -- perhaps using the scp command -- particularly if your secret key is not encrypted).

ssh is very particular about file permissions. Ensure that your home directory on all your machines is set to at least mode 755, your $HOME/.ssh directory is also set to at least mode 755, and that the following files inside $HOME/.ssh have at least the following permissions:

-rw-r--r--  authorized_keys
-rw-------  id_rsa
-rw-r--r--  id_rsa.pub
-rw-r--r--  known_hosts

The phrase "at least" in the above paragraph means the following:

  • The files need to be readable by you
  • The files should only be writable by you
  • The files should not be executable
  • Aside from id_rsa, the files can be readable by others, but do not need to be
  • Your $HOME and $HOME/.ssh directories can be readable by others, but do not need to be
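As an illustration (a sketch only; adapt to your site's policies), the following commands set permissions consistent with the listing above:

shell$ chmod 755 $HOME $HOME/.ssh
shell$ chmod 644 $HOME/.ssh/authorized_keys $HOME/.ssh/id_rsa.pub $HOME/.ssh/known_hosts
shell$ chmod 600 $HOME/.ssh/id_rsa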

You are now set up to use RSA authentication. However, when you ssh to a remote host, you will still be asked for your RSA passphrase (as opposed to your normal password). This is where the ssh-agent program comes in. It allows you to type in your RSA passphrase once, and then have all successive invocations of ssh automatically authenticate you against the remote host. See the ssh-agent(1) documentation for more details than what are provided here.

Additionally, check the documentation and setup of your local environment; ssh-agent may already be setup for you (e.g., see if the shell environment variable $SSH_AUTH_SOCK exists; if so, ssh-agent is likely already running). If ssh-agent is not already running, you can start it manually with the following:

shell$ eval `ssh-agent`

Note the specific invocation method: ssh-agent outputs some shell commands to its output (e.g., setting the SSH_AUTH_SOCK environment variable).

You will probably want to start the ssh-agent before you start your graphics / windowing system so that all your windows will inherit the environment variables set by this command. Note that some sites invoke ssh-agent for each user upon login automatically; be sure to check and see if there is an ssh-agent running for you already.

Once the ssh-agent is running, you can tell it your passphrase by running the ssh-add command:

shell$ ssh-add $HOME/.ssh/id_rsa

At this point, if you ssh to a remote host that has the same $HOME/.ssh directory as your local one, you should not be prompted for a password or passphrase. If you are, a common problem is that the permissions in your $HOME/.ssh directory are not as they should be.

Note that this text has covered the ssh commands in very little detail. Please consult the ssh documentation for more information.


171. What is a .rhosts file? Do I need it?

If you are using rsh to launch processes on remote nodes, you will probably need to have a $HOME/.rhosts file.

This file allows you to execute commands on remote nodes without being prompted for a password. The permissions on this file usually must be 0644 ([rw-r--r--]). It must exist in your home directory on every node that you plan to use Open MPI with.

Each line in the .rhosts file indicates a machine and user that programs may be launched from. For example, if the user steve wishes to launch programs from the machine stevemachine to the machines alpha, beta, and gamma, there must be a .rhosts file on each of the three remote machines ([alpha], beta, and gamma) with at least the following line in it:

stevemachine steve

The first field indicates the name of the machine where jobs may originate from; the second field indicates the user ID who may originate jobs from that machine. It is better to supply a fully-qualified domain name for the machine name (for security reasons -- there may be many machines named stevemachine on the internet). So the above example should be:

stevemachine.example.com steve

The Open MPI Team strongly discourages the use of "+" in the .rhosts file. This is always a huge security hole.

If rsh does not find a matching line in the $HOME/.rhosts file, it will prompt you for a password. Open MPI requires the password-less execution of commands; if rsh prompts for a password, mpirun will fail.

NOTE: Some implementations of rsh are very picky about the format of text in the .rhosts file. In particular, some do not allow leading white space on each line in the .rhosts file, and will give a misleading "permission denied" error if you have white space before the machine name.

NOTE: It should be noted that rsh is not considered "secure" or "safe" -- .rhosts authentication is considered fairly weak. The Open MPI Team recommends that you use ssh ("Secure Shell") to launch remote programs as it uses a much stronger authentication system.


172. Should I use + in my .rhosts file?

No!

While there are a very small number of cases where using "+" in your .rhosts file may be acceptable, the Open MPI Team highly recommends that you do not.

Using a "+" in your .rhosts file indicates that you will allow any machine and/or any user to connect as you. This is extremely dangerous, especially on machines that are connected to the internet. Consider the fact that anyone on the internet can connect to your machine (as you) -- it should strike fear into your heart.

The + should not be used for either field of the .rhosts file.

Instead, you should use the full and proper hostname and username of accounts that are authorized to remotely login as you to that machine (or machines). This is usually just a list of your own username on a list of machines that you wish to run Open MPI with. See this FAQ entry for further details, as well as your local rsh documentation.

Additionally, the Open MPI Team strongly recommends that rsh not be used in unscheduled environments (especially those connected to the internet) -- it is considered weak remote authentication. Instead, we recommend the use of ssh -- the secure remote shell. See this FAQ entry for more details.


173. What versions of BProc does Open MPI work with?

BProc support was dropped from Open MPI in the Open MPI v1.3 series.

The last version of Open MPI to include BProc support was Open MPI 1.2.9, which was released in February of 2009.

Prior to that (as of December 2005), Open MPI supported recent versions of BProc, such as those found in Clustermatic. We did not test older forks of the BProc project, such as those from Scyld (now defunct). Since Open MPI's BProc support used some advanced features from recent BProc versions, it is somewhat doubtful (but untested) whether it would work on Scyld systems.


174. What pre-requisites are necessary for running an Open MPI job under BProc?

In general, they are the same for running Open MPI jobs in other environments (see this FAQ category for more general information).

However, it is worth noting that BProc may not bring all necessary dynamic libraries along with a process when it is migrated to a back-end compute node. In addition, Open MPI opens components on the fly (i.e., after the process has started), so if these components are unavailable on the back-end compute nodes, Open MPI applications may fail.

In general, the Open MPI team recommends one of the following two solutions when running on BProc clusters (in order of preference):

  1. Compile Open MPI statically, meaning that Open MPI's libraries produce static ".a" libraries and all components are included in the library (as opposed to dynamic ".so" libraries, and separate ".so" files for each component that are found and loaded at run-time) so that applications do not need to find any shared libraries or components when they are migrated to back-end compute nodes. This can be accomplished by specifying --enable-static --disable-shared to the configure script when building Open MPI (see the example after this list).
  2. If you do not wish to use static compilation, ensure that Open MPI is fully installed on all nodes (i.e., the head node and all compute nodes) in the same directory location. For example, if Open MPI is installed in /opt/openmpi-1.8.1 on the head node, ensure that it is also installed in that same directory on all the compute nodes.
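As a hedged example of option 1 above (the installation prefix is illustrative), a static build could be configured as:

shell$ ./configure --prefix=/opt/openmpi --enable-static --disable-shared
shell$ make all install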


175. How do I run jobs under Torque / PBS Pro?

The short answer is just to use mpirun as normal.

Open MPI automatically obtains both the list of hosts and how many processes to start on each host from Torque / PBS Pro directly. Hence, it is unnecessary to specify the --hostfile, --host, or -np options to mpirun. Open MPI will use PBS/Torque-native mechanisms to launch and kill processes ([rsh] and/or ssh are not required).

For example:

# Allocate a PBS job with 4 nodes
shell$ qsub -I -lnodes=4
# Now run an Open MPI job on all the nodes allocated by PBS/Torque
# (starting with Open MPI v1.2; you need to specify -np for the 1.0
# and 1.1 series).
shell$ mpirun my_mpi_application

This will run the 4 MPI processes on the nodes that were allocated by PBS/Torque. Or, if submitting a script:

shell$ cat my_script.sh
#!/bin/sh
mpirun my_mpi_application
shell$ qsub -l nodes=4 my_script.sh


176. Does Open MPI support Open PBS?

As of this writing, Open PBS is so ancient that we are not aware of any sites running it. As such, we have never tested Open MPI with Open PBS and therefore do not know if it would work or not.


177. How does Open MPI get the list of hosts from Torque / PBS Pro?

Open MPI has changed how it obtains hosts from Torque / PBS Pro over time:

  • v1.0 and v1.1 series: The list of hosts allocated to a Torque / PBS Pro job is obtained directly from the scheduler using the internal TM API.
  • v1.2 series: Due to scalability limitations in how the TM API was used in the v1.0 and v1.1 series, Open MPI was modified to read the $PBS_NODEFILE to obtain hostnames. Specifically, reading the $PBS_NODEFILE is much faster at scale than how the v1.0 and v1.1 series used the TM API.

It is possible that future versions of Open MPI may switch back to using the TM API in a more scalable fashion, but there isn't currently a huge demand for it (reading the $PBS_NODEFILE works just fine).

Note that the TM API is used to launch processes in all versions of Open MPI; the only thing that has changed over time is how Open MPI obtains hostnames.


178. What happens if $PBS_NODEFILE is modified?

Bad Things will happen.

We've had reports from some sites that system administrators modify the $PBS_NODEFILE in each job according to local policies. This will currently cause Open MPI to behave in an unpredictable fashion. As long as no new hosts are added to the hostfile, it usually means that Open MPI will incorrectly map processes to hosts, but in some cases it can cause Open MPI to fail to launch processes altogether.

The best course of action is to not modify the $PBS_NODEFILE.


179. Can I specify a hostfile or use the --host option to mpirun when running in a Torque / PBS environment?

As of version v1.2.1, no.

Open MPI will fail to launch processes properly when a hostfile is specified on the mpirun command line, or if the mpirun [--host] option is used.

We're working on correcting the error. A future version of Open MPI will likely launch on the hosts specified either in the hostfile or via the --host option as long as they are a proper subset of the hosts allocated to the Torque / PBS Pro job.


180. How do I run with the SGE launcher?

Support for SGE is included in Open MPI version 1.2 and later.

NOTE: To build SGE support in v1.3, you will need to explicitly request the SGE support with the "--with-sge" command line switch to Open MPI's configure script.

See this FAQ entry for a description of how to correctly build Open MPI with SGE support.

To verify if support for SGE is configured into your Open MPI installation, run ompi_info as shown below and look for gridengine. The components you will see are slightly different between v1.2 and v1.3.

For Open MPI 1.2:

shell$ ompi_info | grep gridengine
                 MCA ras: gridengine (MCA v1.0, API v1.0, Component v1.2)
                 MCA pls: gridengine (MCA v1.0, API v1.0, Component v1.2)

For Open MPI 1.3:

shell$ ompi_info | grep gridengine
                 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)

Open MPI will automatically detect when it is running inside SGE and will just "do the Right Thing."

Specifically, if you execute an mpirun command in a SGE job, it will automatically use the SGE mechanisms to launch and kill processes. There is no need to specify what nodes to run on -- Open MPI will obtain this information directly from SGE and default to a number of processes equal to the slot count specified. For example, this will run 4 MPI processes on the nodes that were allocated by SGE:

# Get the environment variables for SGE
# (Assuming SGE is installed at /opt/sge and $SGE_CELL is 'default' in your environment)
# C shell settings
shell% source /opt/sge/default/common/settings.csh

# bourne shell settings
shell$ . /opt/sge/default/common/settings.sh

# Allocate an SGE interactive job with 4 slots from a parallel
# environment (PE) named 'orte' and run a 4-process Open MPI job
shell$ qrsh -pe orte 4 -b y mpirun -np 4 a.out

There are also other ways to submit jobs under SGE:

# Submit a batch job with the 'mpirun' command embedded in a script
shell$ qsub -pe orte 4 my_mpirun_job.csh

# Submit an SGE and OMPI job and mpirun in one line
shell$ qrsh -V -pe orte 4 mpirun hostname

# Use qstat(1) to show the status of SGE jobs and queues
shell$ qstat -f

In reference to the setup, be sure you have a Parallel Environment (PE) defined for submitting parallel jobs. You don't have to name your PE "orte". The following example shows what a PE named 'orte' might look like:

% qconf -sp orte
   pe_name            orte
   slots              99999
   user_lists         NONE
   xuser_lists        NONE
   start_proc_args    NONE
   stop_proc_args     NONE
   allocation_rule    $fill_up
   control_slaves     TRUE
   job_is_first_task  FALSE
   urgency_slots      min
   accounting_summary FALSE
   qsort_args         NONE

   "qsort_args" is necessary with the Son of Grid Engine distribution,
   version 8.1.1 and later, and probably only applicable to it.  For
   very old versions of SGE, omit "accounting_summary" too.

   You may want to alter other parameters, but the important one is
   "control_slaves", specifying that the environment has "tight
   integration".  Note also the lack of a start or stop procedure.
   The tight integration means that mpirun automatically picks up the
   slot count to use as a default in place of the '-np' argument,
   picks up a host file, spawns remote processes via 'qrsh' so that
   SGE can control and monitor them, and creates and destroys a
   per-job temporary directory ($TMPDIR), in which Open MPI's
   directory will be created (by default).
 
   Be sure the queue will make use of the PE that you specified:


% qconf -sq all.q
...
pe_list               make cre orte
...

To determine whether the SGE parallel job is successfully launched to the remote nodes, you can pass in the MCA parameter "--mca plm_base_verbose 1" to mpirun.

This will add a -verbose flag to the qrsh -inherit command that is used to send parallel tasks to the remote SGE execution hosts. It will show whether the connections to the remote hosts are established successfully or not.
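For example (the application name is hypothetical):

shell$ mpirun --mca plm_base_verbose 1 -np 4 ./my_mpi_program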

Various SGE documentation, with pointers to more, is available at the Son of Grid Engine site, and configuration instructions can be found at the Son of Grid Engine configuration how-to site.


181. Does the SGE tight integration support the -notify flag to qsub?

If you are running SGE6.2 Update 3 or later, then the -notify flag is supported. If you are running earlier versions, then the -notify flag will not work and using it will cause the job to be killed.

To use -notify, one has to be careful. First, let us review what -notify does. Here is an excerpt from the qsub man page for the -notify flag.

-notify
This flag, when set causes Sun Grid Engine to send
warning signals to a running job prior to sending the
signals themselves. If a SIGSTOP is pending, the job
will receive a SIGUSR1 several seconds before the SIGSTOP.
If a SIGKILL is pending, the job will receive a SIGUSR2
several seconds before the SIGKILL. The amount of time
delay is controlled by the notify parameter in each
queue configuration.

Let us assume the reason you want to use the -notify flag is to get the SIGUSR1 signal prior to getting the SIGTSTP signal. As mentioned in this FAQ entry, one could run the job as shown in this batch script.

#! /bin/bash
#$ -S /bin/bash
#$ -V
#$ -cwd
#$ -N Job1
#$ -pe orte 16
#$ -j y
#$ -l h_rt=00:20:00
mpirun -np 16 -mca orte_forward_job_control 1 a.out

However, one has to make one of two changes to this script for things to work properly. By default, a SIGUSR1 signal will kill a shell script. So we have to make sure that does not happen. Here is one way to handle it.

#! /bin/bash
#$ -S /bin/bash
#$ -V
#$ -cwd
#$ -N Job1
#$ -pe orte 16
#$ -j y
#$ -l h_rt=00:20:00
exec mpirun -np 16 -mca orte_forward_job_control 1 a.out

Alternatively, one can catch the signals in the script instead of doing an exec on the mpirun.

#! /bin/bash
#$ -S /bin/bash
#$ -V
#$ -cwd
#$ -N Job1
#$ -pe orte 16
#$ -j y
#$ -l h_rt=00:20:00
function sigusr1handler()
{
        echo "SIGUSR1 caught by shell script" 1>&2
}
function sigusr2handler()
{
        echo "SIGUSR2 caught by shell script" 1>&2
}
trap sigusr1handler SIGUSR1
trap sigusr2handler SIGUSR2
mpirun -np 16 -mca orte_forward_job_control 1 a.out


182. Can I suspend and resume my job?

A new feature was added into Open MPI 1.3.1 that supports suspend/resume of an MPI job. To suspend the job, you send a SIGTSTP (not SIGSTOP) signal to mpirun. mpirun will catch this signal and forward it to the a.outs as a SIGSTOP signal. To resume the job, you send a SIGCONT signal to mpirun which will be caught and forwarded to the a.outs.

By default, this feature is not enabled. This means that both the SIGTSTP and SIGCONT signals will simply be consumed by the mpirun process. To have them forwarded, you have to run the job with --mca orte_forward_job_control 1. Here is an example on Solaris.

shell$ mpirun -mca orte_forward_job_control 1 -np 2 a.out

In another window, we suspend and continue the job.

shell$ prstat -p 15301,15303,15305
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 15305 rolfv     158M   22M cpu1     0    0   0:00:21 5.9% a.out/1
 15303 rolfv     158M   22M cpu2     0    0   0:00:21 5.9% a.out/1
 15301 rolfv    8128K 5144K sleep   59    0   0:00:00 0.0% orterun/1

shell$ kill -TSTP 15301
shell$ prstat -p 15301,15303,15305
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 15303 rolfv     158M   22M stop    30    0   0:01:44  21% a.out/1
 15305 rolfv     158M   22M stop    20    0   0:01:44  21% a.out/1
 15301 rolfv    8128K 5144K sleep   59    0   0:00:00 0.0% orterun/1

shell$ kill -CONT 15301
shell$ prstat -p 15301,15303,15305
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 15305 rolfv     158M   22M cpu1     0    0   0:02:06  17% a.out/1
 15303 rolfv     158M   22M cpu3     0    0   0:02:06  17% a.out/1
 15301 rolfv    8128K 5144K sleep   59    0   0:00:00 0.0% orterun/1

Note that all this does is stop the a.outs. It does not, for example, free any pinned memory when the job is in the suspended state.

To get this to work under the SGE environment, you have to change the suspend_method entry in the queue. It has to be set to SIGTSTP. Here is an example of what a queue should look like.

shell$ qconf -sq all.q
qname                 all.q
[...snip...]
starter_method        NONE
suspend_method        SIGTSTP
resume_method         NONE 

Note that if you need to suspend other types of jobs with SIGSTOP (instead of SIGTSTP) in this queue then you need to provide a script that can implement the correct signals for each job type.


183. How do I run jobs under SLURM?

The short answer is to just use mpirun as normal, provided you configured OMPI --with-slurm. You can also directly launch your application using srun if OMPI is configured per this FAQ entry.

The longer answer is that Open MPI supports launching parallel jobs in all three methods that SLURM supports:

  1. Launching via "salloc ...": supported (older versions of SLURM used "srun -A ...")
  2. Launching via "sbatch ...": supported (older versions of SLURM used "srun -B ...")
  3. Launching via "srun -n X my_mpi_application"

Specifically, you can launch Open MPI's mpirun in an interactive SLURM allocation (via the salloc command) or you can submit a script to SLURM (via the sbatch command), or you can "directly" launch MPI executables via srun.

Open MPI automatically obtains both the list of hosts and how many processes to start on each host from SLURM directly. Hence, it is unnecessary to specify the --hostfile, --host, or -np options to mpirun. Open MPI will also use SLURM-native mechanisms to launch and kill processes ([rsh] and/or ssh are not required).

For example:

# Allocate a SLURM job with 4 nodes
shell$ salloc -N 4 sh
# Now run an Open MPI job on all the nodes allocated by SLURM
# (Note that you need to specify -np for the 1.0 and 1.1 series;
# the -np value is inferred directly from SLURM starting with the 
# v1.2 series)
shell$ mpirun my_mpi_application

This will run the 4 MPI processes on the nodes that were allocated by SLURM. Equivalently, you can do this:

# Allocate a SLURM job with 4 nodes and run your MPI application in it
shell$ salloc -N 4 mpirun my_mpi_application

Or, if submitting a script:

shell$ cat my_script.sh
#!/bin/sh
mpirun my_mpi_application
shell$ sbatch -N 4 my_script.sh
srun: jobid 1234 submitted
shell$


184. Does Open MPI support "srun -n X my_mpi_application"?

Yes, if you have configured OMPI --with-pmi=foo, where foo is the path to the directory where pmi.h/pmi2.h is located. Slurm (> 2.6, > 14.03) installs PMI-2 support by default.

Older versions of Slurm install PMI-1 by default. If you desire PMI-2, Slurm requires that you manually install that support. When the --with-pmi option is given, OMPI will automatically determine if PMI-2 support was built and use it in place of PMI-1.
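As a hedged example (the Slurm installation path is an assumption; use whatever directory contains pmi.h/pmi2.h on your system):

shell$ ./configure --prefix=/path/to/openmpi --with-slurm --with-pmi=/opt/slurm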


185. I use SLURM on a cluster with the OpenFabrics network stack. Do I need to do anything special?

Yes. You need to ensure that SLURM sets up the locked memory limits properly. Be sure to see this FAQ entry about locked memory and this FAQ entry for references about SLURM.


186. Any issues with Slurm 2.6.3?

Yes. The Slurm 2.6.3 and 14.03 releases have a bug in their PMI-2 support.

For the slurm-2.6 branch, it is recommended to use the latest version (2.6.9 as of 2014/4), which is known to work properly with pmi2.

For the slurm-14.03 branch, the fix will be in 14.03.1.


187. How do I reduce startup time for jobs on large clusters?

There are several ways to reduce the startup time on large clusters. Some of them are described on this page. We continue to work on making startup even faster, especially on the large clusters coming in future years.

Open MPI 1.3 is significantly faster and more robust than its predecessors. We recommend that anyone running large jobs and/or on large clusters make the upgrade to the 1.3 series.


188. Where should I put my libraries: Network vs. local filesystems?

Open MPI itself doesn't really care where its libraries are stored. However, where they are stored does have an impact on startup times, particularly for large clusters, which can be mitigated somewhat through use of Open MPI's configuration options.

Startup times will always be minimized by storing the libraries local to each node, either on local disk or in RAM-disk. The latter is sometimes problematic since the libraries do consume some space, thus potentially reducing memory that would have been available for MPI processes.

There are two main considerations for large clusters that need to place the Open MPI libraries on networked file systems:

  • While DSO's are more flexible, you definitely do not want to use them when the Open MPI libraries will be mounted on a network file system! Doing so will lead to significant network traffic and delayed start times, especially on clusters with a large number of nodes. Instead, be sure to configure your build with --disable-dlopen. This will include the DSO's in the main libraries, resulting in much faster startup times.
  • Many networked file systems use automount for user level directories, as well as for some locally administered system directories. There are many reasons why system administrators may choose to automount such directories. MPI jobs, however, tend to launch very quickly, thereby creating a situation wherein a large number of nodes will nearly simultaneously demand automount of a specific directory. This can overload NFS servers, resulting in delayed response or even failed automount requests.

    Note that this applies to both automount of directories containing Open MPI libraries as well as directories containing user applications. Since these are unlikely to be the same location, multiple automount requests from each node are possible, thus increasing the level of traffic.


189. Static vs shared libraries?

It is perfectly fine to use either shared or static libraries. Shared libraries will save memory when operating multiple processes per node, especially on clusters with high numbers of cores on a node, but can also take longer to launch on networked file systems (see network vs. local filesystem FAQ entry for suggestions on how to mitigate such problems).


190. How do I reduce the time to wireup OMPI's out-of-band communication system?

Open MPI's run-time uses an out-of-band (OOB) communication subsystem to pass messages during the launch, initialization, and termination stages for the job. These messages allow mpirun to tell its daemons what processes to launch, and allow the daemons in turn to forward stdio to mpirun, update mpirun on process status, etc.

The OOB uses TCP sockets for its communication, with each daemon opening a socket back to mpirun upon startup. In a large cluster, this can mean thousands of connections being formed on the node where mpirun resides, and requires that mpirun actually process all these connection requests. Mpirun defaults to processing connection requests sequentially - so on large clusters, a backlog can be created that can cause remote daemons to timeout waiting for a response.

Fortunately, Open MPI provides an alternative mechanism for processing connection requests that helps alleviate this problem. Setting the MCA parameter oob_tcp_listen_mode to listen_thread causes mpirun to startup a separate thread dedicated to responding to connection requests. Thus, remote daemons receive a quick response to their connection request, allowing mpirun to deal with the message as soon as possible.

This parameter can be included in the default MCA parameter file, placed in the user's environment, or added to the mpirun command line. See this FAQ entry for more details on how to set MCA parameters.
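For example, a sketch of setting it in the default MCA parameter file (commonly $prefix/etc/openmpi-mca-params.conf; adjust for your installation):

# In the default MCA parameter file
oob_tcp_listen_mode = listen_thread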


191. Why is my job failing because of file descriptor limits?

This is a known issue in Open MPI releases prior to the v1.3 series. The problem lies in the connection topology for Open MPI's out-of-band (OOB) communication subsystem. Prior to the 1.3 series, a fully-connected topology was used that required every process to open a connection to every other process in the job. This can rapidly overwhelm the usual system limits.

There are two methods you can use to circumvent the problem. First, upgrade to the v1.3 series if you can - this would be our recommended approach as there are considerable improvements in that series vs. the 1.2 one.

If you cannot upgrade and must stay with the v1.2 series, then you need to increase the number of file descriptors in your system limits. This commonly requires that your system administrator increase the number of file descriptors allowed by the system itself. The number required depends both on the number of nodes in your cluster and the maximum number of processes you plan to run on each node. Assuming you want to allow jobs that fully occupy the cluster, then the minimum number of file descriptors you will need is roughly (#procs_on_a_node + 1) * #procs_in_the_job.
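As a purely illustrative calculation, a cluster that allows 16 processes per node running a 1,024-process job would need roughly (16 + 1) * 1024 = 17,408 file descriptors.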

It is always wise to have a few extra just in case :-)

Note that this only covers the file descriptors needed for the out-of-band communication subsystem. It specifically does not address file descriptors needed to support the MPI TCP transport, if that is being used on your system. If it is, then additional file descriptors will be required for those TCP sockets. Unfortunately, a simple formula cannot be provided for that value as it depends completely on the number of point-to-point TCP connections being made. If you believe that users may want to fully connect an MPI job via TCP, then it would be safest to simply double the number of file descriptors calculated above.

This can, of course, get to be a really big number...which is why you might want to consider upgrading to the v1.3 series, where OMPI only opens #nodes OOB connections on each node. We are currently working on even more sparsely connected topologies for very large clusters, with the goal of constraining the number of connections opened on a node to an arbitrary number as specified by an MCA parameter.


192. I know my cluster's configuration - how can I take advantage of that knowledge?

Clusters rarely change from day-to-day, and large clusters rarely change at all. If you know your cluster's configuration, there are several steps you can take to both reduce Open MPI's memory footprint and reduce the launch time of large-scale applications. These steps use a combination of build-time configuration options to eliminate components - thus eliminating their libraries and avoiding unnecessary component open/close operations - as well as run-time MCA parameters to specify what modules to use by default for most users.

One way to save memory is to avoid building components that will actually never be selected by the system. Unless MCA parameters specify which components to open, built components are always opened and tested as to whether or not they should be selected for use. If you know that a component can build on your system, but due to your cluster's configuration will never actually be selected, then it is best to simply configure OMPI to not build that component by using the --enable-mca-no-build configure option.

For example, if you know that your system will only utilize the "ob1" component of the PML framework, then you can no_build all the others. This not only reduces memory in the libraries, but also reduces memory footprint that is consumed by Open MPI opening all the built components to see which of them can be selected to run.
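As a hedged sketch (the component name given here is only an example; list whichever framework-component pairs do not apply to your cluster):

shell$ ./configure --prefix=/path/to/openmpi --enable-mca-no-build=pml-cm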

In some cases, however, a user may optionally choose to use a component other than the default. For example, you may want to build all of the routed framework components, even though the vast majority of users will simply use the default binomial component. This means you have to allow the system to build the other components, even though they may rarely be used.

You can still save launch time and memory, though, by setting the routed=binomial MCA parameter in the default MCA parameter file. This causes OMPI to not open the other components during startup, but allows users to override this on their command line or in their environment so no functionality is lost - you just save some memory and time.
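
For example, assuming the system-wide default parameter file is $prefix/etc/openmpi-mca-params.conf (see the FAQ entry on setting MCA parameter values in files), the entry is just one line:

# Pre-select the binomial routed component; users can still override
# this on their command line or in their environment
routed = binomial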

Rather than have to figure this all out by hand, we are working on a new OMPI tool called ompi-profiler. When run on a cluster, it will tell you the selection results of all frameworks - i.e., for each framework on each node, which component was selected to run - and a variety of other information that will help you tailor Open MPI for your cluster. Stay tuned for more info as we continue to work on ways to improve your performance...


193. What is the Modular Component Architecture (MCA)?

The Modular Component Architecture (MCA) is the backbone for much of Open MPI's functionality. It is a series of frameworks, components, and modules that are assembled at run-time to create an MPI implementation.

Frameworks: An MCA framework manages zero or more components at run time and is targeted at a specific task (e.g., provide MPI collective operation functionality). Each MCA framework supports a single component type, but may support multiple versions of that type. The framework uses the services from the MCA base functionality to find and/or load components.

Components: An MCA component is an implementation of a framework's interface. It is a standalone collection of code that can be bundled into a plugin that can be inserted into the Open MPI code base at run-time and/or compile-time.

Modules: An MCA module is an instance of a component (in the C++ sense of the word "instance"; an MCA component is analogous to a C++ class). For example, if a node running an Open MPI application has multiple ethernet NICs, the Open MPI application will contain one TCP MPI point-to-point component, but two TCP point-to-point modules.

Frameworks, components, and modules can be dynamic or static. That is, they can be available as plugins or they may be compiled statically into libraries (e.g., libmpi).


194. What are MCA parameters?

MCA parameters are the basic unit of run-time tuning for Open MPI. They are simple "key = value" pairs that are used extensively throughout the code base. The general rules of thumb that the developers use are:

  • Instead of using a constant for an important value, make it an MCA parameter
  • If a task can be implemented in multiple, user-discernible ways, implement as many as possible and make the choice between them an MCA parameter

For example, an easy MCA parameter to describe is the boundary between short and long messages in TCP wire-line transmissions. "Short" messages are sent eagerly whereas "long" messages use a rendezvous protocol. The decision point between these two protocols is the overall size of the message (in bytes). By making this value an MCA parameter, it can be changed at run-time by the user or system administrator to use a sensible value for a particular environment or set of hardware (e.g., a value suitable for 100 Mbps Ethernet is probably not suitable for Gigabit Ethernet, and may require a different value for 10 Gigabit Ethernet).
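
As an illustration, the TCP BTL exposes this boundary as the btl_tcp_eager_limit MCA parameter (verify the exact name and its default with "ompi_info --param btl tcp" on your installation); the value used here is purely an example:

shell$ mpirun --mca btl_tcp_eager_limit 65536 -np 4 a.out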

Note that MCA parameters may be set in several different ways (described in another FAQ entry). This allows, for example, system administrators to fine-tune the Open MPI installation for their hardware / environment such that normal users can simply use the default values.

More specifically, HPC environments -- and the applications that run on them -- tend to be unique. Providing extensive run-time tuning capabilities through MCA parameters allows the customization of Open MPI to each system's / user's / application's particular needs.


195. What frameworks are in Open MPI?

There are three types of frameworks in Open MPI: those in the MPI layer (OMPI), those in the run-time layer (ORTE), and those in the operating system / platform layer (OPAL).

The specific list of frameworks varies between each major release series of Open MPI. See the links below to FAQ entries for specific versions of Open MPI:


196. What frameworks are in Open MPI v1.2 (and prior)?

The comprehensive list of frameworks in Open MPI is continually being augmented. As of August 2005, here is the current list:

OMPI frameworks

  • allocator: Memory allocator
  • bml: BTL management layer (managing multiple devices)
  • btl: Byte transfer layer (point-to-point byte movement)
  • coll: MPI collective algorithms
  • io: MPI-2 I/O functionality
  • mpool: Memory pool management
  • pml: Point-to-point management layer (fragmenting, reassembly, top-layer protocols, etc.)
  • osc: MPI-2 one-sided communication
  • ptl: (outdated / deprecated) MPI point-to-point transport layer
  • rcache: Memory registration management
  • topo: MPI topology information

ORTE frameworks

  • errmgr: Error manager
  • gpr: General purpose registry
  • iof: I/O forwarding
  • ns: Name server
  • oob: Out-of-band communication
  • pls: Process launch subsystem
  • ras: Resource allocation subsystem
  • rds: Resource discovery subsystem
  • rmaps: Resource mapping subsystem
  • rmgr: Resource manager (upper meta layer for all other Resource frameworks)
  • rml: Remote messaging layer (routing of OOB messages)
  • schema: Name schemas
  • sds: Startup discovery services
  • soh: State of health

OPAL frameworks

  • maffinity: Memory affinity
  • memory: Memory hooks
  • paffinity: Processor affinity
  • timer: High-resolution timers


197. What frameworks are in Open MPI v1.3?

The comprehensive list of frameworks in Open MPI is continually being augmented. As of November 2008, here is the current list in the Open MPI v1.3 series:

OMPI frameworks

  • allocator: Memory allocator
  • bml: BTL management layer
  • btl: MPI point-to-point Byte Transfer Layer, used for MPI point-to-point messages on some types of networks
  • coll: MPI collective algorithms
  • crcp: Checkpoint/restart coordination protocol
  • dpm: MPI-2 dynamic process management
  • io: MPI-2 I/O
  • mpool: Memory pooling
  • mtl: Matching transport layer, used for MPI point-to-point messages on some types of networks
  • osc: MPI-2 one-sided communications
  • pml: MPI point-to-point management layer
  • pubsub: MPI-2 publish/subscribe management
  • rcache: Memory registration cache
  • topo: MPI topology routines

ORTE frameworks

  • errmgr: RTE error manager
  • ess: RTE environment-specific services
  • filem: Remote file management
  • grpcomm: RTE group communications
  • iof: I/O forwarding
  • odls: OpenRTE daemon local launch subsystem
  • oob: Out of band messaging
  • plm: Process lifecycle management
  • ras: Resource allocation system
  • rmaps: Resource mapping system
  • rml: RTE message layer
  • routed: Routing table for the RML
  • snapc: Snapshot coordination

OPAL frameworks

  • backtrace: Debugging call stack backtrace support
  • carto: Cartography (host/network mapping) support
  • crs: Checkpoint and restart service
  • installdirs: Installation directory relocation services
  • maffinity: Memory affinity
  • memchecker: Run-time memory checking
  • memcpy: Memcpy copy support
  • memory: Memory management hooks
  • paffinity: Processor affinity
  • timer: High-resolution timers


198. How do I know what components are in my Open MPI installation?

The ompi_info command, in addition to providing a wealth of configuration information about your Open MPI installation, will list all components (and the frameworks that they belong to) that are available. These include system-provided components as well as user-provided components.


199. How do I install my own components into an Open MPI installation?

By default, Open MPI looks in two places for components at run-time (in order):

  1. $prefix/lib/openmpi/: This is the system-provided components directory, part of the installation tree of Open MPI itself.
  2. $HOME/.openmpi/components/: This is where users can drop their own components that will automatically be "seen" by Open MPI at run-time. This is ideal for developmental, private, or otherwise unstable components.

Note that the directories and search ordering used for finding components in Open MPI are, themselves, controlled by an MCA parameter. Setting the mca_component_path MCA parameter changes this value (a colon-delimited list of directories).
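
For example (the /opt/openmpi and /shared/ompi-components directories shown here are illustrative; note that the value you set replaces the default list, so include the system components directory as well if you still need it searched):

shell$ mpirun --mca mca_component_path /opt/openmpi/lib/openmpi:/shared/ompi-components ...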

Note also that components are only used on nodes where they are "visible." Hence, if your $prefix/lib/openmpi/ directory is on a local disk that is not shared via a network filesystem to other nodes where you run MPI jobs, then components that are installed to that directory will only be used by MPI jobs running on the local node.

More specifically: components have the same visibility as normal files. If you need a component to be available to all nodes where you run MPI jobs, then you need to ensure that it is visible on all nodes (typically either by installing it on all nodes for non-networked filesystem installs, or by installing it in a directory that is visible to all nodes via a networked filesystem). Open MPI does not automatically send components to remote nodes when MPI jobs are run.


200. How do I know what MCA parameters are available?

The ompi_info command can list the parameters for a given component, all the parameters for a specific framework, or all parameters. Most parameters contain a description of the parameter; all will show the parameter's current value.

For example:

shell$ ompi_info --param all all

Shows all the MCA parameters for all components that ompi_info finds, whereas:

shell$ ompi_info --param btl all

Shows all the MCA parameters for all BTL components that ompi_info finds. Finally:

shell$ ompi_info --param btl tcp

Shows all the MCA parameters for the TCP BTL component.


201. How do I set the value of MCA parameters?

There are several ways to set MCA parameters; they are searched in the order listed below.

  1. Command line: The highest-precedence method is setting MCA parameters on the command line. For example:

    shell$ mpirun --mca mpi_show_handle_leaks 1 -np 4 a.out
    

    This sets the MCA parameter mpi_show_handle_leaks to the value of 1 before running a.out with four processes. In general, the format used on the command line is "--mca <param_name> <value>".

    Note that when setting multi-word values, you need to use quotes to ensure that the shell and Open MPI understand that they are a single value. For example:

    shell$ mpirun --mca param "value with multiple words" ...
    

  2. Environment variable: Next, environment variables are searched. Any environment variable named OMPI_MCA_<param_name> will be used. For example, the following has the same effect as the previous example (for sh-flavored shells):

    shell$ OMPI_MCA_mpi_show_handle_leaks=1
    shell$ export OMPI_MCA_mpi_show_handle_leaks
    shell$ mpirun -np 4 a.out
    

    Or, for csh-flavored shells:

    shell% setenv OMPI_MCA_mpi_show_handle_leaks 1
    shell% mpirun -np 4 a.out
    

    Note that setting environment variables to values with multiple words requires quoting, such as:

    # sh-flavored shells
    shell$ OMPI_MCA_param="value with multiple words"
    
    # csh-flavored shells
    shell% setenv OMPI_MCA_param "value with multiple words"
    

  3. Aggregate MCA parameter files: Simple text files can be used to set MCA parameter values for a specific application. See this FAQ entry (Open MPI version 1.3 and higher).
  4. Files: Finally, simple text files can be used to set MCA parameter values. Parameters are set one per line (comments are permitted). For example:

    # This is a comment
    # Set the same MCA parameter as in previous examples
    mpi_show_handle_leaks = 1
    

    Note that quotes are not necessary for setting multi-word values in MCA parameter files. Indeed, if you use quotes in the MCA parameter file, they will be used as part of the value itself. For example:

    # The following two values are different:
    param1 = value with multiple words
    param2 = "value with multiple words"
    

    By default, two files are searched (in order):

    1. $HOME/.openmpi/mca-params.conf: The user-supplied set of values takes the highest precedence.
    2. $prefix/etc/openmpi-mca-params.conf: The system-supplied set of values has a lower precedence.

    More specifically, the MCA parameter mca_param_files specifies a colon-delimited path of files to search for MCA parameters. Files to the left have lower precedence; files to the right are higher precedence.

    Keep in mind that, just like components, these parameter files are only relevant where they are "visible" (see this FAQ entry). Specifically, Open MPI does not read all the values from these files during startup and then send them to all nodes in the job -- the files are read on each node during each process' startup. This is intended behavior: it allows for per-node customization, which is especially relevant in heterogeneous environments.


202. What are Aggregate MCA (AMCA) parameter files?

Starting with version 1.3, aggregate MCA (AMCA) parameter files contain MCA parameter key/value pairs similar to the $HOME/.openmpi/mca-params.conf file described in this FAQ entry.

The motivation behind AMCA parameter sets came from the realization that for certain applications a large number of MCA parameters are required for the application to run well and/or as the user expects. Since these MCA parameters are application specific (or even application run specific) they should not be set in a global manner, but only pulled in as determined by the user.

MCA parameters set in AMCA parameter files will override any MCA parameters supplied in global parameter files (e.g., $HOME/.openmpi/mca-params.conf), but not command line or environment parameters.

AMCA parameter files are typically supplied on the command line via the -am option.

For example, consider an AMCA parameter file called foo.conf placed in the same directory as the application a.out. A user will typically run the application as:

shell$ mpirun -np 2 a.out

To use the foo.conf AMCA parameter file this command line changes to:

shell$ mpirun -np 2 -am foo.conf a.out

If the user wants to override a parameter set in foo.conf they can add it to the command line as seen below.

shell$ mpirun -np 2 -am foo.conf -mca btl tcp,self a.out

AMCA parameter files can be coupled if more than one file is to be used. If we have another AMCA parameter file called bar.conf that we want to use we add it to the command line as follows:

shell$ mpirun -np 2 -am foo.conf:bar.conf a.out

AMCA parameter files are loaded in priority order. This means that the foo.conf AMCA file has priority over the bar.conf file. So if the bar.conf file sets the MCA parameter mpi_leave_pinned=0 and the foo.conf file sets it to mpi_leave_pinned=1, then the value from foo.conf (mpi_leave_pinned=1) will be used.

The locations of AMCA parameter files are resolved in a similar way as the shell. If no path operator is provided (e.g., foo.conf), then Open MPI will search the $SYSCONFDIR/amca-param-sets directory and then the current working directory. If a relative path is specified, then only that path will be searched (e.g., ./foo.conf, baz/foo.conf). If an absolute path is specified, then only that path will be searched (e.g., /bip/boop/foo.conf).

Though the typical use case for AMCA parameter files is to be specified on the command line, they can also be set as MCA parameters in the environment. The MCA parameter (mca_base_param_file_prefix) contains a ':' separated list of AMCA parameter files exactly as they would be passed to the -am command line option. The MCA parameter (mca_base_param_file_path) specifies the path to search for AMCA files with relative paths. By default this is $SYSCONFDIR/amca-param-sets/:$CWD.
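
For example, the same foo.conf:bar.conf selection shown above could be made through the environment rather than the -am option (sh-flavored shells):

shell$ OMPI_MCA_mca_base_param_file_prefix=foo.conf:bar.conf
shell$ export OMPI_MCA_mca_base_param_file_prefix
shell$ mpirun -np 2 a.out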


203. How do I select which components are used?

Each MCA framework has a top-level MCA parameter that helps guide which components are selected to be used at run-time. Specifically, there is an MCA parameter of the same name as each MCA framework that can be used to include or exclude components from a given run.

For example, the btl MCA parameter is used to control which BTL components are used (i.e., MPI point-to-point communications; see this FAQ entry for a full list of MCA frameworks). It can take as a value a comma-separated list of components with the optional prefix "^". For example:

# Tell Open MPI to exclude the tcp and openib BTL components
# and implicitly include all the rest
shell$ mpirun --mca btl ^tcp,openib ...

# Tell Open MPI to include *only* the components listed here and
# implicitly ignore all the rest (i.e., the loopback, shared memory,
# and OpenFabrics (a.k.a., "OpenIB") MPI point-to-point components):
shell$ mpirun --mca btl self,sm,openib ...

Note that ^ can only be the prefix of the entire value because the inclusive and exclusive behavior are mutually exclusive. Specifically, since the exclusive behavior means "use all components except these," it does not make sense to mix it with the inclusive behavior of not specifying it (i.e., "use all of these components"). Hence, something like this:

shell$ mpirun --mca btl self,sm,openib,^tcp ...

does not make sense because it says both "use only the self, sm, and openib components" and "use all components except tcp" and will result in an error.

Just as with all MCA parameters, the btl parameter (and all framework parameters) can be set in multiple different ways.
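
For example, a system administrator could make the inclusive selection above the default for all users by putting it in the system-wide parameter file described in the FAQ entry on setting MCA parameters (the component list shown is just an example):

# In $prefix/etc/openmpi-mca-params.conf
btl = self,sm,openib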


204. What is processor affinity? Does Open MPI support it?

Open MPI supports processor affinity on a variety of systems through process binding, in which each MPI process, along with its threads, is "bound" to a specific subset of processing resources (cores, sockets, etc.). That is, the operating system will constrain that process to run on only that subset. (Other processes might be allowed on the same resources.)

Affinity can improve performance by inhibiting excessive process movement -- for example, away from "hot" caches or NUMA memory. Judicious bindings can improve performance by reducing resource contention (by spreading processes apart from one another) or improving interprocess communications (by placing processes close to one another). Binding can also improve performance reproducibility by eliminating variable process placement. Unfortunately, binding can also degrade performance by inhibiting the OS's ability to balance loads.

You can run the "ompi_info" command and look for "paffinity" components to see if your system is supported. For example:

$ ompi_info | grep paffinity
           MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.0)

Note that processor affinity probably should not be used when a node is over-subscribed (i.e., more processes are launched than there are processors). This can lead to a serious degradation in performance (even more than simply oversubscribing the node). Open MPI will usually detect this situation and automatically disable the use of processor affinity (and display run-time warnings to this effect).

Also see this FAQ entry for how to use processor and memory affinity in Open MPI.


205. What is memory affinity? Does Open MPI support it?

Memory affinity is only relevant for Non-Uniform Memory Access (NUMA) machines, such as "big iron" SGI and Cray machines, or many models of multi-processor Opteron machines. In a NUMA architecture, memory is physically distributed throughout the machine even though it is virtually treated as a single address space. That is, memory may be physically local to one or more processors -- and therefore remote to other processors.

Simply put: some memory will be faster to access (for a given process) than others.

Open MPI supports general and specific memory affinity, meaning that it generally tries to allocate all memory local to the processor that asked for it. When shared memory is used for communication, Open MPI uses memory affinity to make certain pages local to specific processes in order to minimize memory network/bus traffic.

Open MPI supports memory affinity on a variety of systems. You can run the "ompi_info" command and look for "maffinity" components to see if your system is supported. For example:

$ ompi_info | grep maffinity
           MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.0)

Note that memory affinity support is enabled only when processor affinity is enabled. Specifically: using memory affinity does not make sense if processor affinity is not enabled because processes may allocate local memory and then move to a different processor, potentially remote from the memory that it just allocated.

Also see this FAQ entry for how to use processor and memory affinity in Open MPI.


206. How do I tell Open MPI to use processor and/or memory affinity?

Assuming that your system supports processor and memory affinity (check ompi_info for "paffinity" and "maffinity" components), you can explicitly tell Open MPI to use them when running MPI jobs.

Note that memory affinity support is enabled only when processor affinity is enabled. Specifically: using memory affinity does not make sense if processor affinity is not enabled because processes may allocate local memory and then move to a different processor, potentially remote from the memory that it just allocated.

Also note that processor and memory affinity is meaningless (but harmless) on uniprocessor machines.

How to enable / use processor and memory affinity in Open MPI depends on which version you are using:


207. How do I tell Open MPI to use processor and/or memory affinity in Open MPI v1.2.x? (What is mpi_paffinity_alone?)

Open MPI 1.2 offers only crude control, with the MCA parameter "mpi_paffinity_alone". For example:

$ mpirun --mca mpi_paffinity_alone 1 -np 4 a.out

(Just like any other MCA parameter, mpi_paffinity_alone can be set via any of the normal MCA parameter mechanisms.)

On each node where your job is running, your job's MPI processes will be bound, one-to-one, in the order of their global MPI ranks, to the lowest-numbered processing units (for example, cores or hardware threads) on the node as identified by the OS. Further, memory affinity will also be enabled if it is supported on the node, as described in a different FAQ entry.

If multiple jobs are launched on the same node in this manner, they will compete for the same processing units and severe performance degradation will likely result. Therefore, this MCA parameter is best used when you know your job will be "alone" on the nodes where it will run.

Since each process is bound to a single processing unit, performance will likely suffer catastrophically if processes are multi-threaded.

Depending on how processing units on your node are numbered, the binding pattern may be good, bad, or even disastrous. For example, performance might be best if processes are spread out over all processor sockets on the node. The processor ID numbering, however, might lead to mpi_paffinity_alone filling one socket before moving to another. Indeed, on nodes with multiple hardware threads per core (e.g., "HyperThreads", "SMT", etc.), the numbering could lead to multiple processes being bound to a core before the next core is considered. In such cases, you should probably upgrade to a newer version of Open MPI or use a different, external mechanism for processor binding.

Note that Open MPI will automatically disable processor affinity on any node that is oversubscribed (i.e., where more Open MPI processes are launched in a single job on a node than it has processors) and will print out warnings to that effect.

Also note, however, that processor affinity is not mutually exclusive with Degraded performance mode. Degraded mode is usually only used when oversubscribing nodes (i.e., running more processes on a node than it has processors -- see this FAQ entry for more details about oversubscribing, as well as a definition of Degraded performance mode). It is possible to manually select Degraded performance mode and use processor affinity as long as you are not oversubscribing.


208. How do I tell Open MPI to use processor and/or memory affinity in Open MPI v1.3.x? (What are rank files?)

Open MPI 1.3 supports the mpi_paffinity_alone MCA parameter that is described in this FAQ entry.

Open MPI 1.3 (and higher) also allows a different binding to be specified for each process via a rankfile. Consider the following example:

shell$ cat rankfile
rank 0=host0 slot=2
rank 1=host1 slot=4-7,0
rank 2=host2 slot=1:0
rank 3=host3 slot=1:2-3
shell$ mpirun -np 4 -hostfile hostfile --rankfile                 rankfile ./my_mpi_application
  or
shell$ mpirun -np 4 -hostfile hostfile --mca rmaps_rank_file_path rankfile ./my_mpi_application

The rank file specifies a host node and slot list binding for each MPI process in your job. Note:

  • Typically, the slot list is a comma-delimited list of ranges. The numbering is OS/BIOS-dependent and refers to the finest grained processing units identified by the OS -- for example, cores or hardware threads.
  • Alternatively, a colon can be used in the slot list for socket:core designations. For example, 1:2-3 means cores 2-3 of socket 1.
  • It is strongly recommended that you provide a full rankfile when using such affinity settings, otherwise there would be a very high probability of processor oversubscription and performance degradation.
  • The hosts specified in the rankfile must be known to mpirun, for example via a list of hosts in a hostfile or as obtained from a resource manager.
  • The number of processes np must be provided on the mpirun cmd line.
  • If some processing units are not available -- e.g., due to unpopulated sockets, idled cores, or BIOS settings -- the syntax assumes a logical numbering in which numbers are contiguous despite the physical gaps. You may refer to actual physical numbers with a "p" prefix. For example, rank 4=host3 slot=p3:2 will bind rank4 to the physical socket3 : physical core2 pair.

Rank files are also discussed on the mpirun man page.

If you want to use the same "slot list" binding for each process, presumably in cases where there is only one process per node, you can specify this slot list on the command line rather than having to use a rank file:

shell$ mpirun -np 4 -hostfile hostfile --slot-list 0:1 ./my_mpi_application

Remember, every process will use the same slot list. If multiple processes run on the same host, they will bind to the same resources -- in this case, socket0:core1, presumably oversubscribing that core and ruining performance.

Slot lists can be used to bind to multiple slots, which would be helpful for multi-threaded processes. For example:

  • Two threads per process: rank 0=host1 slot=0,1
  • Four threads per process: rank 0=host1 slot=0,1,2,3

Note that no thread will be bound to a specific slot within the list. OMPI only supports process level affinity; each thread will be bound to all of the slots within the list.


209. How do I tell Open MPI to use processor and/or memory affinity in Open MPI v1.4.x? (How do I use the --by* and --bind-to-* options?)

Open MPI 1.4 supports all the same processor affinity controls as Open MPI v1.3, but also supports additional command-line binding switches to mpirun:

  • --bind-to-none: Do not bind processes. (Default)
  • --bind-to-core: Bind each MPI process to a core.
  • --bind-to-socket: Bind each MPI process to a processor socket.
  • --report-bindings: Report how the launched processes were bound by Open MPI.

In the case of cores with multiple hardware threads (e.g., HyperThreads or SMT), only the first hardware thread on each core is used with the --bind-to-* options. This will hopefully be fixed in the Open MPI v1.5 series.

The above options are typically most useful when used with the following switches that indicate how processes are to be laid out in the MPI job. To be clear: if the following options are used without a --bind-to-* option, they only have the effect of deciding which node a process will run on. Only the --bind-to-* options actually bind a process to a specific (set of) hardware resource(s).

  • --byslot: Alias for --bycore.
  • --bycore: When laying out processes, put sequential MPI processes on adjacent processor cores. (Default)
  • --bysocket: When laying out processes, put sequential MPI processes on adjacent processor sockets.
  • --bynode: When laying out processes, put sequential MPI processes on adjacent nodes.

Note that --bycore and --bysocket lay processes out in terms of the actual hardware rather than by some node-dependent numbering, which is what mpi_paffinity_alone does as described in this FAQ entry.
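
For example, to lay out sequential MPI processes on adjacent processor sockets and bind each process to its socket, while printing the resulting bindings:

shell$ mpirun -np 4 --bysocket --bind-to-socket --report-bindings ./a.out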

Finally, there is a poorly-named "combination" option that affects both process layout counting and binding: --cpus-per-proc (and an even more poorly-named alias --cpus-per-rank).

Editor's note: I feel that these options are poorly named for two reasons: 1) "cpu" is not consistently defined (e.g., it may be a core, or may be a hardware thread, or it may be something else), and 2) even though many users use the terms "rank" and "MPI process" interchangeably, they are NOT the same thing.

This option does the following:

  • Takes an integer argument (ncpus) that indicates how many operating system processor IDs (which may be cores or may be hardware threads) should be bound to each MPI process.
  • Allocates and binds ncpus OS processor IDs to each MPI process. For example, on a machine with 4 processor sockets, each with 4 processor cores, each with one hardware thread:

    shell$ mpirun -np 8 --cpus-per-proc 2 my_mpi_process
    

    This command will bind each MPI process to ncpus=2 cores. All cores on the machine will be used.
  • Note that ncpus cannot be more than the number of OS processor IDs in a single processor socket. Put loosely: --cpus-per-proc only allows binding to multiple cores/threads within a single socket.

The --cpus-per-proc option can also be used with the --bind-to-* options in some cases, but this code is not well tested and may result in unexpected binding behavior. Test carefully to see where processes actually get bound before relying on the behavior for production runs. The --cpus-per-proc and other affinity-related command line options are likely to be revamped some time during the Open MPI v1.5 series.


210. How do I tell Open MPI to use processor and/or memory affinity in Open MPI v1.5.x?

Open MPI 1.5 currently has the same processor affinity controls as Open MPI v1.4. This FAQ entry is a placemarker for future enhancements to the 1.5 series' processor and memory affinity features.

Stay tuned!


211. Does Open MPI support calling fork(), system(), or popen() in MPI processes?

It depends on a lot of factors, including (but not limited to) the operating system, the underlying network stack, and the interconnect support libraries that your Open MPI installation uses.

In some cases, Open MPI will determine that it is not safe to fork(). In these cases, Open MPI will register a pthread_atfork() callback to print a warning when the process forks.

This warning is helpful for legacy MPI applications where the current maintainers are unaware that system() or popen() is being invoked from an obscure subroutine nestled deep in millions of lines of Fortran code (we've seen this kind of scenario many times).

However, this atfork handler can be dangerous because there is no way to unregister an atfork handler. Hence, packages that dynamically open Open MPI's libraries (e.g., Python bindings for Open MPI) may fail if they finalize and unload libmpi, but later call fork. The atfork system will try to invoke Open MPI's atfork handler; nothing good can come of that.

For such scenarios, or if you simply want to disable printing the warning, Open MPI can be set to never register the atfork handler with the mpi_warn_on_fork MCA parameter. For example:

shell$ mpirun --mca mpi_warn_on_fork 0 ...

Of course, systems that dlopen libmpi may not use Open MPI's mpirun, and therefore may need to use a different mechanism to set MCA parameters.
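
In such cases, the same parameter can be set through the environment instead (sh-flavored shells; the Python invocation below is just an illustration of a program that dynamically loads libmpi):

shell$ OMPI_MCA_mpi_warn_on_fork=0
shell$ export OMPI_MCA_mpi_warn_on_fork
shell$ python my_mpi_script.py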


212. I want to run some performance benchmarks with Open MPI. How do I do that?

Running benchmarks is an extremely difficult task to do correctly. There are many, many factors to take into account; it is not as simple as just compiling and running a stock benchmark application. This FAQ entry is by no means a definitive guide, but it does try to offer some suggestions for generating accurate, meaningful benchmarks.

  1. Decide exactly what you are benchmarking and set up your system accordingly. For example, if you are trying to benchmark maximum performance, then many of the suggestions listed below are extremely relevant (be the only user on the systems and network in question, be the only software running, use processor affinity, etc.). If you're trying to benchmark average performance, some of the suggestions below may be less relevant. Regardless, it is critical to know exactly what you're trying to benchmark, and know (not guess) both your system and the benchmark application itself well enough to understand what the results mean.

    To be specific, many benchmark applications are not well understood for exactly what they are testing. There have been many cases where users run a given benchmark application and wrongfully conclude that their system's performance is bad -- solely on the basis of a single benchmark that they did not understand. Read the documentation of the benchmark carefully, and possibly even look into the code itself to see exactly what it is testing.

    Case in point: not all ping-pong benchmarks are created equal. Most users assume that a ping-pong benchmark is a ping-pong benchmark is a ping-pong benchmark. But this is not true; the common ping-pong benchmarks tend to test subtly different things (e.g., NetPIPE, TCP bench, IMB, OSU, etc.). Make sure you understand what your benchmark is actually testing.

  2. Make sure that you are the only user on the systems where you are running the benchmark to eliminate contention from other processes.
  3. Make sure that you are the only user on the entire network / interconnect to eliminate network traffic contention from other processes. This is usually somewhat difficult to do, especially in larger, shared systems. But your most accurate, repeatable results will be achieved when you are the only user on the entire network.
  4. Disable all services and daemons that are not being used. Even "harmless" daemons consume system resources (such as RAM) and cause "jitter" by occasionally waking up, consuming CPU cycles, reading or writing to disk, etc. The optimum benchmark system has an absolute minimum number of system services running.
  5. Use processor affinity on multi-processor/core machines to prevent the operating system from moving MPI processes between processors (and causing unnecessary cache thrashing, for example).

    On NUMA architectures, having processes bumped from one socket to another is more expensive in terms of cache locality (with all of the cache coherency overhead that comes with the loss of it) than in terms of HyperTransport routing (see below).

    Non-NUMA architectures such as the Intel Woodcrest have a flat access time to the South Bridge, but cache locality is still important so CPU affinity is always a good thing to do.

  6. Be sure to understand your system's architecture, particularly with respect to the memory, disk, and network characteristics, and test accordingly. For example, on NUMA architectures (the most common being Opteron), the South Bridge is connected through a HyperTransport link to one CPU on one socket. Which socket depends on the motherboard, but it should be described in the motherboard documentation (it's not always socket 0!). If a process on the other socket needs to write something to a NIC on a PCIE bus behind the South Bridge, it needs to first hop through the first socket. On modern machines (circa late 2006), this hop usually costs something like 100ns (i.e., 0.1 us). If the socket is further away, like in a 4- or 8-socket configuration, there could potentially be more hops, leading to more latency.
  7. Compile your benchmark with the appropriate compiler optimization flags. With some MPI implementations, the compiler wrappers (like mpicc, mpif90, etc.) add optimization flags automatically. Open MPI does not. Add -O or other flags explicitly.
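
    For example (the -O3 level shown here is just an illustration; use whatever optimization flags are appropriate for your compiler and benchmark):

    shell$ mpicc -O3 mybenchmark.c -o mybenchmark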

  8. Make sure your benchmark runs for a sufficient amount of time. Short-running benchmarks are generally less accurate because they take fewer samples; longer-running jobs tend to take more samples.

  9. If your benchmark is trying to benchmark extremely short events (such as the time required for a single ping-pong of messages):

    • Perform some "warmup" events first. Many MPI implementations (including Open MPI) -- and other subsystems that MPI uses -- may use "lazy" semantics to set up and maintain streams of communications. Hence, the first event (or first few events) may well take significantly longer than subsequent events.
    • Use a high-resolution timer if possible -- gettimeofday() only returns millisecond precision (sometimes on the order of several microseconds).
    • Run the event many, many times (hundreds or thousands, depending on the event and the time it takes). Not only does this provide more samples, it may also be necessary, especially when the precision of the timer you're using may be several orders of magnitude less than the event you're trying to benchmark.

  10. Decide whether you are reporting minimum, average, or maximum numbers, and have good reasons why.
  11. Accurately label and report all results. Reproducibility is a major goal of benchmarking; benchmark results are effectively useless if they are not precisely labeled as to exactly what they are reporting. Keep a log and detailed notes about the exact system configuration that you are benchmarking. Note, for example, all hardware and software characteristics (to include hardware, firmware, and software versions as appropriate).


213. I am getting a MPI_Win_free error from IMB-EXT -- what do I do?

When you run IMB-EXT with Open MPI, you'll see a message like this:

[node01.example.com:2228] *** An error occurred in MPI_Win_free
[node01.example.com:2228] *** on win 
[node01.example.com:2228] *** MPI_ERR_RMA_SYNC: error while executing rma sync
[node01.example.com:2228] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)

This is due to a bug in the Intel MPI Benchmarks, known to be in at least versions v3.1 and v3.2. Intel was notified of this bug in May of 2009, but there hasn't been a new IMB release since then.

Here is a small patch that fixes the bug in IMB v3.2:

diff -u imb-3.2-orig/src/IMB_window.c imb-3.2-fixed/src/IMB_window.c
--- imb-3.2-orig/src/IMB_window.c     2008-10-21 04:17:31.000000000 -0400
+++ imb-3.2-fixed/src/IMB_window.c      2009-07-20 09:02:45.000000000 -0400
@@ -140,6 +140,9 @@
                          c_info->rank, 0, 1, c_info->r_data_type,
                          c_info->WIN);
           MPI_ERRHAND(ierr);
           }
+          /* Added a call to MPI_WIN_FENCE, per MPI-2.1 11.2.1 */
+          ierr = MPI_Win_fence(0, c_info->WIN);
+          MPI_ERRHAND(ierr);
           ierr = MPI_Win_free(&c_info->WIN);
           MPI_ERRHAND(ierr);
           }

And here is the corresponding patch for IMB v3.1:

Index: IMB_3.1/src/IMB_window.c
===================================================================
--- IMB_3.1/src/IMB_window.c(revision 1641)
+++ IMB_3.1/src/IMB_window.c(revision 1642)
@@ -140,6 +140,10 @@
                          c_info->rank, 0, 1, c_info->r_data_type, c_info->WIN);
           MPI_ERRHAND(ierr);
           }
+          /* Added a call to MPI_WIN_FENCE here, per MPI-2.1
+             11.2.1 */
+          ierr = MPI_Win_fence(0, c_info->WIN);
+          MPI_ERRHAND(ierr);
           ierr = MPI_Win_free(&c_info->WIN);
           MPI_ERRHAND(ierr);
 }


214. What is the sm BTL?

The sm BTL (shared-memory Byte Transfer Layer) is a low-latency, high-bandwidth mechanism for transferring data between two processes via shared memory. This BTL can only be used between processes executing on the same node.

The sm BTL has high exclusivity. That is, if one process can reach another process via sm, then no other BTL will be considered for that connection.

Note that with OMPI 1.3.2, the sm so-called "FIFOs" were reimplemented and the sizing of the shared-memory area was changed. So, much of this FAQ will distinguish between releases up to OMPI 1.3.1 and releases starting with OMPI 1.3.2.


215. How do I specify use of sm for MPI messages?

Typically, it is unnecessary to do so; OMPI will use the best BTL available for each communication.

Nevertheless, you may use the MCA parameter btl. You should also specify the self BTL for communications between a process and itself. Further, if not all processes in your job will run on the same, single node, then you also need to specify a BTL for internode communications. For example:

shell$ mpirun --mca btl self,sm,tcp -np 16 ./a.out


216. How does the sm BTL work?

A point-to-point user message is broken up by the PML into fragments. The sm BTL only has to transfer individual fragments. The steps are:

  • The sender pulls a shared-memory fragment out of one of its free lists. Each process has one free list for smaller (e.g., 4Kbyte) eager fragments and another free list for larger (e.g., 32Kbyte) max fragments.
  • The sender packs the user-message fragment into this shared-memory fragment, including any header information.
  • The sender posts a pointer to this shared fragment into the appropriate FIFO (first-in-first-out) queue of the receiver.
  • The receiver polls its FIFO(s). When it finds a new fragment pointer, it unpacks data out of the shared-memory fragment and notifies the sender that the shared fragment is ready for reuse (to be returned to the sender's free list).

On each node where an MPI job has two or more processes running, the job creates a file that each process mmaps into its address space. Shared-memory resources that the job needs -- such as FIFOs and fragment free lists -- are allocated from this shared-memory area.


217. Why does my MPI job no longer start when there are too many processes on one node?

If you are using OMPI 1.3.1 or earlier, it is possible that the shared-memory area set aside for your job was not created large enough. Make sure you're running in 64-bit mode (compiled with -m64) and set the MCA parameter mpool_sm_max_size to be very large -- even several Gbytes. Exactly how large is discussed further below.

Regardless of which OMPI release you're using, make sure that there is sufficient space for a large file to back the shared memory -- typically in /tmp.


218. How do I know what MCA parameters are available for tuning MPI performance?

The ompi_info command can display all the parameters available for the sm BTL and sm mpool:

shell$ ompi_info --param  btl  sm
shell$ ompi_info --param mpool sm


219. How can I tune these parameters to improve performance?

Mostly, the default values of the MCA parameters have already been chosen to give good performance. To improve performance further is a little bit of an art. Sometimes, it's a matter of trading off performance for memory.

btl_sm_eager_limit: If message data plus header information fits within this limit, the message is sent "eagerly" -- that is, a sender attempts to write its entire message to shared buffers without waiting for a receiver to be ready. Above this size, a sender will only write the first part of a message, then wait for the receiver to acknowledge that it is ready before continuing. Eager sends can improve performance by decoupling senders from receivers.

btl_sm_max_send_size: Large messages are sent in fragments of this size. Larger fragments can lead to greater efficiency, though they could perhaps also inhibit pipelining between sender and receiver.

btl_sm_num_fifos: Starting in OMPI 1.3.2, this is the number of FIFOs per receiving process. By default, there is only one FIFO per process. Conceivably, if many senders are all sending to the same process and contending for a single FIFO, there will be congestion. If there are many FIFOs, however, the receiver must poll more FIFOs to find incoming messages. Therefore, you might try increasing this parameter slightly if you have many (at least dozens of) processes all sending to the same process. For example, if 100 senders are all contending for a single FIFO for a particular receiver, it may suffice to increase btl_sm_num_fifos from 1 to 2.
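
On the command line, that would look like (the process count shown is illustrative):

shell$ mpirun --mca btl_sm_num_fifos 2 -np 32 ./a.out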

btl_sm_fifo_size: Starting in OMPI 1.3.2, FIFOs no longer grow dynamically. If you believe a FIFO is getting congested because a process falls far behind in reading incoming message fragments, increase this size manually.

btl_sm_free_list_num: This is the initial number of fragments on each (eager and max) free list. The free lists can grow in response to resource congestion, but you can increase this parameter to pre-reserve space for more fragments.

mpool_sm_min_size: You can reserve headroom for the shared-memory area to grow by increasing this parameter.


220. Where is the file that sm will mmap in?

The file will be in the OMPI session directory, which is typically something like /tmp/openmpi-sessions-myusername@mynodename* . The file itself will have the name shared_mem_pool.mynodename. For example, the full path could be /tmp/openmpi-sessions-myusername@node0_0/1543/1/shared_mem_pool.node0.

To place the session directory in a non-default location, use the MCA parameter orte_tmpdir_base.
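
For example (the directory shown is illustrative; make sure it is on a local filesystem):

shell$ mpirun --mca orte_tmpdir_base /local/scratch -np 4 a.out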


221. Why am I seeing incredibly poor performance with the sm BTL?

The most common problem with the shared memory BTL is when the Open MPI session directory is placed on a network filesystem (e.g., if /tmp is not a local disk). This is because the shared memory BTL places a memory-mapped file in the Open MPI session directory (see this entry for more details). If the session directory is located on a network filesystem, the shared memory BTL latency will be extremely high.

Try not mounting /tmp as a network filesystem, and/or moving the Open MPI session directory to a local filesystem.

Some users have reported success and possible performance optimizations with having /tmp mounted as a "tmpfs" filesystem (i.e., a RAM-based filesystem). However, before configuring your system this way, be aware of a few items:

  1. Open MPI writes a few small meta data files into /tmp and may therefore consume some extra memory that could have otherwise been used for application instruction or data state.
  2. If you use the "filem" system in Open MPI for moving executables between nodes, these files are stored under /tmp.
  3. Open MPI's checkpoint / restart files can also be saved under /tmp.
  4. If the Open MPI job is terminated abnormally, there are some circumstances where files (including memory-mapped shared memory files) can be left in /tmp. This can happen, for example, when a resource manager forcibly kills an Open MPI job and does not give it the chance to clean up /tmp files and directories.

Some users have reported success with configuring their resource manager to run a script between jobs to forcibly empty the /tmp directory.


222. Can I use SysV instead of mmap?

In the 1.3 and 1.4 Open MPI series, shared memory is established via mmap. In future releases, there may be an option for using SysV shared memory.


223. How much shared memory will my job use?

Your job will create a shared-memory area on each node where it has two or more processes. This area will be fixed during the lifetime of your job. Shared-memory allocations (for FIFOs and fragment free lists) will be made in this area. Here, we look at the size of that shared-memory area.

If you want just one, hard number, then go with approximately 128 Mbytes per node per job, shared by all the job's processes on that node. That is, an OMPI job will need more than a few Mbytes per node, but typically less than a few Gbytes.

Better yet, read on.

Up through OMPI 1.3.1, the shared-memory file would basically be sized:

  nbytes = n * mpool_sm_per_peer_size
  if ( nbytes < mpool_sm_min_size ) nbytes = mpool_sm_min_size
  if ( nbytes > mpool_sm_max_size ) nbytes = mpool_sm_max_size

where n is the number of processes in the job running on that particular node and the mpool_sm_* are MCA parameters. For small n, this size is typically excessive. For large n (e.g., 128 MPI processes on the same node), this size may not be sufficient for the job to start.

Starting in OMPI 1.3.2, a more sophisticated formula was introduced to model more closely how much memory was actually needed. That formula is somewhat complicated and subject to change. It guarantees that there will be at least enough shared memory for the program to start up and run. See this FAQ item to see how much is needed. Alternatively, the motivated user can examine the OMPI source code to see the formula used -- for example, here is the formula in OMPI revision SVN r20906.

OMPI 1.3.2 also uses the MCA parameter mpool_sm_min_size to set a minimum size -- e.g., so that there is not only enough shared memory for the job to start, but additionally headroom for further shared-memory allocations (e.g., of more eager or max fragments).

Once the shared-memory area is established, it will not grow further during the course of the MPI job's run.


224. How much shared memory do I need?

In most cases, OMPI will start your job with sufficient shared memory.

Nevertheless, if OMPI doesn't get you enough shared memory (e.g., you're using OMPI 1.3.1 or earlier with roughly 128 processes or more on a single node) or you want to trim shared-memory consumption, you may want to know how much shared memory is really needed.

As we saw earlier, the shared memory area contains:

  • FIFOs
  • eager fragments
  • max fragments

In general, you need only enough shared memory for the FIFOs and fragments that are allocated during MPI_Init().

Beyond that, you might want additional shared memory for performance reasons, so that FIFOs and fragment lists can grow if your program's message traffic encounters resource congestion. Even if there is no room to grow, however, your correctly written MPI program should still run to completion in the face of congestion; performance simply degrades somewhat. Note that while shared-memory resources can grow after MPI_Init(), they cannot shrink.

So, how much shared memory is needed during MPI_Init() ? You need approximately the total of:

  • FIFOs:
    • (≤ OMPI 1.3.1):     3 × n × n × pagesize
    • (≥ OMPI 1.3.2):     n × btl_sm_num_fifos × btl_sm_fifo_size × sizeof(void *)
  • eager fragments:     n × ( 2 × n + btl_sm_free_list_inc ) × btl_sm_eager_limit
  • max fragments:     n × btl_sm_free_list_num × btl_sm_max_send_size

where

  • n is the number of MPI processes in your job on the node
  • pagesize is the OS page size (4K for Linux and 8K for Solaris)
  • btl_sm_* are MCA parameters


225. How can I decrease my shared-memory usage?

There are two parts to this question.

First, how does one reduce how big the mmap file is? The answer is:

  • up to OMPI 1.3.1: reduce mpool_sm_per_peer_size, mpool_sm_min_size, and mpool_sm_max_size
  • starting with OMPI 1.3.2: reduce mpool_sm_min_size

Second, how does one reduce how much shared memory is needed? (Just making the mmap file smaller doesn't help if then your job won't start up.) The answers are:

  • For small values of n -- that is, for few processes per node -- shared-memory usage during MPI_Init() is predominantly for max free lists. So, you can reduce the MCA parameter btl_sm_max_send_size. Alternatively, you could reduce btl_sm_free_list_num, but it is already pretty small by default.
  • For large values of n -- that is, for many processes per node -- there are two cases:
    • up to OMPI 1.3.1: shared-memory usage is dominated by the FIFOs, which consume a certain number of pages. Usage is high and cannot be reduced much via MCA parameter tuning.
    • starting with OMPI 1.3.2: shared-memory usage is dominated by the eager free lists. So, you can reduce the MCA parameter btl_sm_eager_limit.


226. How do I specify to use the TCP network for MPI messages?

In general, you specify that the tcp BTL component should be used. However, note that you should also specify that the self BTL component should be used. self is for loopback communication (i.e., when an MPI process sends to itself), and is technically a different communication channel than TCP. For example:

shell$ mpirun --mca btl tcp,self ...

Failure to specify the self BTL may result in Open MPI being unable to complete send-to-self scenarios (meaning that your program will run fine until a process tries to send to itself).

Note that if the tcp BTL is available at run time (which it should be on most POSIX-like systems), Open MPI should automatically use it by default (ditto for self). Hence, it's usually unnecessary to specify these options on the mpirun command line. They are typically only used when you want to be absolutely positively definitely sure to use the specific BTL.

If you are using a high-speed network (such as Myrinet or InfiniBand), be sure to also see this FAQ entry.


227. But wait -- I'm using a high-speed network. Do I have to disable the TCP BTL?

No. Following the so-called "Law of Least Astonishment", Open MPI assumes that if you have both a TCP network and at least one high-speed network (such as Myrinet or InfiniBand), you will likely only want to use the high-speed network(s) for MPI message passing. Hence, the tcp BTL component will sense this and automatically deactivate itself.

That being said, Open MPI may still use TCP for setup and teardown information -- so you'll see traffic across your TCP network during startup and shutdown of your MPI job. This is normal and does not affect the MPI message passing channels.


228. How do I know what MCA parameters are available for tuning MPI performance?

The ompi_info command can display all the parameters available for the tcp BTL component (i.e., the component that uses TCP for MPI communications):

shell$ ompi_info --param btl tcp

NOTE: Starting with the Open MPI 1.7 series, ompi_info will only show a few MCA parameters by default. You will need to specify --level 9 (or --all) to show all MCA parameters. For example:

shell$ ompi_info --param btl tcp --level 9

or

shell$ ompi_info --all


229. Does Open MPI use the TCP loopback interface?

Usually not.

In general message passing usage, there are two scenarios where the TCP loopback interface could be used:

  1. Sending a message from one process to itself
  2. Sending a message from one process to another process on the same machine

The TCP BTL does not handle "send-to-self" scenarios in Open MPI; indeed, it is not even capable of doing so. Instead, the self BTL component is used for all send-to-self MPI communications (this allows all Open MPI BTL components to avoid special case code for send-to-self scenarios). The self component uses its own mechanisms for send-to-self scenarios; it does not use network interfaces.

When sending to processes on the same machine, Open MPI will default to using the shared memory (sm) BTL. If the user has deactivated this BTL, depending on what other BTL components are available, it is possible that the TCP BTL will be chosen for message passing to processes on the same node, in which case the TCP loopback device will likely be used. But this is not the default; either shared memory has to fail to start up properly or the user must specifically request not to use the shared memory BTL.


230. I have multiple TCP networks on some/all of my cluster nodes. Which ones will Open MPI use?

In general, Open MPI will greedily use all TCP networks that it finds per its reachability computations.

To change this behavior, you can either specifically include certain networks or specifically exclude certain networks. See this FAQ entry for more details.


231. I'm getting TCP-related errors. What do they mean?

TCP-related errors are usually reported by Open MPI in a message similar to these:

btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
mca_btl_tcp_frag_send: writev failed with errno=104 

If an error number is displayed with no explanation string, you can see what that specific error number means on your operating system with the following command (the following example was run on Linux; results may be different on other operating systems):

shell$ perl -e 'die$!=113'
No route to host at -e line 1.
shell$ perl -e 'die$!=104'
Connection reset by peer at -e line 1.

Two types of errors are commonly reported to the Open MPI user's mailing list:

  1. "No route to host" (errno 113, shown above): this usually means that Open MPI tried to make a TCP connection over an interface that is not actually reachable from the peer (for example, a local-only or private interface); see the FAQ entry below about telling Open MPI which TCP interfaces to use.
  2. "Connection reset by peer" (errno 104, shown above): this usually occurs after MPI_INIT has completed and typically indicates that a peer MPI process died or exited unexpectedly.


232. How do I tell Open MPI which TCP interfaces / networks to use?

In some parallel environments, it is not uncommon to have multiple TCP interfaces on each node -- for example, one TCP network may be "slow" and used for control information such as a batch scheduler, a networked filesystem, and/or interactive logins. Another TCP network (or networks) may be "fast" and be intended for parallel applications to use during their runs. As another example, some operating systems may also have virtual interfaces for communicating with virtual machines.

Unless otherwise specified, Open MPI will greedily use all "up" TCP networks that it can find and try to connect to all peers upon demand (i.e., Open MPI does not open sockets to all of its MPI peers during MPI_INIT -- see this FAQ entry for more details). Hence, if you want MPI jobs to not use specific TCP networks -- or not use any TCP networks at all -- then you need to tell Open MPI.

NOTE: Aggressively using all "up" interfaces can cause problems in some cases. For example, if you have a machine with a local-only interface (e.g., the loopback device, or a virtual-machine bridge device that can only be used on that machine, and cannot be used to communicate with MPI processes on other machines), you will likely need to tell Open MPI to ignore these networks. Open MPI usually ignores loopback devices by default, but other local-only devices must be manually ignored. Users have reported cases where RHEL6 automatically installed a "virbr0" device for Xen virtualization. This interface was automatically given an IP address in the 192.168.1.0/24 subnet and marked as "up". Since Open MPI saw this 192.168.1.0/24 "up" interface in all MPI processes on all nodes, it assumed that that network was usable for MPI communications. This is obviously incorrect, and it led to MPI applications hanging when they tried to send or receive MPI messages.

  1. To disable Open MPI from using TCP for MPI communications, set the btl MCA parameter accordingly. You can either exclude the tcp component or include only the other components that you want to use. Specifically:

    # This says to exclude the TCP BTL component 
    # (implicitly including all others)
    shell$ mpirun --mca btl ^tcp ...
    
    # This says to include only the listed BTL components
    # (tcp is not listed, and therefore will not be used)
    shell$ mpirun --mca btl self,sm,openib ...
    

  2. If you want to use TCP for MPI communications, but want to restrict it from certain networks, use the btl_tcp_if_include or btl_tcp_if_exclude MCA parameters (only one of the two should be set). The values of these parameters can be a comma-delimited list of network interfaces. For example:

    # This says to not use the eth0 and lo interfaces
    # (and implicitly use all the rest).  Per the description
    # above, the TCP loopback and all local-only devices *must*
    # be listed in the exclude list if that list is specified at all.
    shell$ mpirun --mca btl_tcp_if_exclude lo,eth0 ...
    
    # This says to only use the eth1 and eth2 interfaces
    # (and implicitly ignore the rest)
    shell$ mpirun --mca btl_tcp_if_include eth1,eth2 ...
    

  3. Starting in the Open MPI v1.5 series, you can specify subnets in the include or exclude lists in CIDR notation. For example:

    # Only use the 192.168.1.0/24 and 10.10.0.0/16 subnets for MPI
    # communications:
    shell$ mpirun --mca btl_tcp_if_include 192.168.1.0/24,10.10.0.0/16 ...
    

    NOTE: If you use the btl_tcp_if_include and btl_tcp_if_exclude MCA parameters to shape the behavior of the TCP BTL for MPI communications, you may also need/want to investigate the corresponding MCA parameters oob_tcp_if_include and oob_tcp_if_exclude, which are used to shape non-MPI TCP-based communication (e.g., communications setup and coordination during MPI_INIT and MPI_FINALIZE).

Note that Open MPI will still use TCP for control messages, such as data between mpirun and the MPI processes, rendezvous information during MPI_INIT, etc. To disable TCP altogether, you also need to disable the tcp component from the OOB framework.
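
For example, a sketch that restricts both the MPI (BTL) traffic and the non-MPI control (OOB) traffic to a single network (the interface name is illustrative):

shell$ mpirun --mca btl_tcp_if_include eth1 --mca oob_tcp_if_include eth1 ...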


233. Does Open MPI open a bunch of sockets during MPI_INIT?

Although Open MPI is likely to open multiple TCP sockets during MPI_INIT, the tcp BTL component does not open one socket per MPI peer process during MPI_INIT. Open MPI opens sockets as they are required -- so the first time a process sends a message to a peer and there is not yet a TCP connection between the two, Open MPI will automatically open a new socket.

Hence, you should not have scalability issues with running large numbers of processes (e.g., running out of per-process file descriptors) if your parallel application is sparse in its communication with peers.


234. Are there any Linux kernel TCP parameters that I should set?

Everyone has different opinions on this, and it also depends on your exact hardware and environment. Below are general guidelines that some users have found helpful.

  1. net.ipv4.tcp_syn_retries: Some Linux systems have very large initial connection timeouts -- they retry sending SYN packets many times before determining that a connection cannot be made. If MPI is going to fail to make socket connections, it is better for it to fail reasonably quickly (minutes vs. hours). You might want to reduce this value to a smaller one; YMMV.
  2. net.ipv4.tcp_keepalive_time: Some MPI applications send an initial burst of MPI messages (over TCP) and then send nothing for long periods of time (e.g., embarrassingly parallel applications). Linux may decide that these dormant TCP sockets are dead because it has seen no traffic on them for long periods of time. You might therefore need to lengthen the TCP inactivity timeout. Many Linux systems default to 7,200 seconds; increase it if necessary.
  3. Increase TCP buffering for 10G or 40G. Many Linux distributions come with good buffering presets for 1G Ethernet. In a datacenter/HPC cluster with 10G or 40G Ethernet NICs, this amount of kernel buffering is typically insufficient. Here's a set of parameters that some have used for good 10G/40G TCP bandwidth:
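
    As an illustrative sketch only (the values below are examples, not
    recommendations -- tune them for your own NICs, kernel, and workload),
    such settings typically raise the kernel's TCP buffer limits:

    # Example values only; adjust for your environment
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216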

Each of the above items is a Linux kernel parameter that can be set in multiple different ways.

  1. You can change the running kernel via the /proc filesystem:

    shell# cat /proc/sys/net/ipv4/tcp_syn_retries
    5
    shell# echo 6 > /proc/sys/net/ipv4/tcp_syn_retries
    

  2. You can also use the sysctl command:

    shell# sysctl net.ipv4.tcp_syn_retries
    net.ipv4.tcp_syn_retries = 5
    shell# sysctl -w net.ipv4.tcp_syn_retries=6
    net.ipv4.tcp_syn_retries = 6
    

  3. Or you can set them by adding entries in /etc/sysctl.conf, which are persistent across reboots:

    shell# grep tcp_syn_retries /etc/sysctl.conf
    net.ipv4.tcp_syn_retries = 6
    

  4. Your Linux distro may also support placing individual files in /etc/sysctl.d (even if that directory does not yet exist), which is actually better practice than putting them in /etc/sysctl.conf. For example:

    shell# cat /etc/sysctl.d/my-tcp-settings
    net.ipv4.tcp_syn_retries = 6
    


235. How does Open MPI know which TCP addresses are routable to each other in Open MPI 1.2?

This is a fairly complicated question -- there can be ambiguity when hosts have multiple TCP NICs and/or there are multiple TCP networks that are not routable to each other in a single MPI job.

It is important to note that Open MPI's atomic unit of routing is a process -- not an IP address. Hence, Open MPI makes connections between processes, not nodes (these processes are almost always on remote nodes, but it's still better to think in terms of processes, not nodes).

Specifically, since OMPI can span multiple TCP networks, each MPI process may be able to use multiple IP addresses to reach each other MPI process (and vice versa). So for each process, Open MPI needs to determine which IP address -- if any -- to use to connect to a peer MPI process.

For example, say that you have a cluster with 16 nodes on a private ethernet network. One of these nodes doubles as the head node for the cluster and therefore has 2 ethernet NICs -- one to the external network and one to the internal cluster network. But since 16 is a nice number, you also want to use the head node for computation. So when you mpirun spanning all 16 nodes, OMPI has to figure out not to use the external NIC on the head node and to use only the internal NIC.

To explain what happens, we need to explain some of what happens in MPI_INIT. Even though Open MPI only makes TCP connections between peer MPI processes upon demand (see this FAQ entry), each process publishes its TCP contact information which is then made available to all processes. Hence, every process knows the TCP address(es) and corresponding port number(s) to contact every other process.

But keep in mind that these addresses may span multiple TCP networks and/or not be routable to each other. So when a connection is requested, the TCP BTL component in Open MPI creates pairwise combinations of all the TCP addresses of the localhost to all the TCP addresses of the peer process, looking for a match.

A "match" is defined by the following rules:

  1. If the two IP addresses match after the subnet mask is applied, assume that they are mutually routable and allow the connection
  2. If the two IP addresses are public, assume that they are mutually routable and allow the connection
  3. Otherwise, the connection is disallowed (this is not an error -- we just disallow this connection on the hope that some other device can be used to make a connection)

These rules tend to cover the following scenarios:

  • A cluster on a private network with a head node that has a NIC on the private network and the public network
  • Clusters that have all public addresses

These rules do not cover the following cases:

  • Running an MPI job that spans public and private networks
  • Running an MPI job that spans a bunch of private networks with narrowly-scoped netmasks, such as nodes that have IP addresses 192.168.1.10 and 192.168.2.10 with netmasks of 255.255.255.0 (i.e., the network fabric makes these two nodes be routable to each other, even though the netmask implies that they are on different subnets).


236. How does Open MPI know which TCP addresses are routable to each other in Open MPI 1.3 (and beyond)?

The 1.3 series makes very different assumptions about routability than the 1.2 series. In the 1.3 series, we assume that all interfaces are routable to each other as long as they have the same address family, IPv4 or IPv6. We use graph theory and give each possible connection a weight depending on the quality of the connection. This allows the library to select the best connections between nodes. This method also supports striping, but prevents more than one connection to any single interface.

The quality of the connection is defined as follows, with a higher number meaning a better connection. Note that a connection consisting of a private address and a public address is given the weight of PRIVATE_DIFFERENT_NETWORK.

            NO_CONNECTION = 0
PRIVATE_DIFFERENT_NETWORK = 1
PRIVATE_SAME_NETWORK      = 2
PUBLIC_DIFFERENT_NETWORK  = 3
PUBLIC_SAME_NETWORK       = 4

At this point, an example will best illustrate how two processes on two different nodes would connect up. Here we have two nodes with a variety of interfaces.

       
        NodeA                NodeB
   ---------------       ---------------
  |     lo0       |     |     lo0       |
  |  127.0.0.1    |     |  127.0.0.1    |
  |  255.0.0.0    |     |  255.0.0.0    |
  |               |     |               |
  |     eth0      |     |    eth0       |
  |   10.8.47.1   |     |   10.8.47.2   |
  | 255.255.255.0 |     | 255.255.255.0 |
  |               |     |               |
  |     ibd0      |     |     ibd0      |
  |  192.168.1.1  |     |  192.168.1.2  |
  | 255.255.255.0 |     | 255.255.255.0 |
  |               |     |               |
  |     ibd1      |     |               |
  |  192.168.2.2  |     |               |
  | 255.255.255.0 |     |               |
   ---------------       ---------------

From these two nodes, the software builds up a bipartite graph that shows all the possible connections with all the possible weights. The lo0 interfaces are excluded as the btl_tcp_if_exclude mca parameter is set to lo by default. Here is what all the possible connections with their weights look like.

     NodeA         NodeB
eth0 --------- 2 -------- eth0
    \
     \
      \------- 1 -------- ibd0

ibd0 --------- 1 -------- eth0
    \
     \
      \------- 2 -------- ibd0

ibd1 --------- 1 -------- eth0
    \
     \
      \------- 1 -------- ibd0

The library then examines all the connections and picks the optimal ones. This leaves us with two connections being established between the two nodes.

If you are curious about the actual connect() calls being made by the processes, then you can run with --mca btl_base_verbose 30. This can be useful if you notice your job hanging and believe it may be the library trying to make connections to unreachable hosts.

# Here is an example with some of the output deleted for clarity.
# One can see the connections that are attempted.
shell$ mpirun --mca btl self,sm,tcp --mca btl_base_verbose 30 -np 2 -host NodeA,NodeB a.out
[...snip...]
[NodeA:18003] btl: tcp: attempting to connect() to address 10.8.47.2 on port 59822
[NodeA:18003] btl: tcp: attempting to connect() to address 192.168.1.2 on port 59822
[NodeB:16842] btl: tcp: attempting to connect() to address 192.168.1.1 on port 44500
[...snip...]

In case you want more details about the theory behind the connection code, you can find the background story in a brief IEEE paper.


237. Does Open MPI ever close TCP sockets?

As of v1.2, no.

Although TCP sockets are opened "lazily" (meaning that MPI connections / TCP sockets are only opened upon demand -- as opposed to opening all possible sockets between MPI peer processes during MPI_INIT), they are never closed.


238. Does Open MPI support IP interfaces that have more than one IP address?

As of v1.6, no.

For example, if the output from your ifconfig has a single IP device with multiple IP addresses like this:

0: eth0:  mtu 1500 qdisc mq state UP qlen 1000
   link/ether 00:18:ae:f4:d2:29 brd ff:ff:ff:ff:ff:ff
   inet 192.168.0.3/24 brd 192.168.0.255 scope global eth0:1
   inet 10.10.0.3/24 brd 10.10.0.255 scope global eth0
   inet6 fe80::218:aef2:29b4:2c4/64 scope link 
      valid_lft forever preferred_lft forever

(note the two "inet" lines in there)

Then Open MPI will be unable to use this device.


239. Does Open MPI support virtual IP interfaces?

As of v1.6.2, no.

For example, if the output of your ifconfig has both "eth0" and "eth0:0", Open MPI will get confused if you use the TCP BTL, and will likely hang.

Note that using btl_tcp_if_include or btl_tcp_if_exclude to avoid using the virtual interface will not solve the issue.

This may get fixed in a future release. See Trac bug #3339 to follow the progress on this issue.


240. What Myrinet-based components does Open MPI have?

Some versions of Open MPI support both GM and MX for MPI communications.

Open MPI series      GM supported   MX supported
v1.0 series          Yes            Yes
v1.1 series          Yes            Yes
v1.2 series          Yes            Yes (BTL and MTL)
v1.3 / v1.4 series   Yes            Yes (BTL and MTL)
v1.5 / v1.6 series   No             Yes (BTL and MTL)
v1.7 / v1.8 series   No             Yes (MTL only)
v1.9 and beyond      No             No


241. How do I specify to use the Myrinet GM network for MPI messages?

In general, you specify that the gm BTL component should be used. However, note that you should also specify that the self BTL component should be used. self is for loopback communication (i.e., when an MPI process sends to itself). This is technically a different communication channel than Myrinet. For example:

shell$ mpirun --mca btl gm,self ...

Failure to specify the self BTL may result in Open MPI being unable to complete send-to-self scenarios (meaning that your program will run fine until a process tries to send to itself).

To use Open MPI's shared memory support for on-host communication instead of GM's shared memory support, simply include the sm BTL. For example:

shell$ mpirun --mca btl gm,sm,self ...

Finally, note that if the gm component is available at run time, Open MPI should automatically use it by default (ditto for self and sm). Hence, it's usually unnecessary to specify these options on the mpirun command line. They are typically only used when you want to be absolutely positively definitely sure to use the specific BTL.


242. How do I specify to use the Myrinet MX network for MPI messages?

As of version 1.2, Open MPI has two different components to support Myrinet MX: the mx BTL and the mx MTL. Only one of them can be used at a time. Prior versions only have the mx BTL.

If available, the mx BTL is used by default. However, to be sure it is selected you can specify it. Note that you should also specify the self BTL component (for loopback communication) and the sm BTL component (for on-host communication). For example:

shell$ mpirun --mca btl mx,sm,self ...

To use the mx MTL component, it must be specified. Also, you must use the cm PML component. For example:

shell$ mpirun --mca mtl mx --mca pml cm ...

Note that one cannot use both the mx MTL and the mx BTL components at once. Deciding which to use largely depends on the application being run.


243. But wait -- I also have a TCP network. Do I need to explicitly disable the TCP BTL?

No. See this FAQ entry for more details.


244. How do I know what MCA parameters are available for tuning MPI performance?

The ompi_info command can display all the parameters available for the gm and mx BTL components and the mx MTL component:

# Show the gm BTL parameters
shell$ ompi_info --param btl gm

# Show the mx BTL parameters
shell$ ompi_info --param btl mx

# Show the mx MTL parameters
shell$ ompi_info --param mtl mx


245. I'm experiencing a problem with Open MPI on my Myrinet-based network; how do I troubleshoot and get help?

In order for us to help you, it is most helpful if you can perform a few steps before sending an e-mail, both to do some basic troubleshooting and to provide us with enough information about your environment. Please include answers to the following questions in your e-mail:

  1. Which Myricom software stack are you running: GM or MX? Which version?
  2. Are you using "fma", the "gm_mapper", or the "mx_mapper"?
  3. If running GM, include the output from running gm_board_info on a known "good" node and a known "bad" node.
    If running MX, include the output from running mx_info on a known "good" node and a known "bad" node.
    • Is the "Map version" value from this output the same across all nodes?
    • NOTE: If the map version is not the same, ensure that you are not running a mixture of FMA on some nodes and the mapper on others. Also check the connectivity of nodes that seem to have an inconsistent map version.

  4. What are the contents of the file /var/run/fms/fma.log?

Gather up this information and see this page about how to submit a help request to the user's mailing list.


246. How do I adjust the MX first fragment size? Are there constraints?

The MX library limits the maximum message fragment size for both on-node and off-node messages. As of MX v1.0.3, the inter-node maximum fragment size is 32k, and the intra-node maximum fragment size is 16k -- fragments sent larger than these sizes will fail.

Open MPI automatically fragments large messages; it currently limits its first fragment size on MX networks to the lower of these two values -- 16k. As such, increasing the value of the MCA parameter named btl_mx_first_frag_size larger than 16k may cause failures in some cases (i.e., when using MX to send large messages to processes on the same node); it will cause failures in all cases if it is set above 32k.

Note that this only affects the first fragment of messages; later fragments do not have this size restriction. The MCA parameter btl_mx_max_send_size can be used to vary the maximum size of subsequent fragments.
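
For example, a hedged illustration of raising the subsequent-fragment size (the value shown is arbitrary):

shell$ mpirun --mca btl_mx_max_send_size 65536 ...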


247. What versions of Open MPI contain support for uDAPL?

The following versions of Open MPI contain support for uDAPL:

Open MPI series      uDAPL supported
v1.0 series          No
v1.1 series          No
v1.2 series          Yes
v1.3 / v1.4 series   Yes
v1.5 / v1.6 series   Yes
v1.7 and beyond      No


248. What is different between Sun Microsystems ClusterTools 7 and Open MPI with regard to the uDAPL BTL?

Sun's ClusterTools is based on Open MPI, with one significant difference: Sun's ClusterTools includes uDAPL RDMA capabilities in the uDAPL BTL. The Open MPI v1.2 uDAPL BTL does not include the RDMA capabilities. These improvements do exist today in the Open MPI trunk and will be included in future Open MPI releases.


249. What values are expected to be used by the btl_udapl_if_include and btl_udapl_if_exclude MCA parameters?

The uDAPL BTL looks for a match from the uDAPL static registry, which is contained in the dat.conf file. Each line that is neither a comment nor blank is considered an interface. The first field of each interface entry is the value that must be supplied to the MCA parameter in question.

Solaris Example:

shell% datadm -v
ibd0  u1.2  nonthreadsafe  default  udapl_tavor.so.1  SUNW.1.0  " "  "driver_name=tavor"
shell% mpirun --mca btl_udapl_if_include ibd0 ...

Linux Example:

shell% cat /etc/dat.conf
OpenIB-cma u1.2 nonthreadsafe default /usr/local/ofed/lib64/libdaplcma.so dapl.1.2 "ib0 0" ""
OpenIB-bond u1.2 nonthreadsafe default /usr/local/ofed/lib64/libdaplcma.so dapl.1.2 "bond0 0" ""
shell% mpirun --mca btl_udapl_if_exclude OpenIB-bond ...


250. Where is the static uDAPL Registry found?

Solaris: /etc/dat/dat.conf

Linux: /etc/dat.conf


251. How come the value reported by "ifconfig" is not accepted by the btl_udapl_if_include/btl_udapl_if_exclude MCA parameter?

uDAPL queries a static registry defined in the dat.conf file to find available interfaces which can be used. As such, the uDAPL BTL needs to match the names found in the registry and these may differ from what is reported by "ifconfig".


252. I get a warning message about not being able to register memory and possibly being out of privileged memory while running on Solaris; what can I do?

The error message probably looks something like this:

WARNING: The uDAPL BTL is not able to register memory. Possibly out of
allowed privileged memory (i.e. memory that can be pinned). Increasing
the allowed privileged memory may alleviate this issue.

One thing to do is increase the amount of available privileged memory. On Solaris, your system administrator can increase the amount of available privileged memory by editing the /etc/project file on the nodes. For more information, see the Solaris "project" man page.

shell% man project

As an example of increasing the privileged memory, first determine the amount available (a typical value is 978 MB):

shell% prctl -n project.max-device-locked-memory -i project default
NAME    PRIVILEGE       VALUE    FLAG   ACTION          RECIPIENT
project.max-device-locked-memory
        privileged       978MB      -   deny            -
        system          16.0EB    max   deny            -

To increase the amount of privileged memory edit /etc/project file:

Default /etc/project file.

system:0::::
user.root:1::::
noproject:2::::
default:3::::
group.staff:10::::

Change it to, for example, 4 GB:

system:0::::
user.root:1::::
noproject:2::::
default:3::::project.max-device-locked-memory=(priv, 4294967296, deny) 
group.staff:10::::


253. What is special about MPI performance analysis?

The synchronization among the MPI processes can be a key performance concern. For example, if a serial program spends a lot of time in function foo(), you should optimize foo(). In contrast, if an MPI process spends a lot of time in MPI_Recv(), not only is the optimization target probably not MPI_Recv(), but you should in fact probably be looking at some other process altogether. You should ask, "What is happening on other processes when this process has the long wait?"

Another issue is that a parallel program (in the case of MPI, a multi-process program) can generate much more performance data than a serial program due to the greater number of execution threads. Managing that data volume can be a challenge.


254. What are "profiling" and "tracing"?

These terms are sometimes used to refer to two different kinds of performance analysis.

In profiling, one aggregates statistics at run time -- e.g., total amount of time spent in MPI, total number of messages or bytes sent, etc. Data volumes are small.

In tracing, an event history is collected. It is common to display such event history on a timeline display. Tracing data can provide much interesting detail, but data volumes are large.


255. How do I sort out busy wait time from idle wait, user time from system time, and so on?

Don't.

MPI synchronization delays, which are key performance inhibitors you will probably want to study, can show up as user or system time, all depending on the MPI implementation, the type of wait, what run-time settings you have chosen, etc. In many cases, it makes most sense for you just to distinguish between time spent inside MPI from time spent outside MPI. Elapsed wallclock time will probably be your key metric. Exactly how the MPI implementation spends time waiting is less important.


256. What is PMPI?

PMPI refers to the MPI standard profiling interface.

Each standard MPI function can be called with an MPI_ or PMPI_ prefix. For example, you can call either MPI_Send() or PMPI_Send(). This feature of the MPI standard allows one to write functions with the MPI_ prefix that call the equivalent PMPI_ function. Specifically, a function so written has the behavior of the standard function plus any other behavior one would like to add. This is important for MPI performance analysis in at least two ways.

First, many performance analysis tools take advantage of PMPI. They capture the MPI calls made by your program. They perform the associated message-passing calls by calling PMPI functions, but also capture important performance data.

Second, you can use such wrapper functions to customize MPI behavior. E.g., you can add barrier operations to collective calls, write out diagnostic information for certain MPI calls, etc.
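
As a minimal sketch (illustrative only, not code from Open MPI itself), a C profiling wrapper that times MPI_Barrier() might look like the following; it records a measurement and then calls the PMPI_ routine to do the real work:

#include <stdio.h>
#include <mpi.h>

/* Illustrative profiling wrapper: intercept MPI_Barrier(), time it,
   and let the PMPI_ interface perform the actual operation. */
int MPI_Barrier(MPI_Comm comm)
{
    double start = MPI_Wtime();
    int rc = PMPI_Barrier(comm);   /* the real barrier */
    double elapsed = MPI_Wtime() - start;
    printf("MPI_Barrier took %f seconds\n", elapsed);
    return rc;
}

Such a wrapper is typically compiled into a shared library (see the note below about making the library dynamic) and linked ahead of the MPI library; a plausible -- but purely illustrative -- build command is something like mpicc -shared -fPIC wrapper.c -o libmpiwrap.so.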

OMPI generally layers the various function interfaces as follows:

  • Fortran MPI_ interfaces are weak symbols for ...
  • Fortran PMPI_ interfaces, which call ...
  • C MPI_ interfaces, which are weak symbols for ...
  • C PMPI_ interfaces, which provide the specified functionality.

Since OMPI generally implements MPI functionality for all languages in C, you only need to provide profiling wrappers in C, even if your program is in another programming language. Alternatively, you may write the wrappers in your program's language, but if you provide wrappers in both languages then both sets will be invoked.

There are a handful of exceptions. For example, MPI_ERRHANDLER_CREATE() in Fortran does not call MPI_Errhandler_create(). Instead, it calls some other low-level function. Thus, to intercept this particular Fortran call, you need a Fortran wrapper.

Be sure you make the library dynamic. A static library can experience the linker problems described in the Complications section of the Profiling Interface chapter of the MPI standard.

See the section on Profiling Interface in the MPI standard for more details.


257. Should I use those switches --enable-mpi-profile and --enable-trace when I configure OMPI?

Probably not.

The --enable-mpi-profile switch enables building of the PMPI interfaces. While this is important for performance analysis, this setting is already turned on by default.

The --enable-trace switch enables internal tracing of OMPI/ORTE/OPAL calls. It is used only for developer debugging, not MPI application performance tracing.


258. What support does OMPI have for performance analysis?

The OMPI source base has some instrumentation to capture performance data, but that data must be analyzed by other non-OMPI tools.

PERUSE was a proposed MPI standard that gives information about low-level behavior of MPI internals. Check the PERUSE web site for any information about analysis tools. When you configure OMPI, be sure to use --enable-peruse. Information is available describing its integration with OMPI.

Unfortunately, PERUSE didn't win standardization, so it didn't really go anywhere. Open MPI may drop PERUSE support at some point in the future.

MPI-3 standardized the MPIT tools interface API (see Chapter 14 in the MPI-3.0 specification). As of v1.6.3, Open MPI does not yet support this interface, but it is actively being developed. It is expected that Open MPI will include a full implementation of MPIT in a future release.

VampirTrace traces the entry to and exit from the MPI layer, along with important performance data, writing data using the open OTF format. VT is available freely and can be used with any MPI. Information is available describing its integration with OMPI.


259. How do I view VampirTrace output?

While OMPI includes VampirTrace instrumentation, it does not provide a tool for viewing OTF trace data. There is simply a primitive otfdump utility in the same directory where other OMPI commands (mpicc, mpirun, etc.) are located.

Another simple utility, otfprofile, comes with OTF software and allows you to produce a short profile in LaTeX format from an OTF trace.

The main way to view OTF data is with the Vampir tool. Evaluation licenses are available.


260. Are there MPI performance analysis tools for OMPI that I can download for free?

The OMPI distribution includes no such tools, but some general MPI tools can be used with OMPI.

...we used to maintain a list of links here. But the list changes over time; projects come, and projects go. Your best bet these days is simply to use Google to find MPI tracing and performance analysis tools.


261. Any other kinds of tools I should know about?

Well, there are other tools you should consider. Part of performance analysis is not just analyzing performance per se, but generally understanding the behavior of your program.

As such, debugging tools can help you step through or pry into the execution of your MPI program. Popular tools include TotalView, which can be downloaded for free trial use, and Allinea DDT which also provides evaluation copies.

The command-line job inspection tool padb has been ported to ORTE and OMPI.


262. How does Open MPI handle HFS+ / UFS filesystems?

Generally, Open MPI does not care whether it is running from an HFS+ or UFS filesystem. However, the C++ wrapper compiler historically has been called mpiCC, which of course is the same file as mpicc when running on HFS+. During the configure process, Open MPI will attempt to determine whether the build filesystem is case sensitive, and it assumes that the install filesystem behaves the same way. Generally, this is all that is needed to deal with HFS+.

However, if you are building on UFS and installing to HFS+, you should specify --without-cs-fs to configure to make sure Open MPI does not build the mpiCC wrapper. Likewise, if you build on HFS+ and install to UFS, you may want to specify --with-cs-fs to ensure that mpiCC is installed.
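
For example (a sketch of the flag usage only):

shell$ ./configure --without-cs-fs ...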


263. How do I use the Open MPI wrapper compilers in XCode?

XCode has a non-public interface for adding compilers to XCode. A friendly Open MPI user sent in a configuration file for XCode 2.3, MPICC.pbcompspec, which will add support for the Open MPI wrapper compilers. The file should be placed in /Library/Application Support/Apple/Developer Tools/Specifications/. Upon starting XCode, this file is loaded and added to the list of known compilers.

To use the mpicc compiler, open the project, get info on the target, click the rules tab, and add a new entry. Change the process rule for "C source files" and select using MPICC.

Before moving the file, the ExecPath parameter should be set to the location of the Open MPI install. The BasedOn parameter should be updated to refer to the compiler version that mpicc will invoke -- generally gcc-4.0 on OS X 10.4 machines.

Thanks to Karl Dockendorf for this information.


264. How do I run jobs under XGrid?

XGrid support is included in Open MPI and will be built if the XGrid tools are installed.

We unfortunately have little documentation on how to run with XGrid at this point other than a fairly lengthy e-mail that Brian Barrett wrote on the Open MPI user's mailing list:

Since Open MPI 1.1.2, we also support authentication using Kerberos. The process is essentially the same, but there is no need to specify the XGRID_PASSWORD field. Open MPI applications will then run as the authenticated user, rather than nobody.


265. Where do I get more information about running under XGrid?

Please write to us on the user's mailing list. Hopefully any replies that we send will contain enough information to create proper FAQs about how to use Open MPI with XGrid.


266. Is Open MPI included in OS X?

Open MPI v1.2.3 was included in OS X starting with version 10.5 (Leopard). Note that Leopard does not include a Fortran compiler, so the OS X-shipped version of Open MPI does not include Fortran support.

If you need/want Fortran support, you will need to build your own copy of Open MPI (presumably after you have a Fortran compiler installed). The Open MPI team strongly recommends not overwriting the OS X-installed version of Open MPI, but rather installing it somewhere else (e.g., /opt/openmpi).


267. How do I not use the OS X-bundled Open MPI?

There are a few reasons you might not want to use the OS X-bundled Open MPI, such as wanting Fortran support, upgrading to a new version, etc.

If you wish to use a community version of Open MPI, you can download and build Open MPI on OS X just like any other supported platform. We strongly recommend not replacing the OS X-installed Open MPI, but rather installing to an alternate location (such as /opt/openmpi).

Once you have successfully installed Open MPI, be sure to prefix your PATH with Open MPI's bindir. This will ensure that you are using your newly-installed Open MPI, not the OS X-installed Open MPI. For example:

# Not showing the complete URL/tarball name because it changes over time :-)
shell$ wget http://www.open-mpi.org/.../open-mpi....
shell$ tar zxf openmpi-...gz
shell$ cd openmpi-...
shell$ ./configure --prefix=/opt/openmpi 2>&1 | tee config.out
[...lots of output...]
shell$ make -j 4 2>&1 | tee make.out
[...lots of output...]
shell$ sudo make install 2>&1 | tee install.out
[...lots of output...]
shell$ export PATH=/opt/openmpi/bin:$PATH
shell$ ompi_info
[...see output from newly-installed Open MPI...]

Of course, you'll want to make your PATH changes permanent. One way to do this is to edit your shell startup files.

Note that there is no need to add Open MPI's libdir to LD_LIBRARY_PATH; Open MPI's shared library build process uses the "rpath" mechanism to automatically find the correct shared libraries (i.e., the ones associated with this build, vs., for example, the OS X-shipped OMPI shared libraries). Also note that we specifically do not recommend adding Open MPI's libdir to DYLD_LIBRARY_PATH.

If you build static libraries for Open MPI, there is an ordering problem such that /usr/lib/libmpi.dylib will be found before $libdir/libmpi.a, and therefore user-linked MPI applications that use mpicc (and friends) will use the "wrong" libmpi. This can be fixed by editing OMPI's wrapper compilers to force the use of the Right libraries, such as with the following flag when configuring Open MPI:

shell$ ./configure --with-wrapper-ldflags="-Wl,-search_paths_first" ...


268. Is AIX a supported operating system for Open MPI?

No. AIX used to be supported, but none of the current Open MPI developers has any platforms that require AIX support for Open MPI.

Since Open MPI is an open source project, its features and requirements are driven by the union of its developers. Hence, AIX support has fallen away because none of us currently use AIX. All this means is that we do not develop or test on AIX; there is no fundamental technology reason why Open MPI couldn't be supported on AIX.

AIX support could certainly be re-instated if someone who wanted AIX support joins the core group of developers and contributes the development and testing to support AIX.


269. Does Open MPI work on AIX?

There have been reports from random users that a small number of changes are required to the Open MPI code base to make it work under AIX. For example, see the following post on the Open MPI user's list, reported by Ricardo Fonseca:


270. What is VampirTrace?

VampirTrace is a program tracing package that can collect a very fine grained event trace of your sequential or parallel program. The traces can be visualized by the Vampir tool and a number of other tools that read the Open Trace Format (OTF).

Tracing is interesting for performance analysis and optimization of parallel and HPC (High Performance Computing) applications in general and MPI programs in particular. In fact, that's where the letters 'mpi' in Vampir come from. Therefore, it is integrated into Open MPI for convenience.

VampirTrace is included in Open MPI v1.3 and later.

VampirTrace consists of two main components. First, the instrumentation part slightly modifies the target program so that it is notified about run-time events of interest. Simply replace the compiler wrappers to activate it: mpicc becomes mpicc-vt, mpicxx becomes mpicxx-vt, and so on (note that the *-vt variants of the wrapper compilers are unavailable before Open MPI v1.3). Second, the run-time measurement part is responsible for data collection. This can only be effective when the first part has been performed -- otherwise there will be no effect on your program at all.

VampirTrace has been developed at ZIH, TU Dresden in collaboration with the KOJAK project from JSC/FZ Juelich and is available as open source software under BSD license, see ompi/contrib/vt/vt/COPYING.

The software is also available as a stand-alone source code package. The latest version can always be found at http://www.tu-dresden.de/zih/vampirtrace/.


271. Where can I find the complete documentation of VampirTrace?

A complete documentation of VampirTrace comes with the Open MPI software package as PDF and HTML (in Open MPI v1.3 and later). You can find it in the Open MPI source tree ompi/contrib/vt/vt/doc/ or after installing Open MPI in $(install-prefix)/share/vampirtrace/doc/.


272. How do I instrument my MPI application with VampirTrace?

All the necessary instrumentation of user functions as well as MPI and OpenMP events is handled by special compiler wrappers ( mpicc-vt, mpicxx-vt, mpif77-vt, mpif90-vt ). Unlike the normal wrappers ( mpicc and friends) these wrappers call VampirTrace's compiler wrappers ( vtcc, vtcxx, vtf77, vtf90 ) instead of the native compilers. The vt* wrappers use underlying platform compilers to perform the necessary instrumentation of the program and link the suitable VampirTrace library.

Original:

shell$ mpicc -c hello.c -o hello

With instrumentation:

shell$ mpicc-vt -c hello.c -o hello

For your application, simply change the compiler definitions in your Makefile(s):

# original definitions in Makefile
## CC=mpicc
## CXX=mpicxx
## F90=mpif90

# replace with
CC=mpicc-vt
CXX=mpicxx-vt
F90=mpif90-vt


273. Does VampirTrace cause overhead to my application?

By using the default MPI compiler wrappers ( mpicc etc.) your application will be run without any changes at all. The VampirTrace compiler wrappers ( mpicc-vt etc.) link the VampirTrace library, which intercepts MPI calls and some user-level function/subroutine calls. This adds a certain amount of run-time overhead to applications. Usually, the overhead is reasonably small (0.x% - 5%), and VampirTrace by default enables precautions to avoid excessive overhead. However, it can be configured to produce very substantial overhead using non-default settings.


274. How can I change the underlying compiler of the mpi*-vt wrappers?

Unlike the standard MPI compiler wrappers ( mpicc etc.), the environment variables OMPI_CC, OMPI_CXX, OMPI_F77, and OMPI_F90 do not affect the VampirTrace compiler wrappers. Please use the environment variables VT_CC, VT_CXX, VT_F77, and VT_F90 instead. In addition, you can set the compiler with the wrapper's option -vt:[cc|cxx|f77|f90].

The following two are equivalent, setting the underlying compiler to gcc:

shell$ VT_CC=gcc mpicc-vt -c hello.c -o hello
shell$ mpicc-vt -vt:cc gcc -c hello.c -o hello

Furthermore, you can modify the default settings in /share/openmpi/mpi*-wrapper-data.txt.


275. How can I pass VampirTrace related configure options through the Open MPI configure?

To give options to the VampirTrace configure script you can add these to the configure option --with-contrib-vt-flags.

The following example passes the options --with-papi-lib-dir and --with-papi-lib to the VampirTrace configure script to specify the location and the name of the PAPI library:

shell$ ./configure --with-contrib-vt-flags='--with-papi-lib-dir=/usr/lib64 --with-papi-lib=-lpapi64' ...


276. How do I disable the integrated VampirTrace completely?

By default, the VampirTrace part of Open MPI will be built and installed. If you would like to disable building and installing of VampirTrace add the value vt to the configure option --enable-contrib-no-build.

shell$ ./configure --enable-contrib-no-build=vt ...


277. v1.7 Series

  1. 1.7.3