
FAQ:
Tuning the run-time characteristics of MPI Myrinet communications

Table of contents:

  1. What Myrinet-based components does Open MPI have?
  2. How do I specify to use the Myrinet GM network for MPI messages?
  3. How do I specify to use the Myrinet MX network for MPI messages?
  4. But wait -- I also have a TCP network. Do I need to explicitly disable the TCP BTL?
  5. How do I know what MCA parameters are available for tuning MPI performance?
  6. I'm experiencing a problem with Open MPI on my Myrinet-based network; how do I troubleshoot and get help?
  7. How do I adjust the MX first fragment size? Are there constraints?


1. What Myrinet-based components does Open MPI have?

Some versions of Open MPI support both GM and MX for MPI communications.

Open MPI series        GM supported   MX supported
v1.0 series            Yes            Yes
v1.1 series            Yes            Yes
v1.2 series            Yes            Yes (BTL and MTL)
v1.3 / v1.4 series     Yes            Yes (BTL and MTL)
v1.5 / v1.6 series     No             Yes (BTL and MTL)
v1.7 / v1.8 series     No             Yes (MTL only)
v1.9 and beyond        No             No


2. How do I specify to use the Myrinet GM network for MPI messages?

In general, you specify that the gm BTL component should be used. However, note that you should also specify that the self BTL component should be used. self is for loopback communication (i.e., when an MPI process sends to itself). This is technically a different communication channel than Myrinet. For example:

shell$ mpirun --mca btl gm,self ...

Failure to specify the self BTL may result in Open MPI being unable to complete send-to-self scenarios (meaning that your program will run fine until a process tries to send to itself).

To use Open MPI's shared memory support for on-host communication instead of GM's shared memory support, simply include the sm BTL. For example:

shell$ mpirun --mca btl gm,sm,self ...

Finally, note that if the gm component is available at run time, Open MPI should automatically use it by default (ditto for self and sm). Hence, it's usually unnecessary to specify these options on the mpirun command line. They are typically only used when you want to be absolutely positively definitely sure to use the specific BTL.


3. How do I specify to use the Myrinet MX network for MPI messages?

As of version 1.2, Open MPI has two different components to support Myrinet MX, the mx BTL and the mx MTL, only one of which can be used at a time. Prior versions only have the mx BTL.

If available, the mx BTL is used by default. However, to be sure it is selected you can specify it. Note that you should also specify the self BTL component (for loopback communication) and the sm BTL component (for on-host communication). For example:

shell$ mpirun --mca btl mx,sm,self ...

To use the mx MTL component, you must specify it explicitly, and you must also select the cm PML component. For example:

shell$ mpirun --mca mtl mx --mca pml cm ...

Note that one cannot use both the mx MTL and the mx BTL components at once. Deciding which to use largely depends on the application being run.
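
As a minimal sketch of keeping the two modes apart, the helper below prints the MCA options for exactly one MX mode, using only the options shown above. The function name mx_mca_args, the process count, and the application name ./my_app are hypothetical and are not part of Open MPI. The final line only prints a command line; it does not run mpirun.

```shell
# Hypothetical helper (not part of Open MPI): print the mpirun MCA
# options for exactly one MX mode, since the mx BTL and the mx MTL
# cannot be used at the same time.
mx_mca_args() {
    case "$1" in
        btl) echo "--mca btl mx,sm,self" ;;
        mtl) echo "--mca mtl mx --mca pml cm" ;;
        *)   echo "usage: mx_mca_args btl|mtl" >&2; return 1 ;;
    esac
}

# Print (do not run) a command line; ./my_app is a placeholder.
echo "mpirun $(mx_mca_args mtl) -np 4 ./my_app"
```

Running the same application once with each mode and comparing the results is one way to decide which component suits it.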


4. But wait -- I also have a TCP network. Do I need to explicitly disable the TCP BTL?

No. See this FAQ entry for more details.


5. How do I know what MCA parameters are available for tuning MPI performance?

The ompi_info command can display all the parameters available for the gm and mx BTL components and the mx MTL component:

# Show the gm BTL parameters
shell$ ompi_info --param btl gm

# Show the mx BTL parameters
shell$ ompi_info --param btl mx

# Show the mx MTL parameters
shell$ ompi_info --param mtl mx


6. I'm experiencing a problem with Open MPI on my Myrinet-based network; how do I troubleshoot and get help?

Before sending an e-mail, please run through a few steps, both to perform some basic troubleshooting and to give us enough information about your environment to help you. Please include answers to the following questions in your e-mail:

  1. Which Myricom software stack are you running: GM or MX? Which version?
  2. Are you using "fma", the "gm_mapper", or the "mx_mapper"?
  3. If running GM, include the output of gm_board_info from a known "good" node and a known "bad" node.
    If running MX, include the output of mx_info from a known "good" node and a known "bad" node.
    • Is the "Map version" value in this output the same across all nodes?
    • NOTE: If the map version is not the same, ensure that you are not running a mixture of FMA on some nodes and the mapper on others. Also check the connectivity of nodes that seem to have an inconsistent map version.

  4. What are the contents of the file /var/run/fms/fma.log?

Gather this information and see this page for how to submit a help request to the users' mailing list.
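
The items above can be collected with a small script run on both a "good" node and a "bad" node. This is only a sketch: the function name collect_myrinet_info and the output file name are hypothetical, and it assumes gm_board_info and mx_info are in your PATH when the corresponding stack is installed. Tools or files that are missing are noted in the output rather than treated as errors.

```shell
# Hypothetical helper: gather the diagnostics listed above into one file.
collect_myrinet_info() {
    out=$1
    {
        echo "== host: $(uname -n) =="
        for tool in gm_board_info mx_info; do
            echo "== $tool =="
            if command -v "$tool" >/dev/null 2>&1; then
                "$tool" 2>&1 || echo "($tool exited with an error)"
            else
                echo "($tool not found in PATH)"
            fi
        done
        echo "== /var/run/fms/fma.log =="
        if [ -r /var/run/fms/fma.log ]; then
            cat /var/run/fms/fma.log
        else
            echo "(file not present or not readable)"
        fi
    } > "$out"
}

# Write one report per node; the file name is only an example.
collect_myrinet_info "${TMPDIR:-/tmp}/myrinet-info-$(uname -n).txt"
```

Comparing the "Map version" lines in the reports from the two nodes answers the question in step 3.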


7. How do I adjust the MX first fragment size? Are there constraints?

The MX library limits the maximum message fragment size for both on-node and off-node messages. As of MX v1.0.3, the inter-node maximum fragment size is 32k, and the intra-node maximum fragment size is 16k -- fragments sent larger than these sizes will fail.

Open MPI automatically fragments large messages; it currently limits its first fragment size on MX networks to the lower of these two values -- 16k. As such, increasing the value of the MCA parameter named btl_mx_first_frag_size larger than 16k may cause failures in some cases (i.e., when using MX to send large messages to processes on the same node); it will cause failures in all cases if it is set above 32k.

Note that this only affects the first fragment of a message; later fragments do not have this size restriction. The MCA parameter btl_mx_max_send_size can be used to vary the maximum size of subsequent fragments.
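
The constraints above can be sketched as a quick sanity check before changing btl_mx_first_frag_size. The helper name check_first_frag_size is hypothetical, and the sketch assumes that the value is given in bytes and that "16k" and "32k" mean 16384 and 32768 bytes; the limits are the ones quoted above for MX v1.0.3 and may differ in other MX versions.

```shell
# Limits quoted above for MX v1.0.3, assuming k = 1024 bytes.
MX_INTRA_NODE_MAX=16384   # on-node (intra-node) maximum fragment size
MX_INTER_NODE_MAX=32768   # off-node (inter-node) maximum fragment size

# Hypothetical helper: classify a proposed btl_mx_first_frag_size value.
check_first_frag_size() {
    size=$1
    if [ "$size" -le "$MX_INTRA_NODE_MAX" ]; then
        echo "ok"
    elif [ "$size" -le "$MX_INTER_NODE_MAX" ]; then
        echo "may fail for on-node messages"
    else
        echo "will fail"
    fi
}

check_first_frag_size 16384
check_first_frag_size 24576
check_first_frag_size 65536
```

A value that passes the check could then be supplied in the usual way, for example with --mca btl_mx_first_frag_size followed by the value on the mpirun command line.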