Open MPI User's Mailing List Archives

Subject: [OMPI users] Process mapping and affinity on the devel trunk
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-12-11 00:14:30

Hello all

If you are using the developer's trunk or nightly tarball, or are interested in new mapping and binding options that will be included in the next feature series (1.7), then please read on. If not, then please ignore.

Several people have recently reported that "the trunk isn't binding processes any more". OMPI's mapping, ranking, and binding options underwent a major change on the developer's trunk a few weeks ago, to provide a greater range of options for process placement and binding. Although this was mentioned on the devel mailing list a while ago, I thought a general message might be in order, especially for those users who are working with the trunk.

Most importantly, under the new system, opal_paffinity_alone (and its synonym, mpi_paffinity_alone) has been disabled: it no longer does anything. I have added a warning that is printed whenever that parameter is set. This is more than likely the reason why you are not seeing processes bound.

That option has been replaced by the --bind-to <foo> option, where <foo> can be none, hardware thread (hwthread), core, L1 cache (l1cache), L2 cache (l2cache), L3 cache (l3cache), socket, or numa region. This can also be set as an MCA parameter "hwloc_base_binding_policy". There are two allowed qualifiers to the binding option:

* if-supported - binding will be done if the system supports it. If the system does not support it, the application will execute unbound without issuing a warning. Without this qualifier, an error message will be emitted and the execution aborted when the requested binding is not supported.

* overload-allowed - if the binding results in more processes than cpus being bound to a resource (e.g., if 4 processes are bound to a socket that only has 2 cpus), then execution will be terminated with an error unless this qualifier is provided.
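
To make the two binding qualifiers concrete, here is a small Python sketch of the decision rules as described above. It is a toy model, not OMPI code: the function names, arguments, and error messages are all made up for illustration.

```python
from collections import Counter


def resolve_binding(target, system_supports_binding, if_supported=False):
    """Return the binding that would take effect (hypothetical helper).

    Models the if-supported rule: fall back to running unbound when the
    system cannot bind, otherwise fail.
    """
    if target == "none" or system_supports_binding:
        return target
    if if_supported:
        return "none"  # run unbound, silently
    raise RuntimeError("binding to %s is not supported on this system" % target)


def check_overload(bindings, cpus_per_resource, overload_allowed=False):
    """Return the overloaded resources, or raise if overload is not allowed.

    bindings: one resource id per process (hypothetical representation).
    cpus_per_resource: dict mapping resource id -> number of cpus.
    """
    load = Counter(bindings)
    overloaded = sorted(r for r, n in load.items() if n > cpus_per_resource[r])
    if overloaded and not overload_allowed:
        raise RuntimeError("more processes than cpus on: %s" % overloaded)
    return overloaded
```

With 4 processes bound to a socket that has only 2 cpus, check_overload(["socket0"] * 4, {"socket0": 2}) raises, while the same call with overload_allowed=True returns ["socket0"].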

Mapping was also expanded to support mapping by all the same locations via the --map-by <foo> option, plus two additional locations: slot (default) and node. The option is also available as MCA parameter "rmaps_base_mapping_policy". The mapping option has three qualifiers:

* span - treat all allocated nodes as if they were a single node - i.e., map across all specified resources before looping around and placing the next layer of processes on them. The default is to loop across all resources on each node until that node is completely filled before moving to the next node, so the "span" qualifier acts to balance the load across the allocation.

* oversubscribe - allow more processes than allocated slots to be mapped onto a node. This is the default for user-specified allocations (i.e., by hostfile or -host).

* nooversubscribe - error out if more processes than allocated slots are mapped onto a node. This is the default for resource managed allocations (e.g., specified by SLURM or MOAB).
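
The difference between the default mapping order and the "span" qualifier is easiest to see on a small example. The Python sketch below is a toy model of mapping by socket, written from the description above and not taken from the OMPI mappers. The node dictionaries, the function name, and the choice to always enforce the nooversubscribe behavior are assumptions made for illustration.

```python
def map_by_socket(nodes, nprocs, span=False):
    """Return a list of (node name, socket index), one entry per process.

    nodes: list of dicts with hypothetical keys "name", "sockets", "slots".
    """
    total_slots = sum(n["slots"] for n in nodes)
    if nprocs > total_slots:
        # models nooversubscribe: refuse to exceed the allocated slots
        raise ValueError("more processes than allocated slots")

    placement = []
    if not span:
        # default: cycle over the sockets of one node until that node's
        # slots are used up, then move on to the next node
        for node in nodes:
            for i in range(node["slots"]):
                if len(placement) == nprocs:
                    return placement
                placement.append((node["name"], i % node["sockets"]))
        return placement

    # span: treat the sockets of all nodes as one pool and cycle over them
    used = {n["name"]: 0 for n in nodes}
    pool = [(n["name"], s, n["slots"]) for n in nodes for s in range(n["sockets"])]
    while len(placement) < nprocs:
        for name, sock, slots in pool:
            if len(placement) == nprocs:
                break
            if used[name] < slots:
                placement.append((name, sock))
                used[name] += 1
    return placement
```

For two nodes with 2 sockets and 4 slots each and 4 processes, the default puts all four processes on the first node, alternating between its two sockets, while span puts one process on each of the four sockets in the allocation.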

Another mapper was also added to the system. The "ppr" mapper takes a string argument detailing the number of processes to be placed on each resource, with the supported resources again including all those specified above. For example, a string of "4:node,2:socket,1:core" would tell the mapper to place one process on every core in the allocation, with a maximum of 2 on each socket and 4 on each node.
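
As a rough illustration of how such a string can be read, the Python sketch below parses the example and works out how many processes would land on one node. It reflects only one reading of the example above (each count is a per-resource cap), and the function names and the topology arguments are hypothetical, not part of OMPI.

```python
def parse_ppr(spec):
    """Parse a string such as "4:node,2:socket,1:core" into a dict of caps."""
    limits = {}
    for item in spec.split(","):
        count, resource = item.split(":")
        limits[resource.strip()] = int(count)
    return limits


def procs_per_node(limits, sockets_per_node, cores_per_socket):
    """Processes placed on one node, assuming node, socket and core caps are all given."""
    # each core takes limits["core"] processes, capped per socket
    per_socket = min(cores_per_socket * limits["core"], limits["socket"])
    # the sockets of a node together are capped per node
    return min(sockets_per_node * per_socket, limits["node"])
```

Under this reading, a node with 2 sockets of 4 cores each receives 4 processes (2 per socket), and a node with a single 4-core socket receives only 2.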

Assigning process ranks has a corresponding --rank-by <foo> option, with all the same values for <foo> as found for mapping (including the use of "slot" as the default). This option is available through the MCA parameter "rmaps_base_ranking_policy". The ranking option has two qualifiers:

* span - similar to the mapping qualifier, this causes the ranks to be assigned across all specified resources as if they were a single node

* fill - assign ranks sequentially to all processes on the given resource before moving to the next one, filling all such resources on each node before moving to the next.
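
The two ranking qualifiers can be contrasted with a small Python sketch. This is a toy model of my reading of the descriptions above, not the actual ranking code: the function name and the dictionary keyed by (node, socket) are assumptions, and the sketch only covers ranking by socket.

```python
def rank_by_socket(procs_on, qualifier):
    """Assign ranks to already-mapped processes.

    procs_on: dict mapping (node, socket) -> number of processes mapped
    there, listed node by node (hypothetical representation).
    Returns a dict mapping (node, socket) -> list of ranks.
    """
    ranks = {key: [] for key in procs_on}
    next_rank = 0
    if qualifier == "fill":
        # consecutive ranks on one socket, then the next socket of the
        # same node, and only then the next node
        for key, count in procs_on.items():
            for _ in range(count):
                ranks[key].append(next_rank)
                next_rank += 1
    elif qualifier == "span":
        # one rank per socket in turn, across every node in the allocation
        remaining = dict(procs_on)
        while any(remaining.values()):
            for key in procs_on:
                if remaining[key] > 0:
                    ranks[key].append(next_rank)
                    next_rank += 1
                    remaining[key] -= 1
    else:
        raise ValueError("unknown qualifier: %s" % qualifier)
    return ranks
```

With two processes on each of four sockets spread over two nodes, "fill" gives the first socket ranks 0 and 1, while "span" gives it ranks 0 and 4.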

Please note that several convenience options were retained for backward compatibility:

* --pernode, --npernode N, --npersocket N: the npersocket option now binds the processes to their mapped socket unless another binding option was specified

* --bind-to-core, --bind-to-socket

* --bynode, --byslot

All three options (mapping, ranking, binding) can be used in any combination. Thus, you can assign a mapping pattern, pick any option for assigning ranks, and pick any option for binding. For example, you could map-by socket, rank-by core, and bind-to numa. As a result, there are a very large number of ways to arrange your application.
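
For those who launch jobs from scripts, the combination above can be assembled with a few lines of Python. This is only a sketch: the helper name and the application path ./my_app are placeholders, and the helper simply strings together the three options described in this message without checking the values.

```python
import shlex


def mpirun_command(app, nprocs, map_by=None, rank_by=None, bind_to=None):
    """Build an mpirun argument list from the three placement options."""
    cmd = ["mpirun", "-np", str(nprocs)]
    options = (("--map-by", map_by), ("--rank-by", rank_by), ("--bind-to", bind_to))
    for flag, value in options:
        if value is not None:  # leave unset options at their defaults
            cmd += [flag, value]
    cmd.append(app)
    return cmd


# ./my_app is a hypothetical application name
cmd = mpirun_command("./my_app", 8, map_by="socket", rank_by="core", bind_to="numa")
print(shlex.join(cmd))
```

The list form can be handed directly to subprocess.run, which avoids shell quoting problems.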

I realize all this flexibility can be confusing and a little overwhelming. I am working to provide more documentation on the OMPI wiki site, but it isn't done yet. I will let people know when it is completed.