Open MPI Development Mailing List Archives

Subject: [OMPI devel] RFC: make hwloc first-class data
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-09-22 21:08:36

WHAT: Make hwloc a 1st class item in OMPI

WHY: At least 2 pieces of new functionality want/need to use the hwloc data

WHERE: Put it in ompi/hwloc

WHEN: Some time in the 1.5 series

TIMEOUT: Tues teleconf, Oct 5 (about 2 weeks from now)


A long time ago, I floated the proposal of putting hwloc at the top level in opal so that parts of OPAL/ORTE/OMPI could use the data directly. I didn't have any concrete suggestions at the time about what exactly would use the hwloc data -- just a feeling that "someone" would want to.

There are now two solid examples of functionality that want to use hwloc data directly:

1. Sandia + ORNL are working on a proposal for MPI_COMM_SOCKET, MPI_COMM_NUMA_NODE, MPI_COMM_CORE, etc. (those names may not be the right ones, but you get the idea). That is, pre-defined communicators that contain all the MPI processes on the same socket as you, the same NUMA node as you, the same core as you, and so on.

2. INRIA presented a paper at EuroMPI last week that takes process distance to NICs into account when computing the long-message splitting ratio for the PML. E.g., if we have 2 openib NICs with the same bandwidth, don't just assume that we'll split long messages 50-50 across both of them. Instead, use NUMA distances to influence the ratio. See the paper here:
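
To make item 2 a bit more concrete, here is a minimal sketch of a distance-weighted split. It is not the algorithm from the INRIA paper and not existing PML code: the struct, the function name, and the inverse-distance weighting (bandwidth divided by NUMA distance) are all assumptions chosen only to illustrate the idea. In a real implementation the distances would presumably come from hwloc.

```c
#include <stddef.h>

/* Hypothetical description of one NIC as seen from the calling process.
 * Neither this struct nor the function below exists in Open MPI or hwloc. */
struct nic_info {
    double bandwidth;     /* NIC bandwidth in any consistent unit, must be > 0 */
    double numa_distance; /* relative process-to-NIC distance, must be > 0
                             (e.g. 10 = local, 20 = remote) */
};

/* Fill ratios[i] with the fraction of a long message to send over NIC i.
 * ASSUMPTION: weight = bandwidth / distance, normalized so the ratios sum
 * to 1. This is an illustrative heuristic, not the paper's model.
 * Returns 0 on success, -1 on invalid input. */
int split_ratios(const struct nic_info *nics, size_t n, double *ratios)
{
    double total = 0.0;
    size_t i;

    if (nics == NULL || ratios == NULL || n == 0) {
        return -1;
    }
    for (i = 0; i < n; ++i) {
        if (nics[i].bandwidth <= 0.0 || nics[i].numa_distance <= 0.0) {
            return -1;
        }
        ratios[i] = nics[i].bandwidth / nics[i].numa_distance;
        total += ratios[i];
    }
    for (i = 0; i < n; ++i) {
        ratios[i] /= total;
    }
    return 0;
}
```

With two NICs of equal bandwidth at distances 10 and 20, this weighting gives roughly a 2/3 to 1/3 split instead of 50-50, with the larger share going to the closer NIC.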

A previous objection was that we are increasing our dependencies by making hwloc be a 1st-class entity in OPAL -- we're hosed if hwloc ever goes out of business. Fair enough. But that being said, hwloc is getting a bit of a community growing around it: vendors are submitting patches for their hardware, distros are picking it up, etc. I certainly can't predict the future, but hwloc looks in good shape for now. There is a little risk in depending on hwloc, but I think it's small enough to be ok.

Cisco does need to be able to compile OPAL/ORTE without hwloc, however (for embedded environments where hwloc simply takes up space and adds no value). I previously proposed wrapping a subset of the hwloc API with opal_*() functions. After thinking about that a bit, that seems like a lot of work for little benefit -- how does one decide *which* subset of hwloc should be wrapped?

Instead, it might be worthwhile to simply put hwloc up in ompi/hwloc (instead of opal/hwloc). Indeed, the 2 places that want to use hwloc are up in the MPI layer -- I'm guessing that most functionality that wants hwloc will be up in MPI. And if we do the build system right, we can have paffinity/hwloc and libmpi's hwloc all link against the same libhwloc_embedded so that:

a) there's no duplication in the process, and
b) paffinity/hwloc can still be compiled out with the usual mechanisms to avoid having hwloc in OPAL/ORTE for embedded environments

(there's a little hand-waving there, but I think we can figure out the details)

We *may* want to refactor paffinity and maffinity someday, but that's not necessarily what I'm proposing here.


Jeff Squyres