Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: make hwloc first-class data
From: George Bosilca (bosilca_at_[hidden])
Date: 2010-09-23 12:38:36


On Sep 22, 2010, at 21:08 , Jeff Squyres wrote:

> WHAT: Make hwloc a 1st class item in OMPI
>
> WHY: At least 2 pieces of new functionality want/need to use the hwloc data
>
> WHERE: Put it in ompi/hwloc
>
> WHEN: Some time in the 1.5 series
>
> TIMEOUT: Tues teleconf, Oct 5 (about 2 weeks from now)
>
> --------------------------------------------------------------------------------
>
> A long time ago, I floated the proposal of putting hwloc at the top level in opal so that parts of OPAL/ORTE/OMPI could use the data directly. I didn't have any concrete suggestions at the time about what exactly would use the hwloc data -- just a feeling that "someone" would want to.
>
> There are now two solid examples of functionality that want to use hwloc data directly:
>
> 1. Sandia + ORNL are working on a proposal for MPI_COMM_SOCKET, MPI_COMM_NUMA_NODE, MPI_COMM_CORE, ...etc. (those names may not be the right ones, but you get the idea). That is, pre-defined communicators that contain all the MPI procs on the same socket as you, the same NUMA node as you, the same core as you, ...etc.
>
> 2. INRIA presented a paper at Euro MPI last week that takes process distance to NICs into account when coming up with the long-message splitting ratio for the PML. E.g., if we have 2 openib NICs with the same bandwidth, don't just assume that we'll split long messages 50-50 across both of them. Instead, use NUMA distances to influence calculating the ratio. See the paper here: http://hal.archives-ouvertes.fr/inria-00486178/en/
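
As an illustration of item 1 only, the grouping behind such pre-defined communicators can be sketched as below. This is not the Sandia/ORNL proposal and not Open MPI code: the `placements` table, its keys, and both function names are hypothetical, and the sketch assumes that socket and core ids are unique within a node. In a real implementation the ids would come from hwloc and the grouping would be done with something like a communicator split keyed on the same tuple.

```python
from collections import defaultdict


def group_by_locality(placements, level):
    """Group ranks that share the same hardware object at `level`.

    `placements` (hypothetical) maps rank -> {"node": ..., "socket": ..., "core": ...}.
    Ids are assumed to be unique within a node, so the group key is
    the pair (node, id at the requested level).
    """
    groups = defaultdict(list)
    for rank, where in sorted(placements.items()):
        groups[(where["node"], where[level])].append(rank)
    return dict(groups)


def communicator_for(rank, placements, level):
    """Return the sorted list of ranks sharing `level` with `rank`."""
    where = placements[rank]
    return group_by_locality(placements, level)[(where["node"], where[level])]


# Made-up placement of four ranks on two nodes.
placements = {
    0: {"node": "a", "socket": 0, "core": 0},
    1: {"node": "a", "socket": 0, "core": 1},
    2: {"node": "a", "socket": 1, "core": 2},
    3: {"node": "b", "socket": 0, "core": 0},
}
```

Here `communicator_for(0, placements, "socket")` would contain ranks 0 and 1, while rank 3 ends up alone because it sits on a different node even though its socket id is also 0.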

While the paper is interesting, I don't agree with the approach. It is a minor improvement over what we have today, in the sense that it splits the load between networks more sensibly by accounting for NUMA distance. However, it is a static approach that does not take the global load on the network into account, so it is a benchmark-type improvement. I would prefer that we bring back our dynamic scheduling, which, in addition to the capabilities of each network, took into account the speed at which data actually flowed through each of them (and thus the current load on the network).
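
The difference between the two approaches can be sketched roughly as follows. This is only an illustration: the weighting `bandwidth / distance` is an assumption and not the formula from the INRIA paper, the dynamic variant is not the scheduling code Open MPI used to have, and both function names are hypothetical.

```python
def static_split(bandwidths, numa_distances):
    """Static ratio: fixed once from hardware properties.

    Assumed weighting (not taken from the paper): each NIC's weight is
    its nominal bandwidth divided by its NUMA distance to the process.
    """
    weights = [bw / dist for bw, dist in zip(bandwidths, numa_distances)]
    total = sum(weights)
    return [w / total for w in weights]


def dynamic_split(bytes_sent, seconds_elapsed):
    """Dynamic ratio: recomputed from the throughput actually observed.

    A NIC that is currently slow (for example because the network is
    loaded) gets a smaller share of the next long message.
    """
    rates = [b / t for b, t in zip(bytes_sent, seconds_elapsed)]
    total = sum(rates)
    return [r / total for r in rates]


# Two NICs with equal nominal bandwidth; the second is twice as far away.
static_ratio = static_split([10.0, 10.0], [10, 20])

# Same NICs, but the second one took four times as long to move the
# same amount of data during the last transfer.
dynamic_ratio = dynamic_split([100.0, 100.0], [1.0, 4.0])
```

With these made-up numbers the static ratio never changes between messages, whereas the dynamic ratio shifts toward the first NIC as soon as the second one slows down.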

> A previous objection was that we are increasing our dependencies by making hwloc be a 1st-class entity in OPAL -- we're hosed if hwloc ever goes out of business. Fair enough. But that being said, hwloc is getting a bit of a community growing around it: vendors are submitting patches for their hardware, distros are picking it up, etc. I certainly can't predict the future, but hwloc looks in good shape for now. There is a little risk in depending on hwloc, but I think it's small enough to be ok.

That is the same level of risk as libevent going out of business, and we still depend on it.

> Cisco does need to be able to compile OPAL/ORTE without hwloc, however (for embedded environments where hwloc simply takes up space and adds no value). I previously proposed wrapping a subset of the hwloc API with opal_*() functions. After thinking about that a bit, that seems like a lot of work for little benefit -- how does one decide *which* subset of hwloc should be wrapped?
>
> Instead, it might be worthwhile to simply put hwloc up in ompi/hwloc (instead of opal/hwloc). Indeed, the 2 places that want to use hwloc are up in the MPI layer -- I'm guessing that most functionality that wants hwloc will be up in MPI. And if we do the build system right, we can have paffinity/hwloc and libmpi's hwloc all link against the same libhwloc_embedded so that:
>
> a) there's no duplication in the process, and
> b) paffinity/hwloc can still be compiled out with the usual mechanisms to avoid having hwloc in OPAL/ORTE for embedded environments
>
> (there's a little hand-waving there, but I think we can figure out the details)

Before making a decision, I would love to hear more technical details about this rather than just hand-waving, simply because we all realize this is very difficult to do in a portable way.

  george.

>
> We *may* want to refactor paffinity and maffinity someday, but that's not necessarily what I'm proposing here.
>
> Comments?
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/