Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: make hwloc first-class data
From: Kenneth Lloyd (kenneth.lloyd_at_[hidden])
Date: 2010-09-24 09:57:54

I would support making hwloc a first class element (for what it's worth, and
ompi/hwloc makes sense).

The INRIA paper is interesting and insightful but incomplete. It is, however,
consistent with some of our findings. The NUMA computational fabrics for various
codes / data combinations may be learned by unsupervised means through a
TWEANN (topology and weight evolving artificial neural network) and regular
patterns encoded in a structure called a connective, compositional pattern
producing network (CPPN), optimizing effectiveness with efficiency. We
found this necessary to compute on small CPU / GPU (hybrid) asymmetrical

However, this is still experimental. The development trajectory has to
consider the logical evolution from existing to the eventual.

Kenneth A. Lloyd
CEO - Director of Systems Science
Watt Systems Technologies Inc.

-----Original Message-----
From: devel-bounces_at_[hidden] [mailto:devel-bounces_at_[hidden]] On
Behalf Of George Bosilca
Sent: Thursday, September 23, 2010 10:39 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] RFC: make hwloc first-class data

On Sep 22, 2010, at 21:08 , Jeff Squyres wrote:

> WHAT: Make hwloc a 1st class item in OMPI
> WHY: At least 2 pieces of new functionality want/need to use the hwloc
> WHERE: Put it in ompi/hwloc
> WHEN: Some time in the 1.5 series
> TIMEOUT: Tues teleconf, Oct 5 (about 2 weeks from now)

> A long time ago, I floated the proposal of putting hwloc at the top level
in opal so that parts of OPAL/ORTE/OMPI could use the data directly.  I
didn't have any concrete suggestions at the time about what exactly would
use the hwloc data -- just a feeling that "someone" would want to.
> There are now two solid examples of functionality that want to use hwloc
data directly:
> 1. Sandia + ORNL are working on a proposal for MPI_COMM_SOCKET,
MPI_COMM_NUMA_NODE, MPI_COMM_CORE, ...etc. (those names may not be the right
ones, but you get the idea).  That is, pre-defined communicators that
contain all the MPI procs on the same socket as you, the same NUMA node as
you, the same core as you, ...etc.
> 2. INRIA presented a paper at Euro MPI last week that takes process
distance to NICs into account when coming up with the long-message splitting
ratio for the PML.  E.g., if we have 2 openib NICs with the same bandwidth,
don't just assume that we'll split long messages 50-50 across both of them.
Instead, use NUMA distances to influence calculating the ratio.  See the
paper here:
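
To make the first example above concrete, here is a minimal sketch of the grouping behind a per-socket communicator. It is not the Sandia/ORNL proposal and not Open MPI code: `same_socket_ranks` and the `socket_of_rank` table are hypothetical names, and the socket index of each rank is assumed to have been obtained from hwloc beforehand. In a real MPI program each rank would more likely pass its own socket index as the `color` argument of `MPI_Comm_split`; the plain-C model below only shows which ranks that call would put together.

```c
#include <stddef.h>

/* Hypothetical helper (not an Open MPI or hwloc function).
 * socket_of_rank[r] is the socket index that rank r is bound to,
 * assumed to have been discovered through hwloc.
 * Writes into `members` every rank that shares a socket with rank `me`,
 * in ascending rank order, and returns how many were written.
 * `members` must have room for nranks entries. */
size_t same_socket_ranks(const int *socket_of_rank, size_t nranks,
                         size_t me, size_t *members)
{
    size_t count = 0;
    for (size_t r = 0; r < nranks; r++) {
        /* Same "color" means same communicator, as with MPI_Comm_split. */
        if (socket_of_rank[r] == socket_of_rank[me]) {
            members[count++] = r;
        }
    }
    return count;
}
```

The same grouping applies to the NUMA-node and core variants by swapping the table for the corresponding hwloc object index.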
While the paper is interesting, I don't agree with the approach. It is a
minor improvement over what we have today, in the sense that it will
better split the load between networks based on the NUMA distance. However,
this is a static approach that does not take into account the global load on
the network, and therefore it is a benchmark-type improvement. I would
prefer that we bring back our dynamic scheduling, which, in addition to the
capabilities of the network, took into account the speed at which the data
flowed through each one of them (and thus the current
load on the network).
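
The difference between the two approaches discussed above can be sketched as follows. This is an illustration under stated assumptions, not the formula from the INRIA paper and not the PML's scheduling code: both function names are hypothetical, the static weight `bandwidth / numa_distance` is simply one plausible choice, and the dynamic variant assumes a per-NIC throughput measurement is already available.

```c
#include <stddef.h>

/* Hypothetical static split: weight each NIC by its nominal bandwidth
 * divided by the NUMA distance from the process to that NIC, then
 * normalize so the ratios sum to 1. The weighting is an assumption
 * made for illustration. All distances must be > 0 and n > 0. */
void static_split(const double *bandwidth, const double *numa_distance,
                  size_t n, double *ratio)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        ratio[i] = bandwidth[i] / numa_distance[i];
        total += ratio[i];
    }
    for (size_t i = 0; i < n; i++) {
        ratio[i] /= total;
    }
}

/* Hypothetical dynamic split: weight each NIC by the throughput that
 * was recently observed on it, so a loaded network gets a smaller
 * share. The sum of observed_throughput must be > 0. */
void dynamic_split(const double *observed_throughput, size_t n,
                   double *ratio)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        total += observed_throughput[i];
    }
    for (size_t i = 0; i < n; i++) {
        ratio[i] = observed_throughput[i] / total;
    }
}
```

With two NICs of equal bandwidth, the static version moves away from 50-50 only when the NUMA distances differ, and its result never changes at run time. The dynamic version recomputes the ratios whenever new throughput measurements arrive.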
> A previous objection was that we are increasing our dependencies by making
hwloc be a 1st-class entity in OPAL -- we're hosed if hwloc ever goes out of
business.  Fair enough.  But that being said, hwloc is getting a bit of a
community growing around it: vendors are submitting patches for their
hardware, distros are picking it up, etc.  I certainly can't predict the
future, but hwloc looks in good shape for now.  There is a little risk in
depending on hwloc, but I think it's small enough to be ok.
Same level of risk as if libevent goes out of business, and we still depend
on it.
> Cisco does need to be able to compile OPAL/ORTE without hwloc, however
(for embedded environments where hwloc simply takes up space and adds no
value).  I previously proposed wrapping a subset of the hwloc API with
opal_*() functions.  After thinking about that a bit, that seems like a lot
of work for little benefit -- how does one decide *which* subset of hwloc
should be wrapped?
> Instead, it might be worthwhile to simply put hwloc up in ompi/hwloc
(instead of opal/hwloc).  Indeed, the 2 places that want to use hwloc are up
in the MPI layer -- I'm guessing that most functionality that wants hwloc
will be up in MPI.  And if we do the build system right, we can have
paffinity/hwloc and libmpi's hwloc all link against the same
libhwloc_embedded so that:
> a) there's no duplication in the process, and 
> b) paffinity/hwloc can still be compiled out with the usual mechanisms to
avoid having hwloc in OPAL/ORTE for embedded environments
> (there's a little hand-waving there, but I think we can figure out the
Before making a decision, I would love to hear more technical details about
this instead of just hand-waving, simply because we all realize this is
very difficult to do in a portable way.
> We *may* want to refactor paffinity and maffinity someday, but that's not
necessarily what I'm proposing here.
> Comments?
> -- 
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to: