Hardware Locality Development Mailing List Archives

Subject: [hwloc-devel] distances branch
From: Brice Goglin (Brice.Goglin_at_[hidden])
Date: 2011-01-11 11:37:36


Hello,

The "distances" branch will be a candidate for merging into trunk in the
near future.

As of hwloc 1.1, we read NUMA distances from the OS/BIOS on Linux,
Solaris and OSF. It gives us a distance matrix that we use to group
objects that are physically closer (see tests/linux/16ia64-8n2s.output
or tests/linux/256ia64-64n2s2c.output).

Here's what's new in the distances branch:

1) We still get the same info from the OS/BIOS, but we now expose it to
the application. In the interface, any object may contain a distance
matrix between its children (or grand-children or ...). Usually, the
root object contains the distances between all NUMA nodes. But it could
also be a socket containing the distances between all its cores. The
distances are normalized floats. The distance from an object to itself
is 1. The distance to any other object is >=1.

The structure looks like this:

struct hwloc_distances_s {
  unsigned relative_depth; /**< \brief Relative depth of the considered objects
                                 * below the object containing this distance information. */
  unsigned nbobjs; /**< \brief Number of objects considered in the matrix.
                                 * It is the number of descendant objects at \p relative_depth
                                 * below the containing object.
                                 * It corresponds to the result of hwloc_get_nbobjs_inside_cpuset_by_depth. */

  float *latency; /**< \brief Matrix of latencies between objects, stored as a one-dimension array.
                                 * Values are normalized to get 1.0 as the minimal value in the matrix.
                                 * Latency from i-th to j-th object is stored in slot i*nbobjs+j. */
  float latency_max; /**< \brief The maximal value in the matrix. */
  float latency_base; /**< \brief The multiplier that should be applied to matrix values
                                 * to retrieve the original OS-provided latencies.
                                 * Usually 10 on Linux since ACPI SLIT uses 10 for local latency.
                                 */
};
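
As a rough illustration of the layout described in the comments above,
here is a minimal sketch of how an application might read one entry of
the matrix and recover the OS-provided value. The structure is copied
locally so the sketch compiles without hwloc, and the two helper names
(distances_get, distances_get_os) are hypothetical, not part of the
branch.

```c
#include <assert.h>
#include <stddef.h>

/* Local copy of the structure quoted above, only so that this sketch
 * compiles on its own. Real code would get it from the hwloc headers. */
struct hwloc_distances_s {
  unsigned relative_depth;
  unsigned nbobjs;
  float *latency;
  float latency_max;
  float latency_base;
};

/* Hypothetical helper: normalized latency from the i-th to the j-th
 * object, using the documented slot i*nbobjs+j. */
float distances_get(const struct hwloc_distances_s *d, unsigned i, unsigned j)
{
  assert(d != NULL && i < d->nbobjs && j < d->nbobjs);
  return d->latency[i * d->nbobjs + j];
}

/* Hypothetical helper: original OS-provided latency, obtained by
 * applying the latency_base multiplier to the normalized value. */
float distances_get_os(const struct hwloc_distances_s *d, unsigned i, unsigned j)
{
  return distances_get(d, i, j) * d->latency_base;
}
```

For a 2-node machine whose SLIT reports 10 locally and 20 remotely, the
matrix would hold {1, 2, 2, 1} with latency_base 10, so
distances_get_os() gives back 10 and 20.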

2) On some machines, the OS/BIOS doesn't provide any distances. It's
still possible to feed hwloc with user-given distances, after the
topology is initialized and before it is loaded, with the following
function (or with an environment variable):

int hwloc_topology_set_distance_matrix(hwloc_topology_t topology,
                                       hwloc_obj_type_t type, unsigned nbobjs,
                                       unsigned *os_index, float *distances);
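
A possible way to prepare the "distances" argument is sketched below.
Whether the function expects normalized values is not stated here, so
this sketch simply assumes the same convention as in point 1 (minimum
value 1.0). The helper normalize_distances is hypothetical and not part
of hwloc. The call sequence in the trailing comment assumes the usual
hwloc_topology_init()/hwloc_topology_load() pair and HWLOC_OBJ_NODE as
the NUMA node type.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of hwloc): divide every entry of a raw
 * nbobjs x nbobjs matrix by its smallest entry, so that the minimum of
 * the output is 1.0. Returns the divisor, i.e. the value one would
 * multiply by to get the raw numbers back. */
float normalize_distances(const unsigned *raw, float *out, unsigned nbobjs)
{
  unsigned n = nbobjs * nbobjs;
  unsigned k, min;

  assert(raw != NULL && out != NULL && nbobjs > 0);

  /* Find the smallest raw value (normally the local latency). */
  min = raw[0];
  for (k = 1; k < n; k++)
    if (raw[k] < min)
      min = raw[k];
  assert(min > 0);

  for (k = 0; k < n; k++)
    out[k] = (float) raw[k] / (float) min;
  return (float) min;
}

/* Assumed usage, between topology init and load:
 *
 *   unsigned os_index[2] = { 0, 1 };
 *   unsigned raw[4] = { 10, 20, 20, 10 };
 *   float distances[4];
 *   normalize_distances(raw, distances, 2);
 *   hwloc_topology_init(&topology);
 *   hwloc_topology_set_distance_matrix(topology, HWLOC_OBJ_NODE, 2,
 *                                      os_index, distances);
 *   hwloc_topology_load(topology);
 */
```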

Here's what's *not* in the branch:

1) Right now, the grouping code needs very "clean" distances when
grouping objects (it doesn't know that 2.0 and 2.05 are likely equal),
but we could certainly make this less strict. You could even imagine
benchmarking the machine to measure latencies between all cores and
having hwloc generate the complete hierarchy from those distances
(instead of discovering sockets and cores from the OS). This can be
added later.
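
To make the "less strict" idea concrete, one option would be a
relative-tolerance comparison such as the sketch below. This is not
what the grouping code does today. The function name and the choice of
a relative (rather than absolute) tolerance are assumptions for
illustration only.

```c
#include <assert.h>

/* Hypothetical helper: treat two normalized distances as equal when
 * they differ by at most rel_tol times the larger of the two.
 * Normalized distances are >= 1, so both inputs must be positive. */
int distances_nearly_equal(float a, float b, float rel_tol)
{
  float diff, scale;

  assert(a > 0.0f && b > 0.0f && rel_tol >= 0.0f);

  diff = (a > b) ? (a - b) : (b - a);
  scale = (a > b) ? a : b;
  return diff <= rel_tol * scale;
}
```

With a 5% tolerance, 2.0 and 2.05 would fall in the same group while
1.0 and 2.0 would not. A tolerance of 0 gives back the current exact
comparison.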

2) All the above is actually about latencies between objects. It does
not cover the interconnection graph (or the number of hops) between
objects. This could also be represented as a distance matrix like above
(with integer values starting at 0 instead of float normalized to 1.0),
but it would be meaningless on current HyperTransport generations
(there's a single route between HT localities, it may or may not be the
shortest physical path between them, and it may vary with the type of
packet, ...). More thinking needed here, and it may make us revise the
"latency" names in the above "struct hwloc_distances_s".

Hope all this makes sense, any comments appreciated.
Brice