
Hardware Locality Development Mailing List Archives


Subject: Re: [hwloc-devel] #23: network topology support and v1.0 semantic fixes
From: Samuel Thibault (samuel.thibault_at_[hidden])
Date: 2010-01-07 13:35:34


Hello,

Happy new year btw :D

Considering future network topology support, I believe we probably need
to fix a couple of things before releasing 1.0. Just to sum up a
bunch of points that have been raised in the past months:

- there should be a way to have the complete topology in just one tree,
  so you can browse it and assign tasks/processes/whatever in it,
  according to architectural details provided by hwloc, but also network
  details like bandwidth etc.
- the core of hwloc mustn't force any kind of tools, it must be easy
  to either build something around hwloc detection and binding
  functions, or load detection & binding plugins.

The way I see it is to provide a hwloc_topology_combine() function that
takes a series of hwloc_topology_t trees and an object type,
and builds a tree that contains a new object of that type, under which
the trees appear. That combination can actually already be done by
hand by concatenating XML files. For instance, on a simple cluster you'd
run lstopo on each machine and save XML files, load them together,
combine them under a "network" object (being able to register dynamic
object types should be easy), and save the result as an XML file, which
thus contains the complete topology of the cluster. A task dispatcher
can thus browse it at will etc. When it comes to binding, it'd be
the task dispatcher's role to first launch the application on the target
machine, and run hwloc there to perform the actual binding, according to
the cpuset in the tree.
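
To make the combining idea concrete, here is a minimal sketch on a toy
tree. It does not use hwloc at all: toy_obj_t, toy_combine() and the
TOY_* types are hypothetical names invented for illustration, and an
actual hwloc_topology_combine() may well look different. It only shows
the shape of the operation: a new root of a caller-chosen type, the
existing trees as its children, and no cpuset on the new root.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-ins for hwloc types; none of these names exist in hwloc. */
typedef enum { TOY_SYSTEM, TOY_MACHINE, TOY_NETWORK } toy_type_t;

typedef struct toy_obj {
    toy_type_t type;
    const unsigned long *cpuset;  /* NULL when no cpuset is defined */
    struct toy_obj **children;
    unsigned arity;
} toy_obj_t;

/* Build a new root of the given type above the roots of n existing trees.
 * The trees are referenced, not copied. The new root's cpuset is left
 * NULL, because CPU indexes of different machines would conflict.
 * Returns NULL on allocation failure. */
toy_obj_t *toy_combine(toy_obj_t **roots, unsigned n, toy_type_t type)
{
    assert(roots != NULL || n == 0);

    toy_obj_t *top = calloc(1, sizeof(*top));
    if (!top)
        return NULL;
    top->type = type;
    top->cpuset = NULL;

    if (n > 0) {
        top->children = malloc(n * sizeof(*top->children));
        if (!top->children) {
            free(top);
            return NULL;
        }
        memcpy(top->children, roots, n * sizeof(*roots));
    }
    top->arity = n;
    return top;
}

/* Free only the node created by toy_combine(); the subtrees still
 * belong to the caller. */
void toy_combine_free(toy_obj_t *top)
{
    if (top) {
        free(top->children);
        free(top);
    }
}
```

A dispatcher would call toy_combine() with one root per machine and
TOY_NETWORK as the type, walk the children to pick a machine, and only
then look at that machine's own cpuset.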

Now, coming to semantic changes:
- The top node of the tree wouldn't necessarily be a system object.
  Actually, always having a top object of the system type does not
  provide any useful information :), and it makes us duplicate fields
  between system and machine. On usual (non-Kerrighed) machines, the top
  node would just be machine. On Kerrighed systems, the top node would
  be system. On networked systems, the top node would be a switch or the
  Internet :)
  As a consequence, hwloc_get_system_obj would have to be renamed to
  hwloc_get_root_obj.
- Objects of network trees may not have cpusets defined (trees obtained
  directly from hwloc with default parameters would still have cpusets
  on every node, however). It does not make sense to merge cpusets of
  different machines (they will conflict), and things like shifting
  cpusets to be able to merge them would probably only bring issues.
  That being said, nothing prevents writing a transparency
  plugin that automatically discovers the network topology, shifts
  cpusets, and when requested for binding, automatically migrates to
  the machine according to the shift, and uses the underlying OS hooks
  to perform the binding. My point is that the hwloc combining operation
  wouldn't fix cpusets itself and would leave them NULL. The caller of
  the combining operation will be responsible for that.
- This also means there can't be "global" cpusets like the recently
  added hwloc_topology_get_{topology,complete,online,allowed}_cpuset
  functions (not released yet). These can just be moved to the hwloc_obj
  structure, thus being available for each object, which could actually be
  helpful btw.
- Helpers that take cpuset parameters of course don't make sense any more
  when applied to networked topologies. But it probably doesn't make
  sense for the caller to call them in the first place, and the caller
  knows it since it's the caller that has first called the combining
  operation or loaded an XML file resulting from it.
  If, however, at some point (after having distributed tasks between
  machines for instance), operations with cpusets are desired, we could
  provide a duplication function that takes a topology object parameter
  A and builds a new topology tree containing all the objects under
  A, A thus being its root, and then (if A indeed has a cpuset, but
  the caller should know that) helpers taking cpuset parameters can be
  called.
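
The duplication function from the last point could look roughly like
the sketch below. Again this is a toy model, not hwloc code: toy_obj_t,
toy_dup_subtree() and toy_can_use_cpuset_helpers() are made-up names,
and the cpuset is reduced to a single unsigned long with a flag
standing in for "cpuset is NULL". The idea is only that the caller
extracts the subtree under A as a standalone tree and checks that its
root has a cpuset before calling cpuset-based helpers.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy object: has_cpuset == 0 models a NULL cpuset. Hypothetical type,
 * not part of hwloc. */
typedef struct toy_obj {
    int has_cpuset;
    unsigned long cpuset;
    struct toy_obj **children;
    unsigned arity;
} toy_obj_t;

/* Recursively free a tree built by toy_dup_subtree(). */
void toy_free(toy_obj_t *obj)
{
    if (!obj)
        return;
    for (unsigned i = 0; i < obj->arity; i++)
        toy_free(obj->children[i]);
    free(obj->children);
    free(obj);
}

/* Deep-copy the subtree rooted at a, so that the copy of a becomes the
 * root of a new, independent tree. Returns NULL on allocation failure. */
toy_obj_t *toy_dup_subtree(const toy_obj_t *a)
{
    assert(a != NULL);

    toy_obj_t *copy = calloc(1, sizeof(*copy));
    if (!copy)
        return NULL;
    copy->has_cpuset = a->has_cpuset;
    copy->cpuset = a->cpuset;

    if (a->arity > 0) {
        copy->children = calloc(a->arity, sizeof(*copy->children));
        if (!copy->children) {
            free(copy);
            return NULL;
        }
        for (unsigned i = 0; i < a->arity; i++) {
            copy->children[i] = toy_dup_subtree(a->children[i]);
            if (!copy->children[i]) {
                copy->arity = i;  /* free only the children copied so far */
                toy_free(copy);
                return NULL;
            }
        }
    }
    copy->arity = a->arity;
    return copy;
}

/* Cpuset-based helpers only make sense if the root has a cpuset. */
int toy_can_use_cpuset_helpers(const toy_obj_t *root)
{
    return root != NULL && root->has_cpuset;
}
```

With a network root that has no cpuset, toy_can_use_cpuset_helpers()
is false on the combined tree, but becomes true on the tree returned
by toy_dup_subtree() for one of the machines below it.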

So, to sum it up:
- hwloc_get_obj_by_depth(topo, 0, 0) may not be a system object any
  more (actually it'd only be one on Kerrighed systems).
- no global cpuset field, only in objects.

The second point shouldn't do any harm, it's just a matter of fixing the
(not yet released) API. The first point clearly contradicts the current
documentation (“HWLOC_OBJ_SYSTEM will always be the highest”),
but I believe that not breaking it right now would hold us back from
further extensions, and I don't think much code relies on it anyway.

The plan I see is that for 1.0 we only check that concatenating XML files
by hand to build misc levels representing network layers does indeed
work, which should mean that actual combining functions etc. should be
possible to implement later.

Please comment/disagree/agree :)

Samuel