Hardware Locality Development Mailing List Archives

Subject: Re: [hwloc-devel] #23: network topology support and v1.0 semantic fixes
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-01-11 10:12:00


Been mulling this for a few days; here are my thoughts...

On Jan 7, 2010, at 1:35 PM, Samuel Thibault wrote:

> Considering future network topology support, I believe we probably need
> to fix a couple of things before releasing 1.0. Just to sum up a
> bunch of points that have been raised in the past months:
>
> - there should be a way to have the complete topology in just one tree,
> so you can browse in it and assign tasks/process/whatever in it,
> according to architectural details provided by hwloc, but also network
> details like bandwidth etc.

Are you thinking of adding bandwidth attributes? Or are you thinking of adding weighting between objects in the hierarchy? Or ...?

> - the core of hwloc mustn't force any kind of tools, it must be easy
> to either build something around hwloc detection and binding
> functions, or load detection & binding plugins.
>
> The way I see it is to provide a hwloc_topology_combine() function that
> takes a series of several hwloc_topology_t trees and an object type,
> and builds a tree that contains a new object of that type, under which
> the trees appear. That combination can actually already be done by
> hand by catenating xml files. For instance, on a simple cluster you'd
> run lstopo on each machine and save xml files, load them together,
> combine them under a "network" object (being able to register dynamic
> object types should be easy), and save the result as an xml file, which
> thus contains the complete topology of the cluster. A task dispatcher
> can thus browse it at will etc. When it comes to binding, it'd be
> the task dispatcher's role to first launch the application on the target
> machine, and there run hwloc to perform the actual binding, according to
> the cpuset in the tree.

All sounds good.
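
For illustration, here is a minimal sketch of what such a combine step could look like. It uses a toy object type rather than the real hwloc structures: toy_obj and toy_combine are made-up names, not hwloc API, and the cpuset is just a string placeholder. The new top object gets no cpuset, as proposed above.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for a topology object; NOT the real hwloc_obj. */
typedef struct toy_obj {
    const char *type;             /* e.g. "Machine", "Network" */
    const char *cpuset;           /* NULL when the object has no cpuset */
    struct toy_obj *parent;
    struct toy_obj *first_child;
    struct toy_obj *next_sibling;
} toy_obj;

/* Hypothetical combine: create a new object of the given type and hang
 * the given tree roots under it, in order. The new object's cpuset is
 * left NULL; the caller decides what to do about cpusets. */
toy_obj *toy_combine(const char *type, toy_obj **roots, size_t nroots)
{
    assert(type != NULL);
    assert(roots != NULL || nroots == 0);

    toy_obj *top = calloc(1, sizeof *top);   /* all fields start NULL */
    if (top == NULL)
        return NULL;
    top->type = type;

    toy_obj **link = &top->first_child;
    for (size_t i = 0; i < nroots; i++) {
        roots[i]->parent = top;
        roots[i]->next_sibling = NULL;
        *link = roots[i];                    /* append to the child list */
        link = &roots[i]->next_sibling;
    }
    return top;
}
```

Note that this sketch re-parents the input trees instead of copying them; a real implementation would have to pick one or the other.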

> Now, coming to semantic changes:
> - The top node of the tree wouldn't necessarily be a system object.
> Actually, always giving the top object the system type does not
> provide any useful information :), and it makes us duplicate fields
> between system and machine. On usual (non-Kerrighed) machines, the top
> node would just be machine. On Kerrighed systems, the top node would
> be system. On networked systems, the top node would be a switch or the
> Internet :)
> As a consequence, hwloc_get_system_obj would have to be renamed to
> hwloc_get_root_obj.
> - Objects of network trees may not have cpusets defined (Trees obtained
> directly from hwloc with default parameters would still have cpusets
> on every node however). It does not make sense to merge cpusets of
> different machines (they will conflict), and things like shifting
> cpusets to be able to merge them would probably only bring issues.
> That being said, that does not prevent anyone from writing a transparency
> plugin that automatically discovers the network topology, shifts
> cpusets, and when requested for binding, automatically migrates to
> the machine according to the shift, and uses the underlying OS hooks
> to perform the binding. My point is that the hwloc combining operation
> wouldn't fix cpusets itself and would leave them NULL. The caller of the
> combining operation will be responsible for that.

More generally -- some objects can be bound to, some cannot. I assume (per Brice's reply) that we can't bind to PCI objects, so I think making this a full generalization is probably a good thing (especially as hwloc can understand/map more and more kinds of objects).

How does this kind of thing extend to, say, co-processors (such as accelerators, FPGAs, GPGPUs, etc.)?

> - This also means there can't be "global" cpusets like the recently
> added hwloc_topology_get_{topology,complete,online,allowed}_cpuset
> functions (not released yet). These can just be moved to the hwloc_obj
> structure, thus being available for each object, which could actually be
> helpful btw.

I'm not sure I follow -- you say that we can't have "global" cpusets anymore (which I grok), but then you say that we can move them to the hwloc_obj struct. Isn't that the top-level struct? I probably misunderstand here.

> - Helpers that take cpuset parameters of course don't make sense any more
> when applied to networked topologies. But it probably doesn't make
> sense for the caller to call them in the first place, and the caller
> knows it since it's the caller that has first called the combining
> operation or loaded an XML file resulting from it.

Agreed. Perhaps we should have a general query function that can return whether a given object can be bound to or not (e.g., for generic tree-traversal kinds of functionality)...?
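
Something along these lines, as a rough sketch only. It assumes "bindable" simply means "has a non-NULL cpuset", which is my guess at the rule and not anything hwloc defines today; toy_obj, toy_obj_is_bindable and toy_count_bindable are made-up names on a toy tree type.

```c
#include <stddef.h>

/* Toy stand-in for a topology object; NOT the real hwloc_obj. */
typedef struct toy_obj {
    const char *type;             /* e.g. "Machine", "Network" */
    const char *cpuset;           /* NULL when the object has no cpuset */
    struct toy_obj *parent;
    struct toy_obj *first_child;
    struct toy_obj *next_sibling;
} toy_obj;

/* Hypothetical query: nonzero if the object can be bound to.
 * Assumption: an object is bindable exactly when it has a cpuset. */
int toy_obj_is_bindable(const toy_obj *obj)
{
    return obj != NULL && obj->cpuset != NULL;
}

/* Generic traversal that uses the query: count the bindable objects
 * in the tree rooted at obj (obj itself included). */
size_t toy_count_bindable(const toy_obj *obj)
{
    if (obj == NULL)
        return 0;

    size_t n = toy_obj_is_bindable(obj) ? 1 : 0;
    for (const toy_obj *c = obj->first_child; c != NULL; c = c->next_sibling)
        n += toy_count_bindable(c);
    return n;
}
```

With a query like this, tree-walking code would not need to know whether it was handed a single-machine tree or a combined network tree.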

> If, however, at some point (after having distributed tasks between
> machines for instance), operations with cpusets are desired, we could
> provide a duplication function that takes a topology object parameter
> A and builds a new topology tree containing all the objects under
> A, A thus being its root, and then (if A indeed has a cpuset, but
> the caller should know that) helpers taking cpuset parameters can be
> called.
>
> So, to sum it up:
> - hwloc_get_obj_by_depth(topo, 0, 0) may not be a system object any
> more (actually it'd only be one on kerrighed systems).
> - no global cpuset field, only in objects.

Some generic points...

1. How about defining a small set of generic operations based on what you described above? E.g.:

- copy: take a tree with root R; copy it to a new tree (note that R may not be the root of the original tree)
- remove: take a tree with root R; find object X within that tree; remove X and all of its children
- insert: take two trees with roots R and S; find object X within R; copy tree S to become a new child of X
- ...?
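
To make the three operations above concrete, here is a sketch on a toy tree type. None of these names (toy_obj, toy_copy, toy_remove, toy_insert, toy_free) exist in hwloc; the "find object X" step is left out and X is passed in directly; strings are shared between copies instead of being duplicated.

```c
#include <stdlib.h>

/* Toy stand-in for a topology object; NOT the real hwloc_obj. */
typedef struct toy_obj {
    const char *type;
    const char *cpuset;           /* NULL when the object has no cpuset */
    struct toy_obj *parent;
    struct toy_obj *first_child;
    struct toy_obj *next_sibling;
} toy_obj;

/* Free obj and everything below it (siblings of obj are untouched). */
void toy_free(toy_obj *obj)
{
    if (obj == NULL)
        return;
    for (toy_obj *c = obj->first_child; c != NULL; ) {
        toy_obj *next = c->next_sibling;
        toy_free(c);
        c = next;
    }
    free(obj);
}

/* copy: deep-copy the tree rooted at r into a new, parentless tree.
 * r may be an interior object of a larger tree. */
toy_obj *toy_copy(const toy_obj *r)
{
    if (r == NULL)
        return NULL;
    toy_obj *copy = calloc(1, sizeof *copy);
    if (copy == NULL)
        return NULL;
    copy->type = r->type;
    copy->cpuset = r->cpuset;

    toy_obj **link = &copy->first_child;
    for (const toy_obj *c = r->first_child; c != NULL; c = c->next_sibling) {
        toy_obj *cc = toy_copy(c);
        if (cc == NULL) {
            toy_free(copy);       /* undo the partial copy */
            return NULL;
        }
        cc->parent = copy;
        *link = cc;
        link = &cc->next_sibling;
    }
    return copy;
}

/* remove: unlink x from its parent, then free x and all its children. */
void toy_remove(toy_obj *x)
{
    if (x == NULL)
        return;
    if (x->parent != NULL) {
        toy_obj **link = &x->parent->first_child;
        while (*link != NULL && *link != x)
            link = &(*link)->next_sibling;
        if (*link == x)
            *link = x->next_sibling;
    }
    toy_free(x);
}

/* insert: copy tree s and append the copy as the last child of x.
 * Returns 0 on success, -1 on failure. */
int toy_insert(toy_obj *x, const toy_obj *s)
{
    if (x == NULL || s == NULL)
        return -1;
    toy_obj *copy = toy_copy(s);
    if (copy == NULL)
        return -1;
    copy->parent = x;

    toy_obj **link = &x->first_child;
    while (*link != NULL)
        link = &(*link)->next_sibling;
    *link = copy;
    return 0;
}
```

A combine operation could then be expressed as creating a new root object and calling insert once per machine tree.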

> The second point shouldn't harm, it's just a matter of fixing the (not
> yet released) API. The first point clearly contradicts the current
> documentation (“HWLOC_OBJ_SYSTEM will always be the highest”),
> but I believe not breaking it now will keep us from further
> extensions, and I don't think much code relies on it anyway.

Agreed.

> The plan I see is that for 1.0 we only check that catenating .XML files
> by hand to build misc levels representing network layers does indeed
> work, which should mean that actual combining functions etc. should be
> possible to implement later.

FWIW, I'd prefer to see the combining/etc. functions ASAP -- we could definitely use such things in Open MPI...

-- 
Jeff Squyres
jsquyres_at_[hidden]