On Wed, 2010-07-28 at 20:36 +0200, Brice Goglin wrote:
> Actually, all these distance matrices (except the NUMA nodes' one, the
> one not included above) show a ring topology without the link between
> the first and the last object. So grouping makes no sense there. hwloc
> 1.0.x groups object #2N with object #2N+1 because its grouping algorithm
> isn't very clever. It could also link #2N-1 with #2N, it wouldn't be
> worse. The grouping algorithm is more clever in svn trunk. It detects
> this ring properly and does not group anything (except pairs of NUMA nodes).
> It's actually surprising that this machine doesn't show a better
> distance matrix. I guess SGI still has a hypercube or whatever nice
> topology interconnecting IRUs and blades. Older Altix machines had very
> nice distance matrices where we would detect multiple levels of groups
> that really showed the physical hierarchy of blades/IRUs/... I wonder if
> your SGI BIOS is buggy :)
> Michael Raymond, anything to say about this?
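To illustrate the point about open rings, here is a minimal sketch in Python. It is not hwloc's actual code: the function names (chain_distances, greedy_pairs, min_distance_groups) and the synthetic distance values are assumptions made for illustration only. It compares a naive pairing with a grouping based on the transitive closure of minimal-distance links, applied to a ring that lacks the link between its first and last object.

```python
def chain_distances(n, step=1):
    """Synthetic distance matrix for a ring of n objects whose closing
    link (first <-> last) is missing, i.e. a simple chain."""
    return [[abs(i - j) * step for j in range(n)] for i in range(n)]


def greedy_pairs(dist):
    """Naive grouping: pair each ungrouped object with its nearest
    ungrouped neighbour, visiting objects in index order."""
    n = len(dist)
    used = set()
    pairs = []
    for i in range(n):
        if i in used:
            continue
        candidates = [j for j in range(n) if j != i and j not in used]
        if not candidates:
            break
        # Ties are broken by the lower index, which is an arbitrary choice.
        j = min(candidates, key=lambda k: (dist[i][k], k))
        pairs.append((i, j))
        used.update((i, j))
    return pairs


def min_distance_groups(dist):
    """Group objects by the transitive closure of the links that carry
    the minimal off-diagonal distance (union-find)."""
    n = len(dist)
    dmin = min(dist[i][j] for i in range(n) for j in range(n) if i != j)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] == dmin:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    ring = chain_distances(6)
    print(greedy_pairs(ring))         # [(0, 1), (2, 3), (4, 5)]
    print(min_distance_groups(ring))  # [[0, 1, 2, 3, 4, 5]]
```

On the chain, the naive pairing returns (0,1), (2,3), (4,5), although (1,2), (3,4) would be equally justified. The closure-based variant ends up with a single group containing every object, which a caller can take as "nothing worth grouping".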
Here is the answer from Alexis Cousein from SGI regarding UV topology:
> The first UV flavour indeed uses a routerless topology, not a fat-tree
> one.
> Basically, the system has paired nodes with dual NUMALink5 connectors
> between them, on the signal backplane.
> Then, on each pair, the even node is used to make "horizontal" rings
> (across the four pairs in an IRU half, possibly extending to IRU
> halves at the same height on other racks) and the odd node is used
> to make "vertical" rings (connecting all the odd nodes together that
> are at the same left-right position, on the four IRU halves in
> a rack, possibly extending to a rack that's a lot further if e.g.
> the H-V ring structure is 8-8).
> All these rings, though, should actually be closed, or you have
> missing cables or nodes. The machines are designed to still
> work with those rings broken (if e.g. you pull a blade out) but
> most of these breakages have large performance implications for
> some remote memory accesses that would use the broken
> links on a completely cabled system.
> There are other open rings that are normal, though (if e.g.
> you go from even to odd, vertical to another odd and then back to
> even again, you have a ring that is not closed because the even
> nodes have no vertical connections corresponding to that of the
> odd nodes).
> There is actually another topology possible that looks much more
> like Altix4700, but that will use routers that will only become
> available at the end of the year (and, of course, there is quite some
> extra cost associated with them).
> When we use a batch scheduler, for one rack systems, the node sets
> we tend to use are:
> -memory (half a blade)
> -bladepair ( blades N and N+1 for N even)
> -IRU quadrant
> -IRU half (upper or lower)
> although you could build some extra ones that also make sense.
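The routerless layout described above can be sketched as a small graph model. This is only an illustration under assumptions: it covers a single rack with four pairs per IRU half and four IRU halves, ignores rings that extend to other racks, and uses a hypothetical node label (half, pair, parity) that is not an SGI naming scheme. It counts NUMAlink hops with a breadth-first search and shows how an open horizontal ring lengthens a path.

```python
from collections import deque

PAIRS_PER_HALF = 4   # "the four pairs in an IRU half"
HALVES_PER_RACK = 4  # "the four IRU halves in a rack"


def build_rack(halves=HALVES_PER_RACK, pairs=PAIRS_PER_HALF, close_rings=True):
    """Return an adjacency dict for one rack.
    A node is (half, pair, parity); parity 0 = even node, 1 = odd node."""
    adj = {(h, p, k): set()
           for h in range(halves) for p in range(pairs) for k in (0, 1)}

    def link(a, b):
        adj[a].add(b)
        adj[b].add(a)

    for h in range(halves):
        for p in range(pairs):
            # Link between the two nodes of a pair.
            link((h, p, 0), (h, p, 1))

    # "Horizontal" rings: even nodes of the pairs within one IRU half.
    for h in range(halves):
        for p in range(pairs - 1):
            link((h, p, 0), (h, p + 1, 0))
        if close_rings and pairs > 2:
            link((h, pairs - 1, 0), (h, 0, 0))

    # "Vertical" rings: odd nodes at the same left-right position.
    for p in range(pairs):
        for h in range(halves - 1):
            link((h, p, 1), (h + 1, p, 1))
        if close_rings and halves > 2:
            link((halves - 1, p, 1), (0, p, 1))

    return adj


def hop_counts(adj, src):
    """Breadth-first search: number of hops from src to every node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


if __name__ == "__main__":
    closed = build_rack()
    broken = build_rack(close_rings=False)
    a, b = (0, 0, 0), (0, 3, 0)
    print(hop_counts(closed, a)[b])  # 1 hop with the ring closed
    print(hop_counts(broken, a)[b])  # 3 hops with the ring open
```

In this model the two even nodes at the ends of a horizontal ring are one hop apart when the ring is closed and three hops apart when it is open, which matches the remark that broken rings penalize some remote memory accesses.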
He basically explains the "network" topology, which is nevertheless
similar to the grouping of NUMA nodes on this SMP machine.
In my opinion, the job hwloc does in forming "groups" is basically OK.
The group content also makes sense.
The only "strange" thing is that the grouping code gets confused on
this particular machine, which contains only 3/4 of the NUMA nodes that
are found in a fully equipped rack. Physically, the 2nd enclosure is only
half filled. I'm wondering what would happen in a fully equipped rack.
Will there be 4xGroup3, or 2xGroup4 with 2xGroup3 each? My feeling is
that the latter should happen. This also means that the current machine
should have 2xGroup4, where the 1st one has 2xGroup3, and the 2nd has
Dr. Bernd Kallies
Konrad-Zuse-Zentrum für Informationstechnik Berlin