Hardware Locality Development Mailing List Archives


Subject: [hwloc-devel] hwloc with Xen system support - some initial code
From: Andrew Cooper (andrew.cooper3_at_[hidden])
Date: 2013-12-30 19:31:20


After much hacking I have some code, which I present here for
comment/query/criticism, with some ramblings and queries of my own.

Code can be found here:;a=commitdiff;h=6c37406dae887386205124ab9151b9be5812b56a

For anyone wishing to try it out for themselves, there is an extra patch
required for libxc, available here:;a=commitdiff;h=3585994405b6a73c137309dd4be91f48c71e4903
(Basically, the existing xc_{topology,numa}info() library calls were
practically useless, and required the library user to perform the
hypercall bounce buffering themselves, without access to half the bounce
infrastructure. I have rewritten them in such a way that libxc does all
the appropriate bounce buffering.)
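
For anyone unfamiliar with the term, the bounce-buffering pattern the rewritten calls now hide can be sketched as follows. This is only a hand-wavy illustration, not libxc code: `fake_hypercall` and `xc_fake_info` are made-up names, and a real implementation would use locked/pinned hypercall-safe memory rather than plain malloc.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a hypercall that fills a caller-provided
 * buffer; real code would pin/lock the buffer and trap into Xen. */
static int fake_hypercall(void *buf, size_t len)
{
    memset(buf, 0xAB, len);
    return 0;
}

/* The bounce-buffering pattern: the wrapper owns a hypercall-safe
 * scratch buffer, copies the results back out, and the caller never
 * touches the bounce infrastructure at all. */
static int xc_fake_info(void *user_out, size_t len)
{
    void *bounce = malloc(len);   /* real code: locked memory */
    int rc;

    if (!bounce)
        return -1;
    rc = fake_hypercall(bounce, len);
    if (rc == 0)
        memcpy(user_out, bounce, len);
    free(bounce);
    return rc;
}
```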

First of all, I have hacked at the m4, by copying surrounding code, and
it now appears to work sanely for me, including --{en,dis}able-xen
configure options. I have no idea whether what I have done is appropriate.

Xen support itself is only usable if explicitly requested, via the
presence of the HWLOC_XEN environment variable. The Xen backend has a
higher priority than native, and excludes all other CPU topology
gathering modules, as the native OS will see the fake topology, and the
x86 module has no idea which pcpus its vcpus are currently executing on,
so can't map cpuid results back to reality.
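
The gating itself is nothing more exotic than an environment-variable check; a minimal sketch (the function name here is mine, not the patch's):

```c
#include <stdlib.h>

/* Only activate the Xen backend when explicitly requested: a Xen
 * system topology is meaningless (and misleading) for an ordinary
 * process running inside a domain. */
static int xen_backend_wanted(void)
{
    return getenv("HWLOC_XEN") != NULL;
}
```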

As for developing the backend, the documentation was a little lacking.
It would have vastly helped had there been a sentence describing how one
is expected to build the topology. What I realised, after far too long
staring at the spaghetti code in other backends, was that
hwloc_insert_object_by_cpuset() appears to be my friend, and
discovering it subsequently made the process very simple.
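
To illustrate what makes it simple (this is a schematic of the idea only, not hwloc's implementation): every object carries a cpuset, and an inserted object's parent is the smallest existing object whose cpuset strictly contains it, so a backend can create PUs, cores, sockets and nodes in any order and let the core library assemble the tree. Modelling cpusets as plain 64-bit masks:

```c
#include <stdint.h>

/* Toy model of cpuset-driven insertion.  Each object's parent is the
 * smallest already-inserted object whose cpuset is a strict superset
 * of its own -- the property hwloc_insert_object_by_cpuset() lets a
 * backend rely on. */
#define MAX_OBJS 32

struct obj {
    const char *name;
    uint64_t cpuset;   /* bit N set => contains PU N */
    int parent;        /* index into objs[], -1 for the root */
};

static struct obj objs[MAX_OBJS];
static int nobjs;

static int popcount64(uint64_t x)
{
    int n = 0;
    for (; x; x &= x - 1)
        n++;
    return n;
}

static int insert_by_cpuset(const char *name, uint64_t cpuset)
{
    int i, best = -1;

    for (i = 0; i < nobjs; i++) {
        uint64_t s = objs[i].cpuset;
        if ((s & cpuset) == cpuset && s != cpuset &&
            (best < 0 || popcount64(s) < popcount64(objs[best].cpuset)))
            best = i;
    }
    objs[nobjs].name = name;
    objs[nobjs].cpuset = cpuset;
    objs[nobjs].parent = best;
    return nobjs++;
}
```

Inserting a machine with cpuset 0xF, sockets 0x3 and 0xC, then PUs 0x1 and 0x4 parents each PU under the correct socket, regardless of insertion order.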

At the moment, topology-xen appears capable of constructing the PUs,
cores, sockets and numa nodes. I have a stack of queries and bugfixes
against Xen which I shall bring up on xen-devel in due course. Chief
among them is that the way Xen currently signals an offline PU
is to trash its location information in the system topology. This means
that I can identify a specific PU as being offline, but can only infer
its position in the topology as I happen to know Xen writes the records
sequentially. Another is that Xen enumerates the cores per-socket,
rather than on an absolute numbering scheme. There is a gross hack to
"fix" this in hwloc_get_xen_info(), but it is only valid for homogeneous
systems.

Somewhat expectedly, Xen has no interface for gathering cache
information. dom0 can gather cache information using the cpuid
instruction, but can't map a particular cpuid call to a specific
PU. The information would certainly be useful, so I have half a mind to add a
new Xen sysctl hypercall of "please execute cpuid with these parameters
on that specific cpu and hand me the results". That would at least
provide an ability to get the information.
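
If such a hypercall existed, decoding the result would be mechanical: for Intel's deterministic cache parameters (cpuid leaf 4), each size field is stored minus one and the total cache size is their product. A sketch of the decode (the hypercall itself remains hypothetical):

```c
#include <stdint.h>

/* Decode a cache size from cpuid leaf 4 output (Intel deterministic
 * cache parameters).  EBX packs line size (bits 11:0), partitions
 * (21:12) and ways (31:22), each stored minus one; ECX holds the set
 * count minus one.  Size = ways * partitions * line_size * sets. */
static uint64_t cache_size_from_leaf4(uint32_t ebx, uint32_t ecx)
{
    uint64_t line_size  = (ebx & 0xFFFu) + 1;
    uint64_t partitions = ((ebx >> 12) & 0x3FFu) + 1;
    uint64_t ways       = ((ebx >> 22) & 0x3FFu) + 1;
    uint64_t sets       = (uint64_t)ecx + 1;

    return ways * partitions * line_size * sets;
}
```

For example, an 8-way cache with 64-byte lines, one partition and 64 sets decodes to 32 KiB.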

Completely expectedly (as this is still an open problem upstream), Xen
has no ability to work out the NUMA affinity of IO devices.

Dom0 in Xen is a bit of a strange system. It has almost
completely-unrestricted access to all the PCI devices in the system, and
access to the real ACPI tables and special BIOS areas in low RAM. Xen
controls all interrupts, the virtualisation hardware and the serial
UART; dom0 has the rest. Dom0 has an explicitly faked up cpu topology
via cpuid, and a non-contiguous address space for its RAM, yet real
cache information via cpuid, and a mostly non-faked feature set.

Most importantly, Xen does not have an AML VM; it can read the static
ACPI tables, but relies on dom0 to provide any information required by
executing AML, which includes any _PXM attributes for IO devices. Xen
itself has little/no use for the _PXM information as it has no device
drivers, yet dom0 doesn't have an accurate view of the CPU/RAM topology
with which to associate the _PXM information. (It is all quite a mess)

Attached are two (non-io) xml outputs from the most interesting AMD and
Intel servers I could easily put my hands on from our testing pool. As
far as I am aware, the drawn diagrams match my understanding of how
the server topology is organised.

You might notice that the xml is a little thin on details. One problem
I have is how to integrate things like the DMI information. I know for
certain that the linux component will get the correct DMI information
(as dom0 gets the real SMBIOS tables), but it is not in an easy form to
consume from outside the linux module. Then comes the question of how
to use the *BSD DMI information on BSD systems, which can use the xen
component as-is. One idea would be to have a "native dmi information"
function pointer which can be optionally implemented, but that would be
somewhat of an architecture overhaul. I also suspect it would require
access to the native component's private data, which doesn't appear to
exist for the duration of Xen's topology gathering.
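
As a sketch of the function-pointer idea (every name here is hypothetical; nothing like this exists in hwloc today):

```c
#include <string.h>

struct dmi_info {
    char vendor[64];
    char product[64];
};

/* Optionally filled in by whichever native OS component is built. */
typedef int (*native_dmi_fn)(struct dmi_info *out);
static native_dmi_fn native_dmi_hook;   /* NULL if nothing registered */

/* The Xen component just calls through the hook if one exists. */
static int gather_dmi(struct dmi_info *out)
{
    if (!native_dmi_hook)
        return -1;           /* no native provider: omit DMI details */
    return native_dmi_hook(out);
}

/* Example provider the linux component might register. */
static int example_linux_dmi(struct dmi_info *out)
{
    strcpy(out->vendor, "ExampleVendor");
    strcpy(out->product, "ExampleServer");
    return 0;
}
```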

One thing I noticed was about allowed_{cpu,node}set. From what I can
gather, this is to do with process binding, which is meaningless in the
context of the Xen system topology. What is the approved way of setting
these all to 0?

A problem I encountered was the difference between cpuset, online_cpuset
and complete_cpuset. I can see that online_cpuset is likely to be a
subset of the others, but I can't find a definitive difference between
the cpuset and complete_cpuset. Can anyone enlighten me?

What is the canonical way of stating that a certain PU is offline? Xen
doesn't really do offline cpus of its own accord at the moment, but does
have hypercalls to explicitly online and offline cpus. In the case that
one is offline, I suspect my current code will cause the PU to fall out
of the rest of the topology, rather than stay within it marked as
offline.

Anyway - I think I should stop rambling. I would appreciate code
review/comments on the code itself, particularly with whether I am using
the API correctly (it's all backwards WRT how the docs are written).