
Subject: Re: [hwloc-devel] hwloc with Xen system support - some initial code
From: Brice Goglin (Brice.Goglin_at_[hidden])
Date: 2014-01-01 03:25:53


On 31/12/2013 01:31, Andrew Cooper wrote:
> Hello,
>
> After much hacking I have some code, which I present here for
> comment/query/criticism, with some ramblings and queries of my own.
>
> Code can be found here:
> http://xenbits.xen.org/gitweb/?p=people/andrewcoop/hwloc.git;a=commitdiff;h=6c37406dae887386205124ab9151b9be5812b56a
>
> For anyone wishing to try it out for themselves, there is an extra patch
> required for libxc, available here:
> http://xenbits.xen.org/gitweb/?p=people/andrewcoop/xen.git;a=commitdiff;h=3585994405b6a73c137309dd4be91f48c71e4903
> (Basically, the existing xc_{topology,numa}info() library calls were
> practically useless, and required the library user to perform the
> hypercall bounce buffering themselves, without access to half the bounce
> infrastructure. I have rewritten them in such a way that libxc does all
> the appropriate bounce buffering.)

Hello,

Overall, the code looks good to me.
Maybe the core/socket/node discovery code could be factored out to reduce
duplication.
There's a little fix attached to improve error management in instantiate().

> First of all, I have hacked at the m4, by copying surrounding code, and
> it now appears to work sanely for me, including --{en,dis}able-xen
> configure options. I have no idea whether what I have done is appropriate.

The attached patch should extend your code to also support building the
xen component as a plugin, so that we don't force a hard libxen dependency
inside the main hwloc library.

However, we need to export two additional functions (setup_pu_level and
alloc_obj_cpusets) to plugins first (Xen will be the first pluggable
component to do the "core" discovery, which is why these functions were
not exported earlier). I'll work on this and then we'll remove the
"HACK" #include from my patch.

> Xen support itself is only usable if explicitly requested, via the
> presence of the HWLOC_XEN environment variable. The Xen backend has a
> higher priority than native, and excludes all other CPU topology
> gathering modules, as the native OS will see the fake topology, and the
> x86 module has no idea which pcpus its vcpus are currently executing on,
> so can't map cpuid results back to reality.

If that environment variable is the way you want to enable Xen discovery
in the end, then priority 52 is a good solution. Maybe use 55 instead in
case we ever have to put something between native and Xen.

Regarding component flags, the usual way to exclude ALL other components
is to put ~0 instead of
    HWLOC_DISC_COMPONENT_TYPE_CPU | HWLOC_DISC_COMPONENT_TYPE_GLOBAL
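For illustration, the pieces fit together roughly like this (a sketch
against the hwloc 1.x plugin API from hwloc/plugins.h, not your actual
code; the xen_* function names and the component type are made up here,
and I'm assuming the usual field order type, name, excludes, instantiate,
priority, next):

    #include <stdlib.h>
    #include <hwloc/plugins.h>

    static int hwloc_xen_discover(struct hwloc_backend *backend); /* real discovery */

    static struct hwloc_backend *
    hwloc_xen_component_instantiate(struct hwloc_disc_component *component,
                                    const void *data1, const void *data2,
                                    const void *data3)
    {
      struct hwloc_backend *backend;

      /* only enable Xen discovery when explicitly requested */
      if (!getenv("HWLOC_XEN"))
        return NULL;

      backend = hwloc_backend_alloc(component);
      if (!backend)
        return NULL;
      backend->discover = hwloc_xen_discover;
      return backend;
    }

    static struct hwloc_disc_component hwloc_xen_disc_component = {
      HWLOC_DISC_COMPONENT_TYPE_GLOBAL, /* type, chosen here for illustration */
      "xen",
      ~0u,                              /* exclude ALL other components */
      hwloc_xen_component_instantiate,
      55,                               /* priority above native discovery */
      NULL
    };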

> As for developing the backend, the documentation was a little lacking.
> It would have vastly helped had there been a sentence describing how one
> is expected to build the topology. What I realised, after far too long
> staring at the spaghetti code in other backends, was that
> hwloc_insert_object_by_cpuset() appears to be my friend, which
> subsequently made the process very simple.

Yes, sorry: very few people develop new backends, so the documentation
doesn't really cover these internal details. The only doc about them is
basically the comments in hwloc/plugins.h.

We should also point to a trivial backend as an example that is easy to
read (certainly not Linux, which is horrible because it can use info from
sysfs, /proc/cpuinfo, the device-tree, etc.).
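For the record, the core of such a trivial backend is basically this (a
sketch using the hwloc 1.x internal API; the backend name and the
hard-coded PU count are made up):

    #include <hwloc/plugins.h>

    static int
    hwloc_mybackend_discover(struct hwloc_backend *backend)
    {
      struct hwloc_topology *topology = backend->topology;
      unsigned i;

      /* pretend the platform reported 4 PUs */
      for (i = 0; i < 4; i++) {
        hwloc_obj_t obj = hwloc_alloc_setup_object(HWLOC_OBJ_PU, i);
        obj->cpuset = hwloc_bitmap_alloc();
        hwloc_bitmap_set(obj->cpuset, i);
        /* the core places the object in the tree based on its cpuset */
        hwloc_insert_object_by_cpuset(topology, obj);
      }
      return 1;
    }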

> At the moment, topology-xen appears capable of constructing the PUs,
> cores, sockets and numa nodes. I have a stack of queries and bugfixes
> against Xen which I shall bring up on xen-devel in due course.

Crazy question: what if you put hwloc in the hypervisor and just passed
the XML output from the hypervisor to the guest, instead of passing all
this socket/node/core information?
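The hand-off itself would only need the existing public API, something
like (sketch, with a made-up file path):

    #include <hwloc.h>

    /* hypervisor side: dump the real topology to XML */
    static void export_side(hwloc_topology_t topology)
    {
      hwloc_topology_export_xml(topology, "/tmp/xen-topology.xml");
    }

    /* guest side: load that XML instead of doing native discovery */
    static void guest_side(void)
    {
      hwloc_topology_t t;
      hwloc_topology_init(&t);
      hwloc_topology_set_xml(t, "/tmp/xen-topology.xml");
      hwloc_topology_load(t);
      /* ... use t ... */
      hwloc_topology_destroy(t);
    }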

> Chief
> among them is that the way Xen currently signals an offline PU is to
> trash its location information in the system topology. This means that
> I can identify a specific PU as being offline, but can only infer its
> position in the topology because I happen to know Xen writes the
> records sequentially.

I think this goes in the "complete_cpuset". See below.

> You might notice that the xml is a little thin on details. One problem
> I have is how to integrate things like the DMI information. I know for
> certain that the linux component will get the correct DMI information
> (as dom0 gets the real SMBios tables), but it is not in an easy form to
> consume from outside the linux module. Then comes the question of how
> to use the *BSD DMI information on BSD systems, which can use the xen
> component as-is. One idea would be to have a "native dmi information"
> function pointer which could be optionally implemented, but that would
> be somewhat of an architecture overhaul.

DMI info isn't widely used anyway. Some people may use it to define
quirks for specific machines, but I don't think it's critical to have it
in Xen too.

> Also I suspect it would require
> access to the native component's private data, which doesn't appear to
> exist for the duration of Xen's topology gathering.

It exists. All backends are instantiated at once, then all the discovery
callbacks are called, then all backend data is destroyed. You're not
really supposed to access another backend's private data, but we already
had to do it in the linuxpci component anyway. So we would need to define
a proper solution if we go there.

> One thing I noticed was about allowed_{cpu,node}set. From what I can
> gather, this is to do with process binding, which is meaningless in the
> context of the Xen system topology. What is the approved way of setting
> these all to 0?

This is related to PUs/nodes that are restricted by Linux control groups.
If you just ignore these allowed fields, they should be set equal to
the main cpuset/nodeset.

> A problem I encountered was the difference between cpuset, online_cpuset
> and complete_cpuset. I can see that online_cpuset is likely to be a
> subset of the others, but I can't find a definitive difference between
> cpuset and complete_cpuset. Can anyone enlighten me?

These fields are also explained in hwloc.h.

cpuset => PUs that exist, we know where they are, they are online, and
they are allowed (allowed with respect to cgroups on Linux)
online_cpuset => exist, we know where they are, they are online, but they
may be DISALLOWED
allowed_cpuset => exist, we know where they are, they are allowed, but
they may be OFFLINE
complete_cpuset => exist, but they may be DISALLOWED/OFFLINE, and we may
not know where all of them are
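From the user side, the difference shows up once you load with the
WHOLE_SYSTEM flag. A small sketch with the public 1.x API, computing the
offline PUs as complete minus online:

    #include <hwloc.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
      hwloc_topology_t topology;
      hwloc_obj_t root;
      hwloc_bitmap_t offline = hwloc_bitmap_alloc();
      char *s;

      hwloc_topology_init(&topology);
      /* keep offline/disallowed PUs visible in the topology */
      hwloc_topology_set_flags(topology, HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM);
      hwloc_topology_load(topology);

      root = hwloc_get_root_obj(topology);
      /* offline PUs = complete minus online */
      hwloc_bitmap_andnot(offline, root->complete_cpuset, root->online_cpuset);
      hwloc_bitmap_asprintf(&s, offline);
      printf("offline PUs: %s\n", s);

      free(s);
      hwloc_bitmap_free(offline);
      hwloc_topology_destroy(topology);
      return 0;
    }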

> What is the canonical way of stating that a certain PU is offline? Xen
> doesn't really do offline cpus of its own accord at the moment, but it
> does have hypercalls to explicitly online and offline cpus. In the case
> that one is offline, I suspect my current code will cause the PU to fall
> out of the rest of the topology, rather than staying within it, marked
> differently.

What we do in topology-linux.c is clear the corresponding bit from the
top-level online_cpuset (hwloc_get_root_obj(topology)->online_cpuset).
The core then propagates that everywhere at the end of the discovery.
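In backend code that's roughly (sketch; n being the index of the PU that
Xen reports as offline):

    /* clear PU #n from the root object's online_cpuset; the core
     * propagates this to all objects at the end of discovery */
    hwloc_obj_t root = hwloc_get_root_obj(topology);
    hwloc_bitmap_clr(root->online_cpuset, n);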

Thanks for all your work, and happy new year!

Brice