Hardware Locality Development Mailing List Archives

Subject: Re: [hwloc-devel] hwloc powerpc rhel5 and power7 patch
From: Brice Goglin (Brice.Goglin_at_[hidden])
Date: 2010-09-16 01:41:21


On 16/09/2010 06:10, Alexey Kardashevskiy wrote:
> Hi!
>
> There are 2 problems with the current HWLOC code. The questions are at
> the bottom.
>
> 1. Old kernels (RHEL5.*) do expose some NUMA nodes via sysfs but there
> is no information regarding caches (L1/L2/L3) and CPU threads. RHEL6
> does provide it. The proposed patch parses PowerPC's /proc/device-tree and
> adds the necessary information to the topology.

Great!
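
For readers who have not looked at /proc/device-tree: the sketch below shows how one integer property could be decoded. It assumes the property is stored as big-endian 32-bit cells, which is the usual device-tree convention. The helper names are hypothetical, and the property name in the usage note is an assumption for illustration, not something taken from the patch.

```python
import struct
from pathlib import Path
from typing import List, Optional


def decode_cells(raw: bytes) -> List[int]:
    """Decode a property made of big-endian 32-bit cells into integers."""
    if len(raw) % 4 != 0:
        raise ValueError("property length is not a multiple of 4 bytes")
    return list(struct.unpack(f">{len(raw) // 4}I", raw))


def read_cells(cpu_node: Path, prop: str) -> Optional[List[int]]:
    """Return the cells of `prop` under `cpu_node`, or None if it is absent."""
    path = cpu_node / prop
    if not path.is_file():
        return None
    return decode_cells(path.read_bytes())
```

A caller might try something like `read_cells(cpu_dir, "d-cache-size")` on each CPU node directory and treat `None` as "no cache information here".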

> 2. HWLOC expects NUMA nodes to be numbered consecutively, like
> 1-2-3-4-5.... However, this is not necessarily true for PowerPC with
> LPARs or on systems with NUMA hotswap (do they exist? don't know).

Yes, I've never implemented any sparse-aware code since I have never
seen a sparse-numbered system :)

> This was before the patch:
>
> =========================
> os node 0 has cpuset 0xffffffff
> os node 1 has cpuset 0xffffffff,0x0
> os node 4 has cpuset 0xffffffff,,0x0
> os node 5 has cpuset 0xffffffff,,,0x0
> os node 8 has cpuset 0xffffffff,,,,0x0
> os node 9 has cpuset 0xffffffff,,,,,0x0
> os node 12 has cpuset 0xffffffff,,,,,,0x0
> os node 13 has cpuset 0xffffffff,,,,,,,0x0
> node distance matrix:
> 0 1 2 3 4 5 6 7 8 9 10 11 12 13
> 0 10 20 40 40 40 40 40 40 0 1 128 3596701896 0 1
> 1 20 10 40 40 40 40 40 40 4095 3642405872 4095 3642406288 0 65536
> 2 128 3596490848 4095 3642406160 4095 3642406048 0 0 128 3597792932 0 0 0 0
> 3 128 3598856792 0 0 0 0 0 1 0 218840 0 1 0 0
> 4 40 40 10 20 40 40 40 40 128 3596902928 128 3596700232 4095 3642406320
> 5 40 40 20 10 40 40 40 40 0 5 4095 3642406432 0 0
> 6 4095 3642406256 0 0 128 3596923832 256 276108416 4095 3642406272 0 0 0 0
> 7 0 0 0 0 256 276173984 128 3598846040 0 191376 128 3598846016 256 276108400
> 8 40 40 40 40 10 20 40 40 4095 2587230208 4095 2587260160 0 0
> 9 40 40 40 40 20 10 40 40 4095 3642406320 4095 3642406464 0 0
> 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 11 0 0 0 0 0 0 0 0 128 3596680064 128 3596687552 128 3596679872
> 12 40 40 40 40 40 40 10 20 128 3597793376 128 3598856792 128 3597315568
> 13 40 40 40 40 40 40 20 10 0 0 128 3598779600 429496729 2576980377
> distance matrix asymmetric ([0,2]=40 != [2,0]=128), aborting

Hmmm :)
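
In the log above, the plausible values seem to be the first eight entries of the rows for nodes 0, 1, 4, 5, 8, 9, 12 and 13, which suggests an 8x8 matrix ended up in storage indexed by OS node number. A minimal sketch of the sparse-aware alternative, assuming each present node reports one distance per present node: keep a map from OS index to a dense index and store the matrix densely. The function names are hypothetical and this is not the code from the patch.

```python
def dense_distances(os_nodes, rows):
    """Map sparse OS node numbers to dense indexes and build a dense matrix.

    os_nodes: sorted OS node numbers that are present, e.g. [0, 1, 4, 5].
    rows: dict from OS node number to its distance list, one per present node.
    """
    index_of = {os_idx: i for i, os_idx in enumerate(os_nodes)}
    n = len(os_nodes)
    matrix = []
    for os_idx in os_nodes:
        row = list(rows[os_idx])
        if len(row) != n:
            raise ValueError(f"node {os_idx}: expected {n} distances, got {len(row)}")
        matrix.append(row)
    return index_of, matrix


def is_symmetric(matrix):
    """Check that distance [i][j] equals distance [j][i] for all pairs."""
    n = len(matrix)
    return all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(i))


# The eight leading values of each present node's row in the log above.
OS_NODES = [0, 1, 4, 5, 8, 9, 12, 13]
ROWS = {
    0: [10, 20, 40, 40, 40, 40, 40, 40],
    1: [20, 10, 40, 40, 40, 40, 40, 40],
    4: [40, 40, 10, 20, 40, 40, 40, 40],
    5: [40, 40, 20, 10, 40, 40, 40, 40],
    8: [40, 40, 40, 40, 10, 20, 40, 40],
    9: [40, 40, 40, 40, 20, 10, 40, 40],
    12: [40, 40, 40, 40, 40, 40, 10, 20],
    13: [40, 40, 40, 40, 40, 40, 20, 10],
}
INDEX_OF, MATRIX = dense_distances(OS_NODES, ROWS)
```

With this layout the distance between OS nodes 4 and 5 is looked up as `MATRIX[INDEX_OF[4]][INDEX_OF[5]]`, and the symmetry check only ever touches initialized entries.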

> - where do I put IBM-specific code?

Is the device tree Linux-specific? If so, it can stay in the Linux file as
long as it's not 30k lines :) We already have both sysfs and
/proc/cpuinfo code there anyway.

> - maybe there is a better way to detect that no cache info was
> fetched from sysfs

That's something that's not clear to me yet. There will likely be other
cases in the future where we will fetch some info from different
backends, and merging them may not be easy. Do you think the device tree
generally contains more information than sysfs? If so, we could start by
disabling cache info from sysfs when a device-tree is found, and maybe
have a way to change that default (we already have a hidden environment
variable to use cpuinfo when sysfs is available).
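
To make the idea concrete, here is a rough sketch of such a default with an override. The environment variable name is made up for illustration; it is not the existing hidden variable and not something hwloc defines.

```python
import os

# Hypothetical variable name, for illustration only.
FORCE_SYSFS_ENV = "EXAMPLE_FORCE_SYSFS_CACHES"


def pick_cache_source(device_tree_found, environ=os.environ):
    """Prefer device-tree cache info when available, unless overridden."""
    if device_tree_found and not environ.get(FORCE_SYSFS_ENV):
        return "device-tree"
    return "sysfs"
```

Picking a single source per kind of information avoids having to merge two partial descriptions of the same caches, at the cost of ignoring whatever the other backend might know.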

> - is the coding style ok? :-)

It doesn't look bad.

One question though: Is the device tree completely save-able for
external reuse? We like being able to save /proc and /sys so as to debug
distant machines locally. Doing the same for the device tree would be
great. If so, could you send a tarball of a machine with sparse-numa
numbers? And we'll likely make gather-topology.sh store it too.
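
As a rough idea of what saving the device tree could look like, the sketch below archives every regular file under a directory into a tarball. It is not what gather-topology.sh does today; the function name is hypothetical, and it assumes the property files can simply be read as bytes. Each file is read first and its size set explicitly, so the archive does not depend on the size the filesystem reports.

```python
import io
import os
import tarfile


def snapshot_tree(root, archive):
    """Store every regular file under `root` in a gzip tarball.

    Paths inside the archive are relative to `root`. Symbolic links and
    empty directories are skipped. Returns the number of files stored.
    """
    count = 0
    with tarfile.open(archive, "w:gz") as tar:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                with open(path, "rb") as fh:
                    data = fh.read()
                info = tarfile.TarInfo(os.path.relpath(path, root))
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
                count += 1
    return count
```

On the target machine this would be called as something like `snapshot_tree("/proc/device-tree", "device-tree.tar.gz")`.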

> 2. Am I missing something in my patch that is needed to solve the problems
> mentioned at the beginning of this mail?

We'll see :)

Brice