Hardware Locality Development Mailing List Archives

Subject: Re: [hwloc-devel] hwloc-1.1 crash when missing a NUMAnode
From: Brice Goglin (Brice.Goglin_at_[hidden])
Date: 2011-02-07 10:48:56

On 07/02/2011 15:40, Bernd Kallies wrote:
> When setting HWLOC_IGNORE_DISTANCES=1, hwloc-1.1 does not crash on this
> machine, but produces a somewhat unusual topology.

Unusual but not so wrong given what the OS/BIOS says.
> Btw. the same topology
> error occurs when applying a trivial fix to the grouping code, namely
> --- src/topology.c.bak 2010-11-25 15:54:33.000000000 +0100
> +++ src/topology.c 2011-02-07 10:55:14.000000000 +0100
> @@ -269,6 +269,7 @@
>    memset(groupdistances, 0, sizeof(groupdistances));
>    for(i=0; i<nbobjs; i++)
>      for(j=0; j<nbobjs; j++)
> +      if(groupids[i] && groupids[j])
>        groupdistances[groupids[i]-1][groupids[j]-1] += (*distances)[i][j];
>    for(i=0; i<nbgroups; i++)
>      for(j=0; j<nbgroups; j++)

Your patch looks good, thanks.
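
For readers following along, here is a minimal standalone sketch of what the guard does. It assumes that a group id of 0 marks an object that was not placed in any group, which is what the added test suggests; in that case groupids[i]-1 would index outside the array. The function name, the MAX_GROUPS bound and the row-major distance layout are hypothetical, chosen only to make the sketch self-contained; this is not the actual code in src/topology.c.

```c
#include <assert.h>
#include <string.h>

#define MAX_GROUPS 8 /* hypothetical bound, only for this sketch */

/* Sum object-to-object distances into group-to-group distances.
 * Assumption: groupids[i] == 0 means object i belongs to no group,
 * and valid group ids are 1..nbgroups.
 * distances is an nbobjs x nbobjs matrix stored row-major. */
void accumulate_group_distances(unsigned nbobjs,
                                const unsigned *groupids,
                                const float *distances,
                                unsigned nbgroups,
                                float groupdistances[MAX_GROUPS][MAX_GROUPS])
{
  unsigned i, j;

  assert(nbgroups <= MAX_GROUPS);
  memset(groupdistances, 0, sizeof(float) * MAX_GROUPS * MAX_GROUPS);

  for (i = 0; i < nbobjs; i++)
    for (j = 0; j < nbobjs; j++)
      /* Skip pairs where either object is ungrouped; without this
       * test, groupids[i]-1 would be an out-of-range index. */
      if (groupids[i] && groupids[j]) {
        assert(groupids[i] <= nbgroups && groupids[j] <= nbgroups);
        groupdistances[groupids[i]-1][groupids[j]-1] += distances[i * nbobjs + j];
      }
}
```

With three objects where the middle one is ungrouped, only the four entries involving objects 0 and 2 contribute to the result.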

> The topology error is: the 1st NODE object contains 2 sockets, instead
> of one. The additional socket contains the processors of the lost node
> 29.
> I'm not sure how to deal with that, since /sys/devices/system/cpu is
> inconsistent with /sys/devices/system/node on this machine.
> A naive idea
> would be to implement such a consistency check. On this machine the node
> ID of a cpu is found in
> /sys/devices/system/cpu/cpu*/topology/physical_package_id

The physical package id is a socket id; it is not the same as the
memory node id. At least on Itanium machines (or any non-NUMA
pre-Nehalem Intel machine) you can have two physical package ids within
the same NUMA node. With AMD Magny-Cours you have two NUMA node ids
inside the same physical package id.
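
To illustrate why one id cannot stand in for the other, here is a small sketch that classifies how package ids and node ids relate across a set of CPUs. The enum, the function name and the sample id arrays in the usage note are made up for illustration; real values would have to be read from sysfs, which this sketch does not do.

```c
#include <stddef.h>

/* Hypothetical classification of the package/node relationship. */
enum pkg_node_relation {
  ONE_TO_ONE,          /* each package matches exactly one node */
  PACKAGES_SHARE_NODE, /* several packages inside one NUMA node */
  NODES_SHARE_PACKAGE, /* several NUMA nodes inside one package */
  MIXED                /* both situations occur */
};

/* pkg[i] is the physical package id of CPU i, node[i] its NUMA node id. */
enum pkg_node_relation classify_pkg_node(size_t ncpus,
                                         const int *pkg,
                                         const int *node)
{
  int pkgs_share_node = 0, nodes_share_pkg = 0;
  size_t i, j;

  /* Compare every pair of CPUs once. */
  for (i = 0; i < ncpus; i++)
    for (j = i + 1; j < ncpus; j++) {
      if (node[i] == node[j] && pkg[i] != pkg[j])
        pkgs_share_node = 1; /* same node, different packages */
      if (pkg[i] == pkg[j] && node[i] != node[j])
        nodes_share_pkg = 1; /* same package, different nodes */
    }

  if (pkgs_share_node && nodes_share_pkg)
    return MIXED;
  if (pkgs_share_node)
    return PACKAGES_SHARE_NODE;
  if (nodes_share_pkg)
    return NODES_SHARE_PACKAGE;
  return ONE_TO_ONE;
}
```

For example, four CPUs with package ids {0,0,1,1} and node ids {0,0,0,0} (the Itanium-like case) give PACKAGES_SHARE_NODE, while package ids {0,0,0,0} with node ids {0,0,1,1} (the Magny-Cours-like case) give NODES_SHARE_PACKAGE. A consistency check that expects the two ids to match would flag both layouts even though they are valid.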

I don't think there is an inconsistency between /sys/devices/system/cpu
and /node here. It's just bogus topology info that the kernel exposes
in a consistent manner.