On Mon, 2011-02-07 at 14:57 +0100, Brice Goglin wrote:
> Le 07/02/2011 14:34, Bernd Kallies a écrit :
> > Hello,
> >
> > we currently have some large SMP systems (SGI Ultraviolet, 64 NUMA nodes
> > and 1024 logical procs per OS instance).
> > After a reboot of one of them, the system came up with the memory of one
> > node missing. In particular, one of the pseudo-directories
> > /sys/devices/system/node/nodeXXX is missing (node29 in this case). All
> > CPUs (even those of node29) are found in /proc/cpuinfo and
> > /sys/devices/system/cpu/cpuXXX. It is currently not clear whether this
> > is a hardware, BIOS or kernel issue.
> >
> > In any event, applications based on hwloc 1.1 and hwloc 1.1.1 crash
> > with SIGSEGV while loading the topology of this machine. They run within
> > the Linux root cpuset, which contains cpus 0-1023 (all CPUs of the
> > system) and mems 0-28,30-63 (node 29 is missing).
> >
> > Here is the traceback of lstopo version 1.1:
> >
> > Program received signal SIGSEGV, Segmentation fault.
> > 0x00000000004119de in hwloc__setup_misc_level_from_distances (topology=0x531370, nbobjs=63, objs=0x7fffffffd450, _distances=0x7fffffff9630, depth=0)
> > at topology.c:272
> > 272        groupdistances[groupids[i]-1][groupids[j]-1] += (*distances)[i][j];
> > (gdb) where
> > #0 0x00000000004119de in hwloc__setup_misc_level_from_distances (topology=0x531370, nbobjs=63, objs=0x7fffffffd450, _distances=0x7fffffff9630, depth=0)
> > at topology.c:272
> > #1 0x0000000000411f99 in hwloc_setup_misc_level_from_distances (topology=0x531370, nbobjs=63, objs=0x7fffffffd450, _distances=0x7fffffff9630,
> > _distance_indexes=0x7fffffff9520) at topology.c:339
> > #2 0x00000000004206df in look_sysfsnode (topology=0x531370, path=0x4276ea "/sys/devices/system/node", found=0x7fffffffd864) at topology-linux.c:1870
> > #3 0x0000000000423161 in hwloc_look_linux (topology=0x531370) at topology-linux.c:2633
> > #4 0x000000000041497b in hwloc_discover (topology=0x531370) at topology.c:1513
> > #5 0x0000000000415d88 in hwloc_topology_load (topology=0x531370) at topology.c:2163
> > #6 0x0000000000403c7a in main (argc=1, argv=0x7fffffffde28) at lstopo.c:434
> >
> > The immediate cause of the SIGSEGV is that groupids[28] equals zero,
> > so topology.c:272 evaluates to groupdistances[0][-1] += ...
> > The groupids array is set via hwloc_setup_group_from_min_distance().
> >
> > hwloc 1.1.1 behaves the same. hwloc 1.0.2 works on this machine, but
> > its topology then misses all HWLOC_GROUP objects, which are usually
> > there when the machine is OK.
> >
> > Even though the machine has to be repaired, I'm wondering whether hwloc
> > can be hardened against missing components on such large machines. This
> > particular system seems to work fine even with a part missing, as long
> > as one does not use the CPUs that appear to have no directly attached
> > memory.
> >
> > Attached you find a tar file with the following additional information:
> >
> > - lstopo-1.1.huv04.output: lstopo output (--enable-debug) of the malfunctioning machine, up to the SIGSEGV
> > - lstopo-1.1.huv01.output: lstopo output (--enable-debug) of a similar machine that is OK
> > - lstopo-1.1.huv01.xml: lstopo XML output of a similar machine that is OK
> >
> > I may upload the hwloc-gather-topology information somewhere (about 600
> > kByte, which is too big to attach to emails to your mail server).
> >
>
> Hello Bernd,
>
> We support machines with sparse NUMA numbers. But I am not sure I have
> tested the grouping code on such machines. If you set
> HWLOC_IGNORE_DISTANCES=1 in your environment, does hwloc work? If so, we
> just need some fixes inside the grouping code.
When setting HWLOC_IGNORE_DISTANCES=1, hwloc 1.1 does not crash on this
machine, but produces a somewhat unusual topology. By the way, the same
topology error appears when applying a trivial fix to the grouping code,
namely:
--- src/topology.c.bak	2010-11-25 15:54:33.000000000 +0100
+++ src/topology.c	2011-02-07 10:55:14.000000000 +0100
@@ -269,6 +269,7 @@
   memset(groupdistances, 0, sizeof(groupdistances));
   for(i=0; i<nbobjs; i++)
     for(j=0; j<nbobjs; j++)
+      if(groupids[i] && groupids[j])
       groupdistances[groupids[i]-1][groupids[j]-1] += (*distances)[i][j];
   for(i=0; i<nbgroups; i++)
     for(j=0; j<nbgroups; j++)
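For reference, the failure mode this guard avoids can be modeled in isolation. The following is a standalone sketch, not hwloc code, and assumes (as the crash suggests) that groupids[] uses 0 for "not assigned to any group" and 1..nbgroups for real groups:

```c
#include <assert.h>

#define NBOBJS   3
#define NBGROUPS 2

/* Standalone model of the accumulation at topology.c:272. When an
 * object stays ungrouped (groupid 0), groupids[x]-1 becomes -1 and the
 * unguarded version writes outside groupdistances[][]; the one-line
 * guard from the diff above simply skips such objects. */
static void accumulate(const int groupids[NBOBJS],
                       const float distances[NBOBJS][NBOBJS],
                       float groupdistances[NBGROUPS][NBGROUPS])
{
  int i, j;
  for (i = 0; i < NBOBJS; i++)
    for (j = 0; j < NBOBJS; j++)
      if (groupids[i] && groupids[j])
        groupdistances[groupids[i]-1][groupids[j]-1] += distances[i][j];
}
```

With groupids[] = {1, 0, 2} (object 1 ungrouped, like the lost node 29), only the distances among objects 0 and 2 are accumulated.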
The topology error is: the first NODE object contains two sockets instead
of one. The additional socket contains the processors of the lost node 29.
I'm not sure how to deal with that, since /sys/devices/system/cpu is
inconsistent with /sys/devices/system/node on this machine. A naive idea
would be to implement such a consistency check. On this machine the node
ID of a cpu is found in
/sys/devices/system/cpu/cpu*/topology/physical_package_id
One could check whether this node has a corresponding
/sys/devices/system/node/nodeXXX entry, and ignore the cpu if it does
not. But I'm not sure whether this directory layout is standard.
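The check could be sketched like this (standalone, not hwloc code; the function name and the idea of passing the sysfs root are my own, and the node id would come from physical_package_id as described above):

```c
#include <stdio.h>
#include <unistd.h>

/* Given the node id a cpu claims (read from its physical_package_id
 * attribute) and a sysfs root, return 1 if the corresponding node
 * directory exists, 0 otherwise. A cpu whose claimed node has no
 * directory would then be ignored during discovery. */
static int node_dir_exists(const char *sysroot, int node)
{
  char path[256];
  snprintf(path, sizeof(path),
           "%s/devices/system/node/node%d", sysroot, node);
  return access(path, F_OK) == 0;
}
```

On the broken machine, node_dir_exists("/sys", 29) would return 0, so the CPUs of node 29 would be dropped instead of ending up in a bogus extra socket.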
I'll send the gather-topology info soon.
Regards BK
> Feel free to send the gather-topology output to me in private. I'll
> work on fixing this.
>
> Brice
>
> _______________________________________________
> hwloc-devel mailing list
> hwloc-devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel

Dr. Bernd Kallies
Konrad-Zuse-Zentrum für Informationstechnik Berlin
Takustr. 7
14195 Berlin
Tel: +49-30-84185-270
Fax: +49-30-84185-311
email: kallies_at_[hidden]
