Hardware Locality Development Mailing List Archives

Subject: Re: [hwloc-devel] hwloc-1.1 crash when missing a NUMAnode
From: Bernd Kallies (kallies_at_[hidden])
Date: 2011-02-07 09:40:50


On Mon, 2011-02-07 at 14:57 +0100, Brice Goglin wrote:
> On 07/02/2011 14:34, Bernd Kallies wrote:
> > Hello,
> >
> > we currently have some large SMP systems (SGI Ultraviolet, 64 NUMA nodes
> > and 1024 logical procs per OS instance).
> > After a reboot of one of them, the system came up with the memory of one
> > node missing. Specifically, one of the pseudo
> > directories /sys/devices/system/node/nodeXXX is missing (node29 in this
> > case). All CPUs (even those of node29) are found in /proc/cpuinfo
> > and /sys/devices/system/cpu/cpuXXX. It is currently not
> > clear whether this is a hardware, BIOS or kernel issue.
> >
> > In any event, applications based on hwloc-1.1 and hwloc-1.1.1 crash
> > with SIGSEGV while loading the topology of this machine. They run within
> > the Linux root cpuset, which contains cpus 0-1023 (all CPUs of the
> > system) and mems 0-28,30-63 (node 29 is missing).
> >
> > Here is the traceback of lstopo version 1.1:
> >
> > Program received signal SIGSEGV, Segmentation fault.
> > 0x00000000004119de in hwloc__setup_misc_level_from_distances (topology=0x531370, nbobjs=63, objs=0x7fffffffd450, _distances=0x7fffffff9630, depth=0)
> > at topology.c:272
> > 272 groupdistances[groupids[i]-1][groupids[j]-1] += (*distances)[i][j];
> > (gdb) where
> > #0 0x00000000004119de in hwloc__setup_misc_level_from_distances (topology=0x531370, nbobjs=63, objs=0x7fffffffd450, _distances=0x7fffffff9630, depth=0)
> > at topology.c:272
> > #1 0x0000000000411f99 in hwloc_setup_misc_level_from_distances (topology=0x531370, nbobjs=63, objs=0x7fffffffd450, _distances=0x7fffffff9630,
> > _distance_indexes=0x7fffffff9520) at topology.c:339
> > #2 0x00000000004206df in look_sysfsnode (topology=0x531370, path=0x4276ea "/sys/devices/system/node", found=0x7fffffffd864) at topology-linux.c:1870
> > #3 0x0000000000423161 in hwloc_look_linux (topology=0x531370) at topology-linux.c:2633
> > #4 0x000000000041497b in hwloc_discover (topology=0x531370) at topology.c:1513
> > #5 0x0000000000415d88 in hwloc_topology_load (topology=0x531370) at topology.c:2163
> > #6 0x0000000000403c7a in main (argc=1, argv=0x7fffffffde28) at lstopo.c:434
> >
> > The particular reason for the SIGSEGV is that groupids[28] equals zero,
> > which makes topology.c:272 evaluate to groupdistances[0][-1] += ...
> > The groupids array is set via hwloc_setup_group_from_min_distance().
> >
> > hwloc-1.1.1 behaves the same. hwloc-1.0.2 works on this machine, but
> > the hwloc topology is missing all HWLOC_GROUP objects, which are usually
> > present when the machine is OK.
> >
> > Even though the machine has to be repaired, I'm wondering whether hwloc
> > can be hardened against missing components on such large machines. This
> > particular system seems to work fine, even with a part missing, as long
> > as one does not use the CPUs that appear to have no directly attached
> > memory.
> >
> > Attached you find a tar file with the following additional information:
> >
> > - lstopo-1.1.huv04.output: lstopo output (--enable-debug) for the malfunctioning machine until SIGSEGV
> > - lstopo-1.1.huv01.output: lstopo output (--enable-debug) for a similar machine, which is OK
> > - lstopo-1.1.huv01.xml: lstopo xml output for a similar machine, which is OK
> >
> > I can upload the hwloc-gather-topology information somewhere (about 600
> > kByte, which is too big to attach to emails sent to your mail server).
> >
>
> Hello Bernd,
>
> We support machines with sparse NUMA numbers. But I am not sure I have
> tested the grouping code on such machines. If you set
> HWLOC_IGNORE_DISTANCES=1 in your environment, does hwloc work? If so, we
> just need some fixes inside the grouping code.

With HWLOC_IGNORE_DISTANCES=1 set, hwloc-1.1 does not crash on this
machine, but it produces a somewhat unusual topology. By the way, the same
topology error appears when applying a trivial fix to the grouping code,
namely:

--- src/topology.c.bak 2010-11-25 15:54:33.000000000 +0100
+++ src/topology.c 2011-02-07 10:55:14.000000000 +0100
@@ -269,6 +269,7 @@
       memset(groupdistances, 0, sizeof(groupdistances));
       for(i=0; i<nbobjs; i++)
           for(j=0; j<nbobjs; j++)
+            if(groupids[i] && groupids[j])
               groupdistances[groupids[i]-1][groupids[j]-1] += (*distances)[i][j];
       for(i=0; i<nbgroups; i++)
           for(j=0; j<nbgroups; j++)
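
To illustrate the failure mode described above outside of hwloc, here is a
minimal standalone sketch (toy data, not hwloc code) of the accumulation
loop from topology.c:272: one object is left with groupid 0, exactly as
happens for the CPUs of the missing node, and the added guard is what keeps
the index from going out of bounds:

#include <stdio.h>

/* Toy data: 3 objects forming 1 group; object 1 was never assigned
 * a group (groupids[1] == 0), like the objects of the missing node. */
int main(void)
{
  unsigned nbobjs = 3;
  unsigned groupids[3] = { 1, 0, 1 };   /* 0 means "no group" */
  float distances[3][3] = {
    { 10, 20, 20 },
    { 20, 10, 20 },
    { 20, 20, 10 },
  };
  float groupdistances[1][1] = { { 0 } };
  unsigned i, j;

  for (i = 0; i < nbobjs; i++)
    for (j = 0; j < nbobjs; j++)
      /* without this guard, groupids[1]-1 underflows and the access
       * lands far outside groupdistances, as in the reported SIGSEGV */
      if (groupids[i] && groupids[j])
        groupdistances[groupids[i]-1][groupids[j]-1] += distances[i][j];

  printf("accumulated intra-group distance: %f\n", groupdistances[0][0]);
  return 0;
}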

The topology error is that the first NODE object contains two sockets
instead of one. The additional socket contains the processors of the lost
node 29.
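
For reference, the miscounted sockets can also be observed programmatically.
Below is a rough sketch against the hwloc 1.1 API (using the hwloc/helper.h
helpers) that prints, for each NODE object, the number of SOCKET objects
whose cpuset lies inside it; on the broken machine the first node reports
two sockets:

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
  hwloc_topology_t topo;
  int i, nbnodes;

  hwloc_topology_init(&topo);
  hwloc_topology_load(topo);

  /* count the sockets contained in each NUMA node object */
  nbnodes = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NODE);
  for (i = 0; i < nbnodes; i++) {
    hwloc_obj_t node = hwloc_get_obj_by_type(topo, HWLOC_OBJ_NODE, i);
    int nsockets = hwloc_get_nbobjs_inside_cpuset_by_type(topo, node->cpuset,
                                                          HWLOC_OBJ_SOCKET);
    printf("node P#%u contains %d socket(s)\n", node->os_index, nsockets);
  }

  hwloc_topology_destroy(topo);
  return 0;
}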

I'm not sure how to deal with that, since /sys/devices/system/cpu is
inconsistent with /sys/devices/system/node on this machine. A naive idea
would be to implement a consistency check. On this machine the node ID of
a cpu is found in
/sys/devices/system/cpu/cpu*/topology/physical_package_id.
One may check whether this node has a corresponding
/sys/devices/system/node/nodeXXX entry, and ignore a cpu if it does not.
But I'm not sure whether this directory layout is in any way standard.
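
Just to make that idea concrete, a rough standalone sketch of such a check
(not hwloc code; it simply follows the layout described above, i.e. it
treats physical_package_id as the node number the way this particular
machine reports it):

#include <stdio.h>
#include <sys/stat.h>

/* Proposed consistency check for one logical CPU: read the id from
 * topology/physical_package_id (which, on this machine, happens to match
 * the NUMA node number) and verify that the corresponding nodeXXX
 * directory exists. Returns 1 if the CPU should be kept, 0 if ignored. */
static int cpu_has_sysfs_node(unsigned cpu)
{
  char path[256];
  FILE *f;
  int id;
  struct stat st;

  snprintf(path, sizeof(path),
           "/sys/devices/system/cpu/cpu%u/topology/physical_package_id", cpu);
  f = fopen(path, "r");
  if (!f || fscanf(f, "%d", &id) != 1) {
    if (f) fclose(f);
    return 1; /* cannot tell, keep the CPU */
  }
  fclose(f);

  snprintf(path, sizeof(path), "/sys/devices/system/node/node%d", id);
  return stat(path, &st) == 0; /* ignore the CPU if its node directory is gone */
}

int main(void)
{
  printf("cpu 0: %s\n", cpu_has_sysfs_node(0) ? "keep" : "ignore");
  return 0;
}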

I'll send the gather-topology info soon.

Regards BK

> Feel free to send the gather-topology output to me in private. I'll
> work
> on fixing this.
>
> Brice
>
> _______________________________________________
> hwloc-devel mailing list
> hwloc-devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel

-- 
Dr. Bernd Kallies
Konrad-Zuse-Zentrum für Informationstechnik Berlin
Takustr. 7
14195 Berlin
Tel: +49-30-84185-270
Fax: +49-30-84185-311
e-mail: kallies_at_[hidden]