Hardware Locality Development Mailing List Archives

Subject: [hwloc-devel] hwloc-1.1 crash when missing a NUMAnode
From: Bernd Kallies (kallies_at_[hidden])
Date: 2011-02-07 08:34:31


Hello,

We currently have some large SMP systems (SGI Ultraviolet, 64 NUMA nodes
and 1024 logical processors per OS instance).
After a reboot of one of them, the system came up with the memory of one
node missing. Specifically, one of the pseudo-directories
/sys/devices/system/node/nodeXXX is absent (node29). All CPUs (even those
of node29) are found in /proc/cpuinfo and /sys/devices/system/cpuXXX. It
is currently not clear whether this is a hardware, BIOS, or kernel issue.

In any event, applications based on hwloc-1.1 and hwloc-1.1.1 crash
with SIGSEGV while loading the topology of this machine. They run within
the Linux root cpuset, which contains cpus 0-1023 (all CPUs of the
system) and mems 0-28,30-63 (node 29 is missing).

Here is the traceback of lstopo version 1.1:

Program received signal SIGSEGV, Segmentation fault.
0x00000000004119de in hwloc__setup_misc_level_from_distances (topology=0x531370, nbobjs=63, objs=0x7fffffffd450, _distances=0x7fffffff9630, depth=0)
    at topology.c:272
272 groupdistances[groupids[i]-1][groupids[j]-1] += (*distances)[i][j];
(gdb) where
#0 0x00000000004119de in hwloc__setup_misc_level_from_distances (topology=0x531370, nbobjs=63, objs=0x7fffffffd450, _distances=0x7fffffff9630, depth=0)
    at topology.c:272
#1 0x0000000000411f99 in hwloc_setup_misc_level_from_distances (topology=0x531370, nbobjs=63, objs=0x7fffffffd450, _distances=0x7fffffff9630,
    _distance_indexes=0x7fffffff9520) at topology.c:339
#2 0x00000000004206df in look_sysfsnode (topology=0x531370, path=0x4276ea "/sys/devices/system/node", found=0x7fffffffd864) at topology-linux.c:1870
#3 0x0000000000423161 in hwloc_look_linux (topology=0x531370) at topology-linux.c:2633
#4 0x000000000041497b in hwloc_discover (topology=0x531370) at topology.c:1513
#5 0x0000000000415d88 in hwloc_topology_load (topology=0x531370) at topology.c:2163
#6 0x0000000000403c7a in main (argc=1, argv=0x7fffffffde28) at lstopo.c:434

The immediate cause of the SIGSEGV is that groupids[28] equals zero,
so topology.c:272 evaluates to groupdistances[0][-1] += ...
The groupids array is set by hwloc_setup_group_from_min_distance().
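
One possible way to harden this accumulation is to skip objects whose group id is zero. The following is only a sketch under my own assumptions, not the actual hwloc code and not a proposed patch: accumulate_group_distances is a hypothetical stand-in for the loop around topology.c:272, the matrices are flat row-major arrays of unsigned values, and a group id of 0 is taken to mean "object not assigned to any group".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the accumulation at topology.c:272.
 * groupids[i] is a 1-based group number, or 0 if object i was not placed
 * in any group (assumption). distances is nbobjs x nbobjs and
 * groupdistances is nbgroups x nbgroups, both flat and row-major.
 * Returns the number of objects that were skipped. */
static unsigned accumulate_group_distances(const unsigned *groupids, size_t nbobjs,
                                           const unsigned *distances,
                                           unsigned *groupdistances, size_t nbgroups)
{
    unsigned skipped = 0;

    assert(groupids != NULL && distances != NULL && groupdistances != NULL);

    for (size_t k = 0; k < nbgroups * nbgroups; k++)
        groupdistances[k] = 0;

    for (size_t i = 0; i < nbobjs; i++) {
        /* Without this check, groupids[i] - 1 wraps to an invalid index. */
        if (groupids[i] == 0 || groupids[i] > nbgroups) {
            skipped++;
            continue;
        }
        for (size_t j = 0; j < nbobjs; j++) {
            if (groupids[j] == 0 || groupids[j] > nbgroups)
                continue;
            groupdistances[(groupids[i] - 1) * nbgroups + (groupids[j] - 1)]
                += distances[i * nbobjs + j];
        }
    }
    return skipped;
}
```

Whether ungrouped objects should be skipped, or whether grouping should be abandoned altogether when any object has no group, is a design decision for the hwloc developers; the sketch only shows that the out-of-bounds write can be avoided.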

hwloc-1.1.1 behaves the same. hwloc-1.0.2 works on this machine, but
the resulting topology lacks all HWLOC_GROUP objects, which are usually
present when the machine is OK.

Although the machine has to be repaired, I am wondering whether hwloc can
be hardened against missing components on such large machines. This
particular system seems to work fine even with a part missing, as long
as one does not use the CPUs that appear to have no directly attached
memory.

Attached is a tar file with the following additional information:

- lstopo-1.1.huv04.output: lstopo output (--enable-debug) for the malfunctioning machine until SIGSEGV
- lstopo-1.1.huv01.output: lstopo output (--enable-debug) for a similar machine, which is OK
- lstopo-1.1.huv01.xml: lstopo xml output for a similar machine, which is OK

I can upload the hwloc-gather-topology output somewhere (about 600
kByte, which is too big to attach to an email sent through your mail
server).

Regards BK

-- 
Dr. Bernd Kallies
Konrad-Zuse-Zentrum für Informationstechnik Berlin
Takustr. 7
14195 Berlin
Tel: +49-30-84185-270
Fax: +49-30-84185-311
e-mail: kallies_at_[hidden]