Hardware Locality Development Mailing List Archives

Subject: Re: [hwloc-devel] hwloc-1.1 crash when missing a NUMAnode
From: Brice Goglin (Brice.Goglin_at_[hidden])
Date: 2011-02-07 08:57:09


On 07/02/2011 14:34, Bernd Kallies wrote:
> Hello,
>
> we currently have some large SMP systems (SGI Ultraviolet, 64 NUMA nodes
> and 1024 logical processors per OS instance).
> After a reboot of one of them, the system came up with the memory of one
> node missing: one of the pseudo-directories
> /sys/devices/system/node/nodeXXX is absent (node29 in this case). All
> CPUs (even those of node29) are found in /proc/cpuinfo and
> /sys/devices/system/cpu/cpuXXX. It is currently not clear whether this
> is a hardware, BIOS, or kernel issue.
>
> In any event, applications based on hwloc-1.1 and hwloc-1.1.1 crash
> with SIGSEGV while loading the topology of this machine. They run within
> the Linux root cpuset, which contains cpus 0-1023 (all CPUs of the
> system) and mems 0-28,30-63 (node 29 is missing).
>
> Here is the backtrace from lstopo version 1.1:
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x00000000004119de in hwloc__setup_misc_level_from_distances (topology=0x531370, nbobjs=63, objs=0x7fffffffd450, _distances=0x7fffffff9630, depth=0)
> at topology.c:272
> 272 groupdistances[groupids[i]-1][groupids[j]-1] += (*distances)[i][j];
> (gdb) where
> #0 0x00000000004119de in hwloc__setup_misc_level_from_distances (topology=0x531370, nbobjs=63, objs=0x7fffffffd450, _distances=0x7fffffff9630, depth=0)
> at topology.c:272
> #1 0x0000000000411f99 in hwloc_setup_misc_level_from_distances (topology=0x531370, nbobjs=63, objs=0x7fffffffd450, _distances=0x7fffffff9630,
> _distance_indexes=0x7fffffff9520) at topology.c:339
> #2 0x00000000004206df in look_sysfsnode (topology=0x531370, path=0x4276ea "/sys/devices/system/node", found=0x7fffffffd864) at topology-linux.c:1870
> #3 0x0000000000423161 in hwloc_look_linux (topology=0x531370) at topology-linux.c:2633
> #4 0x000000000041497b in hwloc_discover (topology=0x531370) at topology.c:1513
> #5 0x0000000000415d88 in hwloc_topology_load (topology=0x531370) at topology.c:2163
> #6 0x0000000000403c7a in main (argc=1, argv=0x7fffffffde28) at lstopo.c:434
>
> The immediate cause of the SIGSEGV is that groupids[28] equals zero,
> causing topology.c:272 to evaluate groupdistances[0][-1] += ...
> The groupids array is set via hwloc_setup_group_from_min_distance().
>
> hwloc-1.1.1 behaves the same. hwloc-1.0.2 works on this machine, but
> the resulting topology is missing all the HWLOC_GROUP objects that are
> usually present when the machine is OK.
>
> Although the machine has to be repaired, I'm wondering whether hwloc
> can be hardened against missing components on such large machines. This
> particular system seems to work fine, even with a part missing, as long
> as one does not use the CPUs that appear to have no directly attached
> memory.
>
> Attached is a tar file with the following additional information:
>
> - lstopo-1.1.huv04.output: lstopo output (--enable-debug) for the malfunctioning machine until SIGSEGV
> - lstopo-1.1.huv01.output: lstopo output (--enable-debug) for a similar machine, which is OK
> - lstopo-1.1.huv01.xml: lstopo xml output for a similar machine, which is OK
>
> I can upload the hwloc-gather-topology output somewhere (about 600 KB,
> which is too large to attach to emails to your mail server).
>

Hello Bernd,

We do support machines with sparse NUMA node numbering, but I am not
sure I have ever tested the grouping code on such machines. If you set
HWLOC_IGNORE_DISTANCES=1 in your environment, does hwloc work? If so, we
just need some fixes inside the grouping code.
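
For illustration, here is the kind of guard I have in mind for the
accumulation loop at topology.c:272 (an untested sketch; I am guessing
at the exact loop around that line, with i, j, nbobjs, groupids,
groupdistances and distances as in the existing function, so the real
fix may look different):

    /* Sketch: skip objects that got no group. You reported that
     * hwloc_setup_group_from_min_distance() left groupids[28] at 0,
     * so the groupids[...]-1 subscript ends up indexing row/column -1
     * of groupdistances for that object. */
    for (i = 0; i < nbobjs; i++)
      for (j = 0; j < nbobjs; j++) {
        if (!groupids[i] || !groupids[j])
          continue; /* ungrouped object, don't accumulate its distance */
        groupdistances[groupids[i]-1][groupids[j]-1] += (*distances)[i][j];
      }

Whether silently skipping such objects or giving up on grouping entirely
is the right behavior is something I will decide once I look at the
actual code.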

Feel free to send the gather-topology output to me in private. I'll work
on fixing this.

Brice