
Subject: Re: [hwloc-devel] Bug report: topology strange on SGI UltraViolet
From: Brice Goglin (Brice.Goglin_at_[hidden])
Date: 2010-07-28 10:37:41


On 28/07/2010 16:21, Bernd Kallies wrote:
> We just got an SGI UltraViolet rack containing 48 NUMA nodes, each with
> one octo-core Nehalem socket and SMT enabled. Essentially the machine
> is a big shared-memory system, similar to what SGI had with their
> Itanium-based Altix 4700.
>
> OS is SLES11 (2.6.32.12-0.7.1.1381.1.PTF-default x86_64). I used
> hwloc-1.0.2 compiled with gcc.
>
> The lstopo output looks a bit strange. The full output of lstopo is
> attached. It begins with
>
> Machine (1534GB)
>   Group4 #0 (1022GB)
>     Group3 #0 (510GB)
>       Group2 #0 (254GB)
>         Group1 #0 (126GB)
>           Group0 #0 (62GB)
>             NUMANode #0 (phys=0 30GB) + Socket #0 + L3 #0 (24MB)
>               L2 #0 (256KB) + L1 #0 (32KB) + Core #0
>                 PU #0 (phys=0)
>                 PU #1 (phys=384)
>               L2 #1 (256KB) + L1 #1 (32KB) + Core #1
>                 PU #2 (phys=1)
>                 PU #3 (phys=385)
>               L2 #2 (256KB) + L1 #2 (32KB) + Core #2
>               ...
>             NUMANode #1 (phys=1 32GB) + Socket #1 + L3 #1 (24MB)
>               L2 #8 (256KB) + L1 #8 (32KB) + Core #8
>                 PU #16 (phys=8)
>                 PU #17 (phys=392)
>               L2 #9 (256KB) + L1 #9 (32KB) + Core #9
>               ...
>
> The output essentially says that there are 48 NUMA nodes with 8 cores
> each. Each NUMA node contains 32 GB of memory, except the first one,
> which contains 30 GB. Two NUMA nodes are grouped together as "Group0",
> two "Group0" are grouped together as "Group1", and so on. There are
> three "Group3" objects; the first one contains 16 NUMA nodes with
> 510 GB, the remaining two contain 16 NUMA nodes with 512 GB each. Up to
> this point the topology is understandable. What I'm wondering about is
> "Group4", which contains the three "Group3" objects. lstopo should
> print "1534GB" for it (510 + 512 + 512) instead of "1022GB". There is
> only one "Group4" object, and there are no other direct children of the
> root object.
>

Indeed, there's something wrong.
Can you send the output of tests/linux/gather_topology.sh so that I can
try to debug this from here?

> Moreover, when applications that use the hwloc API call functions like
> hwloc_get_next_obj_by_depth or hwloc_get_obj_by_depth and then call
> hwloc_topology_destroy, or even free() on some self-allocated memory,
> the app fails at that point with
>
> *** glibc detected *** a.out: double free or corruption (out).
> or
> *** glibc detected *** a.out: free(): invalid next size (fast):
>

Can you send an example as well?
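
To illustrate what would help: a minimal self-contained program along
these lines should be enough, if it still triggers the glibc error on
your machine. This is only a sketch written from your description
(topology load, a walk over the objects by depth, then a free() of some
self-allocated memory and hwloc_topology_destroy), not your actual code,
so the exact calls may of course differ:

#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    hwloc_topology_t topology;
    hwloc_obj_t obj;
    unsigned depth, i;
    char *buf;

    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);

    /* walk every object at every depth, as your application does */
    for (depth = 0; depth < hwloc_topology_get_depth(topology); depth++) {
        for (i = 0; i < hwloc_get_nbobjs_by_depth(topology, depth); i++) {
            obj = hwloc_get_obj_by_depth(topology, depth, i);
            printf("depth %u, index %u: %s\n", depth, i,
                   hwloc_obj_type_string(obj->type));
        }
    }

    /* some self-allocated memory, freed after the walk */
    buf = malloc(1024);
    free(buf);

    hwloc_topology_destroy(topology);
    return 0;
}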

thanks,
Brice