Hardware Locality Users' Mailing List Archives

Subject: Re: [hwloc-users] misleading cache size on AMD Opteron 6348?
From: Brice Goglin (Brice.Goglin_at_[hidden])
Date: 2014-06-11 15:20:23


Changing the hwloc version is unlikely to make much difference for this
hardware bug. Since your hardware/BIOS looks buggy, all we can do is
create a valid XML topology that you can force hwloc to load, overriding
the normal hardware-based discovery.
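
As a rough illustration of forcing the XML, the sketch below launches a command with hwloc pointed at a corrected topology file. It assumes the documented hwloc environment variables `HWLOC_XMLFILE` and `HWLOC_THISSYSTEM`; the helper names and the file path are hypothetical, and the corrected XML itself would still have to be produced separately (for instance by exporting with `lstopo` and fixing the L3 cpusets by hand).

```python
import os
import subprocess


def hwloc_xml_env(xml_path, base_env=None):
    """Return a copy of the environment that points hwloc at an XML topology."""
    env = dict(os.environ if base_env is None else base_env)
    # HWLOC_XMLFILE makes hwloc load the topology from this file instead of
    # discovering it from the (buggy) hardware/BIOS information.
    env["HWLOC_XMLFILE"] = xml_path
    # HWLOC_THISSYSTEM=1 tells hwloc that the XML describes the machine it is
    # running on, so binding requests are still honoured.
    env["HWLOC_THISSYSTEM"] = "1"
    return env


def run_with_fixed_topology(command, xml_path):
    """Run `command` (a list of arguments) with the corrected topology."""
    return subprocess.run(command, env=hwloc_xml_env(xml_path), check=True)
```

For example, `run_with_fixed_topology(["lstopo"], "/path/to/fixed-topology.xml")` should display the topology from the XML rather than from the BIOS data, provided the file exists and hwloc is installed.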

Brice

On 11/06/2014 21:16, Yury Vorobyov wrote:
> I do not see a big difference... This time I used the upstream release
> of hwloc (not the live git version).
>
> $ lstopo
> ****************************************************************************
> * hwloc has encountered what looks like an error from the operating system.
> *
> * L3 (P#6 cpuset 0x000003f0) intersects with NUMANode (P#0 cpuset 0x0000003f) without inclusion!
> * Error occurred in topology.c line 940
> *
> * Please report this error message to the hwloc user's mailing list,
> * along with the output from the hwloc-gather-topology script.
> ****************************************************************************
> Machine
> Socket L#0
> NUMANode L#0 (P#0)
> L3 L#0 (6144KB)
> L2 L#0 (2048KB) + L1i L#0 (64KB)
> L1d L#0 (16KB) + Core L#0 + PU L#0 (P#0)
> L1d L#1 (16KB) + Core L#1 + PU L#1 (P#1)
> L2 L#1 (2048KB) + L1i L#1 (64KB)
> L1d L#2 (16KB) + Core L#2 + PU L#2 (P#2)
> L1d L#3 (16KB) + Core L#3 + PU L#3 (P#3)
> L2 L#2 (2048KB) + L1i L#2 (64KB)
> L1d L#4 (16KB) + Core L#4 + PU L#4 (P#4)
> L1d L#5 (16KB) + Core L#5 + PU L#5 (P#5)
> NUMANode L#1 (P#1)
> L2 L#3 (2048KB) + L1i L#3 (64KB)
> L1d L#6 (16KB) + Core L#6 + PU L#6 (P#6)
> L1d L#7 (16KB) + Core L#7 + PU L#7 (P#7)
> L2 L#4 (2048KB) + L1i L#4 (64KB)
> L1d L#8 (16KB) + Core L#8 + PU L#8 (P#8)
> L1d L#9 (16KB) + Core L#9 + PU L#9 (P#9)
> L3 L#1 (6144KB) + L2 L#5 (2048KB) + L1i L#5 (64KB)
> L1d L#10 (16KB) + Core L#10 + PU L#10 (P#10)
> L1d L#11 (16KB) + Core L#11 + PU L#11 (P#11)
> Socket L#1
> NUMANode L#2 (P#2)
> L3 L#2 (6144KB) + L2 L#6 (2048KB) + L1i L#6 (64KB)
> L1d L#12 (16KB) + Core L#12 + PU L#12 (P#12)
> L1d L#13 (16KB) + Core L#13 + PU L#13 (P#13)
> L2 L#7 (2048KB) + L1i L#7 (64KB)
> L1d L#14 (16KB) + Core L#14 + PU L#14 (P#14)
> L1d L#15 (16KB) + Core L#15 + PU L#15 (P#15)
> L2 L#8 (2048KB) + L1i L#8 (64KB)
> L1d L#16 (16KB) + Core L#16 + PU L#16 (P#16)
> L1d L#17 (16KB) + Core L#17 + PU L#17 (P#17)
> NUMANode L#3 (P#3)
> L2 L#9 (2048KB) + L1i L#9 (64KB)
> L1d L#18 (16KB) + Core L#18 + PU L#18 (P#18)
> L1d L#19 (16KB) + Core L#19 + PU L#19 (P#19)
> L3 L#3 (6144KB)
> L2 L#10 (2048KB) + L1i L#10 (64KB)
> L1d L#20 (16KB) + Core L#20 + PU L#20 (P#20)
> L1d L#21 (16KB) + Core L#21 + PU L#21 (P#21)
> L2 L#11 (2048KB) + L1i L#11 (64KB)
> L1d L#22 (16KB) + Core L#22 + PU L#22 (P#22)
> L1d L#23 (16KB) + Core L#23 + PU L#23 (P#23)
> HostBridge L#0
> PCIBridge
> PCI 10de:0f00
> PCIBridge
> PCI 8086:10d3
> PCIBridge
> PCI 8086:10d3
> PCIBridge
> PCI 1002:6889
> PCI 1002:4390
> PCI 1002:439c
>
>
>
> On Tue, Apr 1, 2014 at 1:47 PM, Yury Vorobyov <teupollam_at_[hidden]> wrote:
>
> The current BIOS version could be improperly detecting the CPUs, which
> are engineering samples of the 6348 (all characteristics are the same).
>
>
> On Tue, Apr 1, 2014 at 6:59 PM, Yury Vorobyov <teupollam_at_[hidden]> wrote:
>
> The BIOS is already at the latest version. If I should check some BIOS
> information, I have access to the hardware. Which SMBIOS variables
> do you want to see?
>
>
> On Fri, Jan 31, 2014 at 1:07 PM, Brice Goglin <Brice.Goglin_at_[hidden]> wrote:
>
> Hello,
>
> Your BIOS reports invalid L3 cache information. On these
> processors, each L3 is shared by the 6 cores of an entire
> half-socket NUMA node. But the BIOS says that some L3s are
> shared by 4 cores and others by 6 cores. Worse, it says that
> some L3s are shared by cores from one NUMA node and cores
> from another NUMA node, which causes the error message (and
> these L3s cannot be inserted in the topology).
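
The "intersects without inclusion" condition described above can be sketched as a simple bitmask test. This is only an illustration of the idea, not hwloc's actual code; the function names are made up for the example, and the masks are the ones quoted in the error message.

```python
def parse_cpumask(text):
    """Parse a Linux sysfs CPU mask such as '00000000,000003f0' into an int."""
    # Each comma-separated group is one 32-bit word in hex, most significant
    # first, so dropping the commas yields a single hexadecimal number.
    return int(text.strip().replace(",", ""), 16)


def relation(cache_mask, node_mask):
    """Describe how a cache cpuset relates to a NUMA node cpuset."""
    common = cache_mask & node_mask
    if common == 0:
        return "disjoint"
    if common == cache_mask:
        return "cache inside node"
    if common == node_mask:
        return "node inside cache"
    # Some CPUs are shared, but neither set contains the other: such an
    # object cannot be placed in a tree-shaped topology.
    return "intersects without inclusion"


# L3 P#6 (cpuset 0x000003f0) against NUMANode P#0 (cpuset 0x0000003f).
print(relation(parse_cpumask("00000000,000003f0"),
               parse_cpumask("00000000,0000003f")))
```

A tree-shaped topology needs every pair of objects to be either nested or disjoint, which is why a partially overlapping L3 has to be dropped.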
>
> I see "AMD Eng Sample, ZS268145TCG54_32/26/20_2/16" in the
> processor type, which might explain why your BIOS is
> somewhat experimental. See if you can upgrade it.
>
> Also make sure your kernel isn't too old in case it misses
> L3 info for these processors. At least 3.3 should be OK iirc.
>
> NUMA node sharing info:
> $ cat sys/devices/system/node/node*/cpumap
> 00000000,0000003f
> 00000000,00000fc0
> 00000000,0003f000
> 00000000,00fc0000
> $ cat sys/devices/system/cpu/cpu{?,??}/cache/index3/shared_cpu_map
> 00000000,0000000f << wrong, should be 003f
> 00000000,0000000f << wrong, should be 003f
> 00000000,0000000f << wrong, should be 003f
> 00000000,0000000f << wrong, should be 003f
> 00000000,000003f0 <<impossible, should be 003f
> 00000000,000003f0 <<impossible, should be 003f
> 00000000,000003f0 <<impossible, should be 0fc0
> 00000000,000003f0 <<impossible, should be 0fc0
> 00000000,000003f0 <<impossible, should be 0fc0
> 00000000,000003f0 <<impossible, should be 0fc0
> 00000000,00000c00 <<wrong, should be 0fc0
> 00000000,00000c00 <<wrong, should be 0fc0
> 00000000,00003000 <<wrong, should be 003f000
> 00000000,00003000 <<wrong, should be 003f000
> 00000000,000fc000 <<impossible, should be 003f000
> 00000000,000fc000 <<impossible, should be 003f000
> 00000000,000fc000 <<impossible, should be 003f000
> 00000000,000fc000 <<impossible, should be 003f000
> 00000000,000fc000 <<impossible, should be 0fc0000
> 00000000,000fc000 <<impossible, should be 0fc0000
> 00000000,00f00000 <<wrong, should be 0fc0000
> 00000000,00f00000 <<wrong, should be 0fc0000
> 00000000,00f00000 <<wrong, should be 0fc0000
> 00000000,00f00000 <<wrong, should be 0fc0000
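
The comparison in the listing above can be reproduced mechanically. The sketch below hard-codes the masks quoted in this message and assumes, as stated earlier, that each L3 should cover exactly the CPUs of its NUMA node, and that the entries appear in CPU order (cpu0 to cpu23). The function name and the "wrong"/"impossible" labels simply mirror the annotations; none of this is part of hwloc.

```python
NODE_MAPS = [
    "00000000,0000003f",
    "00000000,00000fc0",
    "00000000,0003f000",
    "00000000,00fc0000",
]

# One shared_cpu_map entry per CPU, cpu0 through cpu23, as listed above.
L3_MAPS = (
    ["00000000,0000000f"] * 4
    + ["00000000,000003f0"] * 6
    + ["00000000,00000c00"] * 2
    + ["00000000,00003000"] * 2
    + ["00000000,000fc000"] * 6
    + ["00000000,00f00000"] * 4
)


def to_int(mask):
    """Convert a comma-separated sysfs hex mask into an integer bitmask."""
    return int(mask.replace(",", ""), 16)


def check_l3(node_maps, l3_maps):
    """Return (cpu, reported, expected, verdict) for each CPU's L3 map."""
    nodes = [to_int(m) for m in node_maps]
    report = []
    for cpu, text in enumerate(l3_maps):
        reported = to_int(text)
        # Assumption: the expected L3 map is the cpumap of the CPU's node.
        expected = next(n for n in nodes if (n >> cpu) & 1)
        if reported == expected:
            verdict = "ok"
        elif reported & ~expected:
            verdict = "impossible"  # the L3 spills into another NUMA node
        else:
            verdict = "wrong"       # the L3 covers only part of the node
        report.append((cpu, reported, expected, verdict))
    return report


for cpu, reported, expected, verdict in check_l3(NODE_MAPS, L3_MAPS):
    print(f"cpu{cpu}: {reported:08x} should be {expected:08x} ({verdict})")
```

On a live system the same lists could be read from the corresponding files under `/sys/devices/system/`, sorted by CPU number rather than by file name.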
>
> Brice
>
>
>
> On 31/01/2014 03:46, Yury Vorobyov wrote:
>> I got an error about "intersecting caches".
>>
>> The hwloc info is attached.
>>
>> I have never seen this before. I use "live" builds of Open MPI
>> directly from the repository.
>>
>>
>> _______________________________________________
>> hwloc-users mailing list
>> hwloc-users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
>
>
> _______________________________________________
> hwloc-users mailing list
> hwloc-users_at_[hidden]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
> Link to this post: http://www.open-mpi.org/community/lists/hwloc-users/2014/06/1039.php