Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5
From: Gus Correa (gus_at_[hidden])
Date: 2014-03-03 17:02:38


Hi Brice

Here are answers to your questions,
and my latest attempt to solve the problem:

1) Kernel version:

The nodes with new motherboards (node14 and node16) have the
same kernel as the nodes with original motherboards (e.g. node15),
as they were cloned from the same node image:

[root_at_node14 ~]# uname -a
Linux node14 2.6.32-358.2.1.el6.x86_64 #1 SMP Wed Mar 13 00:26:49 UTC
2013 x86_64 x86_64 x86_64 GNU/Linux

[root_at_node16 ~]# uname -a
Linux node16 2.6.32-358.2.1.el6.x86_64 #1 SMP Wed Mar 13 00:26:49 UTC
2013 x86_64 x86_64 x86_64 GNU/Linux

[root_at_node15 ~]# uname -a
Linux node15 2.6.32-358.2.1.el6.x86_64 #1 SMP Wed Mar 13 00:26:49 UTC
2013 x86_64 x86_64 x86_64 GNU/Linux

**

2) BIOS setup

Besides having different BIOS versions (AMI 3.5 on the new motherboards
vs. 3.0 on the old ones), there are slight differences in the BIOS setup.
However, the setup is identical on node14, which had the hwloc problem,
and on node16, which didn't. So I am inclined to think that any
differences in BIOS setup are unlikely to be the cause.

The only item in the BIOS settings that I think might tangentially
affect this is under Advanced->Processor and Clock Settings, where the
new motherboards set:

PowerNow = enabled
C-state mode = C6
Power Cap = P-state 0
HPC mode = disabled

whereas the old motherboards have

PowerNow=disabled
[and the other three items above are hidden because of this setting]

Do you think this may cause the hwloc problem?

There are other minor differences in the BIOS setup (IDE config,
remote access, etc.), which I will remove, but they are probably not
relevant.
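
Whether PowerNow is actually driving the cores can also be checked from
the OS side (a quick sketch, assuming the kernel exposes the standard
cpufreq sysfs files; the cpufreq directory is typically absent when
PowerNow is disabled in the BIOS):

[root_at_node14 ~]# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
[root_at_node14 ~]# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor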

**

3) dmesg | grep SRAT

I attach the results.
They are identical on nodes 14 and 16, and differ from node15
only on the first line:

[gus_at_galera ~]$ diff node14_dmesg_grep_SRAT node15_dmesg_grep_SRAT
1c1
< ACPI: SRAT 00000000dfeaa700 00320 (v02 AMD AGESA 00000001 AMD 00000001)
---
> ACPI: SRAT 00000000dfeaa6f0 00320 (v02 AMD AGESA 00000001 AMD 00000001)
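
So only the table's physical address changed, not its contents. As a
quick sketch (assuming the usual "SRAT: PXM x -> APIC y -> Node z"
lines in dmesg), the distinct NUMA nodes the kernel saw can be listed
with:

[root_at_node14 ~]# dmesg | grep 'SRAT: PXM' | awk '{print $NF}' | sort -u

which should print 0 through 3 on a healthy node, and only 0 and 1 on a
broken one.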

**

4) Cleaned/reseated processors, rebooted node14, ran
hwloc-gather-topology again.

I opened node14, cleaned and re-seated the processors and heatsinks.
I can't see anything out of the ordinary there.
I rebooted the node and ran hwloc-gather-topology again.
This time it didn't throw any errors on the terminal window,
which may be a good sign.

[root_at_node14 ~]# hwloc-gather-topology /tmp/`date +"%Y%m%d%H%M"`.$(uname -n)
Hierarchy gathered in /tmp/201403031639.node14.tar.bz2 and kept in
/tmp/tmp.FM97IQCCKc/201403031639.node14/
Expected topology output stored in /tmp/201403031639.node14.output

I attach the diagnostic files.
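
In case it is useful, I understand the gathered snapshot can be replayed
offline with lstopo (a sketch; this assumes a Linux build of hwloc that
accepts a gathered /sys+/proc directory via --input):

$ tar xjf 201403031639.node14.tar.bz2
$ lstopo --input ./201403031639.node14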
Was the problem fixed by the processor re-seating, or is it still there?

You characterized the hwloc error before as follows:
On 02/28/2014 03:23 PM, Brice Goglin wrote:
> OK, the problem is that node14's BIOS reports invalid NUMA info. It
> properly detects 2 sockets with 16 cores each. But it reports 2 NUMA
> nodes total, instead of 2 per socket (4 total). And hwloc warns because
> the cores contained in these NUMA nodes are incompatible with the sockets:
> socket0 contains 0-15
> socket1 contains 16-31
> NUMA node0 contains 0-7+16-23
> NUMA node1 contains 8-15+24-31
>
After re-seating the processors, when I run lstopo on node14
it now shows four NUMA nodes:

NUMA node L#0 with cores 0-7
NUMA node L#1 with cores 8-15
NUMA node L#2 with cores 16-23
NUMA node L#3 with cores 24-31

Is the lstopo output all I need to check?
Or do I need to sweep the /sys subdirectories to see if the topology is
consistent? Which /sys subdirectories should I check?
Or, alternatively, which files in the hwloc-gather-topology output?
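
My guess, in case those are the right places, would be something like
this (a sketch, assuming the standard Linux sysfs layout):

[root_at_node14 ~]# for n in /sys/devices/system/node/node[0-9]*; do \
      echo "$n: $(cat $n/cpulist)"; done
[root_at_node14 ~]# cat /sys/devices/system/cpu/cpu*/topology/physical_package_id \
      | sort -n | uniq -c

The first should show each NUMA node with an 8-core cpulist; the second
should show 16 cores per socket id.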

Many thanks for your help,
Gus Correa

On 02/28/2014 03:53 PM, Brice Goglin wrote:
> On 28/02/2014 21:30, Gus Correa wrote:
>> Hi Brice
>>
>> The (pdf) output of lstopo shows one L1d (16k) for each core,
>> and one L1i (64k) for each *pair* of cores.
>> Is this wrong?
>
> It's correct. AMD uses this "dual-core compute unit" where L2 and L1i
> are shared but L1d isn't.
>
>> BTW, if there are any helpful web links, or references, or graphs
>> about the AMD cache structure, I would love to know.
>
> I don't have a single place to find all the information, unfortunately.
> Cache sizes are easy to find, but sharing isn't always specified. I often
> end up reading early processor reviews on tech sites such as
> http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested
>
>> I am a bit skeptical that the BIOS is the culprit because I replaced
>> two motherboards (node14 and node16), and only node14 doesn't pass
>> the hwloc-gather-topology test.
>> Just in case, I attach the diagnostic for node16 also,
>
> Hmmm, that's very interesting. I assume you have the same kernels on all
> these machines?
> I have seen a couple of cases where the kernel would change the topology
> for the same BIOS version (for instance, old kernels didn't know
> that L1i is shared by pairs of cores on your CPU), but I have never seen
> a case where the kernel changes and *breaks* things.
>
> Can you compare the output of "dmesg | grep SRAT" (or grep SRAT
> /var/log/dmesg or kern.log or whatever on your distro) on these nodes?
> SRAT is the hardware table that the kernel reads before filling sysfs.
> You'll see
> [ 0.000000] SRAT: PXM 0 -> APIC 0x07 -> Node 0
> which basically means that CPU7 is close to NUMA node 0.
>
> If you only see Nodes 0-1 on node14, and Nodes 0-3 on node15 and node16,
> that would at least confirm that the bug is in the hardware.
>
> One last idea could be a different BIOS config, with the BIOS being buggy
> only in one of these configs. I've seen that with the "interleaved" NUMA
> memory config in a Supermicro BIOS several years ago.
>
> Brice
>
>> if you want to take a look. :)
>>
>> FYI, the two new motherboards (nodes 14 and 16)
>> have a *newer* BIOS version (AMI, version 3.5, 11/25/2013)
>> than the one in the original nodes (node15 below)
>> (AMI, version 3.0, 08/31/2012).
>> I even thought of upgrading the old nodes' BIOSes ...
>> ... but now I am not so sure about this ... :(
>>
>> New motherboards:
>>
>> [root_at_node14 ~]# dmidecode -s bios-vendor
>> American Megatrends Inc.
>> [root_at_node14 ~]# dmidecode -s bios-version
>> 3.5
>> [root_at_node14 ~]# dmidecode -s bios-release-date
>> 11/25/2013
>>
>> **
>>
>> [root_at_node16 ~]# dmidecode -s bios-vendor
>> American Megatrends Inc.
>> [root_at_node16 ~]# dmidecode -s bios-version
>> 3.5
>> [root_at_node16 ~]# dmidecode -s bios-release-date
>> 11/25/2013
>>
>> **
>>
>> Original motherboard:
>>
>> [root_at_node15 ~]# dmidecode -s bios-vendor
>> American Megatrends Inc.
>> [root_at_node15 ~]# dmidecode -s bios-version
>> 3.0
>> [root_at_node15 ~]# dmidecode -s bios-release-date
>> 08/31/2012
>>
>> **
>>
>> Thanks again for your help and advice.
>>
>> Gus Correa
>>