Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5
From: Brice Goglin (Brice.Goglin_at_[hidden])
Date: 2014-02-28 15:53:16


On 28/02/2014 21:30, Gus Correa wrote:
> Hi Brice
>
> The (pdf) output of lstopo shows one L1d (16k) for each core,
> and one L1i (64k) for each *pair* of cores.
> Is this wrong?

It's correct. AMD uses this "dual-core compute unit" where L2 and L1i
are shared but L1d isn't.
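
To cross-check the sharing that lstopo draws, the kernel's own view can be read from sysfs. The following is only a minimal sketch, assuming a Linux system that exposes the usual cache attributes (level, type, size, shared_cpu_list) under /sys/devices/system/cpu; the function name cache_sharing is made up for illustration and is not part of hwloc.

```python
from pathlib import Path


def cache_sharing(cpu_root="/sys/devices/system/cpu"):
    """Map (level, type, size) to the set of distinct CPU lists sharing such a cache."""
    sharing = {}
    # One index* directory per cache visible from each logical CPU.
    for index_dir in sorted(Path(cpu_root).glob("cpu[0-9]*/cache/index[0-9]*")):
        try:
            level = (index_dir / "level").read_text().strip()
            ctype = (index_dir / "type").read_text().strip()
            size = (index_dir / "size").read_text().strip()
            cpus = (index_dir / "shared_cpu_list").read_text().strip()
        except OSError:
            # Skip caches whose attributes are missing or unreadable.
            continue
        sharing.setdefault((level, ctype, size), set()).add(cpus)
    return sharing


if __name__ == "__main__":
    for (level, ctype, size), groups in sorted(cache_sharing().items()):
        print(f"L{level} {ctype:12s} {size:>8s}  shared by: {sorted(groups)}")
```

On a machine like the one described above, one would expect the L1 Data entries to list single CPUs and the L1 Instruction entries to list pairs, provided the kernel is recent enough to report the sharing correctly.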

> BTW, if there are any helpful web links, or references, or graphs
> about the AMD cache structure, I would love to know.

Unfortunately I don't know of a single place with all this information.
Cache sizes are easy to find, but sharing isn't always specified. I often
end up reading early processor reviews on tech sites such as
http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested

> I am a bit skeptical that the BIOS is the culprit because I replaced
> two motherboards (node14 and node16), and only node14 doesn't pass
> the hwloc-gather-topology test.
> Just in case, I attach the diagnostic for node16 also,

Hmmm, that's very interesting. I assume you have the same kernel on all
these machines?
I have seen a couple of cases where the kernel changed the reported
topology for the same BIOS version (for instance, old kernels didn't know
that L1i is shared by pairs of cores on your CPU), but I have never seen
a case where a kernel change *breaks* things.

Can you compare the output of "dmesg | grep SRAT" (or grep SRAT
/var/log/dmesg or kern.log or whatever on your distro) on these nodes?
SRAT is the ACPI table, provided by the BIOS, that the kernel reads
before filling sysfs.
You'll see lines such as
[ 0.000000] SRAT: PXM 0 -> APIC 0x07 -> Node 0
which basically means that the CPU with APIC ID 7 is close to NUMA node 0.

If you only see Nodes 0-1 on node14, and Nodes 0-3 on node15 and node16,
that would at least confirm that the bug is in the hardware (the BIOS
tables) rather than in the kernel.
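
To make that comparison less error-prone, the SRAT lines can be grouped by NUMA node with a few lines of Python. This is only a sketch: it assumes the dmesg lines look exactly like the example above ("SRAT: PXM 0 -> APIC 0x07 -> Node 0"), which may differ between kernel versions, and the function name srat_nodes is made up for illustration.

```python
import re
from collections import defaultdict

# Matches lines such as: "[ 0.000000] SRAT: PXM 0 -> APIC 0x07 -> Node 0"
SRAT_LINE = re.compile(
    r"SRAT: PXM\s+(\d+)\s+->\s+APIC\s+(0x[0-9a-fA-F]+)\s+->\s+Node\s+(\d+)"
)


def srat_nodes(dmesg_text):
    """Return {numa_node: [apic_id, ...]} for every SRAT line found in the text."""
    nodes = defaultdict(list)
    for match in SRAT_LINE.finditer(dmesg_text):
        apic_id = int(match.group(2), 16)  # APIC IDs are printed in hex
        node = int(match.group(3))
        nodes[node].append(apic_id)
    return dict(nodes)
```

Feeding it the saved dmesg output of each machine and comparing sorted(srat_nodes(text)) across node14, node15 and node16 shows directly whether one of them reports fewer NUMA nodes than the others.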

One last idea: the BIOS configuration may differ between the nodes, and
the BIOS may be buggy in only one of these configurations. I saw that
with the "interleaved" NUMA memory setting in a Supermicro BIOS several
years ago.

Brice

> if you want to take a look. :)
>
> FYI, the two new motherboards (nodes 14 and 16)
> have a *newer* BIOS version (AMI, version 3.5, 11/25/2013)
> than the one in the
> original nodes (node15 below) (AMI, version 3.0, 08/31/2012).
> I even thought of upgrading the old nodes' BIOSes ...
> ... but now I am not so sure about this ... :(
>
> New motherboards:
>
> [root_at_node14 ~]# dmidecode -s bios-vendor
> American Megatrends Inc.
> [root_at_node14 ~]# dmidecode -s bios-version
> 3.5
> [root_at_node14 ~]# dmidecode -s bios-release-date
> 11/25/2013
>
> **
>
> [root_at_node16 ~]# dmidecode -s bios-vendor
> American Megatrends Inc.
> [root_at_node16 ~]# dmidecode -s bios-version
> 3.5
> [root_at_node16 ~]# dmidecode -s bios-release-date
> 11/25/2013
>
> **
>
> Original motherboard:
>
> [root_at_node15 ~]# dmidecode -s bios-vendor
> American Megatrends Inc.
> [root_at_node15 ~]# dmidecode -s bios-version
> 3.0
> [root_at_node15 ~]# dmidecode -s bios-release-date
> 08/31/2012
>
> **
>
> Thanks again for your help and advice.
>
> Gus Correa
>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users