Hardware Locality Users' Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [hwloc-users] Hwloc error.
From: Brice Goglin (Brice.Goglin_at_[hidden])
Date: 2012-05-30 15:26:38


We don't need any other info on the hwloc side. And we thank you for
testing the big hwloc warning code :)

For HP:
* If you're lucky, the BIOS may report the number of NUMA nodes
(either in the usual messages during boot, or in the BIOS configuration
menu). If it says 2 on the broken node instead of 4 as on the other nodes,
you have something easy to tell HP.
* Otherwise we'll have to dig into the SRAT ACPI info. "dmesg | grep SRAT"
should report some "PXM" properties, which are basically NUMA
localities. You should see PXM 1 and 2 on the broken node, and PXM 0, 1,
2 and 3 on the other ones. SRAT comes from ACPI, so if SRAT is broken, the
hardware/firmware is buggy.
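
To compare the broken node against a good one without reading the log by
eye, the PXM check above could be scripted roughly as follows. This is only
a sketch: the helper name pxm_ids is made up for illustration, and the
sample lines merely imitate the usual "SRAT: PXM n -> APIC n -> Node n"
kernel message shape; they are not output from the machine discussed here.

```python
import re

# Matches "PXM <number>" anywhere on a line (assumed kernel log wording).
PXM_RE = re.compile(r"\bPXM\s+(\d+)")


def pxm_ids(dmesg_text):
    """Return the sorted PXM (proximity domain) numbers found on SRAT lines."""
    ids = set()
    for line in dmesg_text.splitlines():
        if "SRAT" not in line:
            continue  # same filter as "dmesg | grep SRAT"
        ids.update(int(n) for n in PXM_RE.findall(line))
    return sorted(ids)


# Illustrative input only, not taken from the real host.
sample = """\
SRAT: PXM 1 -> APIC 0 -> Node 0
SRAT: PXM 2 -> APIC 12 -> Node 1
unrelated line mentioning PXM 7
"""
print(pxm_ids(sample))
```

Running it on the saved "dmesg" output of each host and comparing the two
lists should show whether the broken node is missing proximity domains.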

Brice

On 30/05/2012 21:06, John Hanks wrote:
> I updated the BIOS and still got the error on this host, then I did
> what I should have done in the first place and checked another
> physically identical host. Of the 4 nodes I have that are the same,
> only this one exhibits the error. At this point I'm blaming a hardware
> problem. If there's any benefit to hwloc in my sending additional
> debugging information, I am happy to; otherwise I'm going to try to
> figure out what to say to HP to get this node fixed.
>
> Thanks,
>
> jbh
>
> On Wed, May 30, 2012 at 9:27 AM, John Hanks <john.hanks_at_[hidden]> wrote:
>> I recently inherited these machines and would bet small amounts of
>> hard currency they have never seen a BIOS update since birth. I'll
>> figure out how to update the BIOS and let you know if the error
>> persists.
>>
>> Thanks,
>>
>> jbh
>>
>> On Wed, May 30, 2012 at 9:24 AM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>>> On May 30, 2012, at 11:22 AM, Samuel Thibault wrote:
>>>
>>>> i.e. the kernel reports that socket 0 is completely in node 1, while
>>>> socket 1 is half in node 1 and half in node 2. Do you have more
>>>> information about what the machine actually contains socket- and
>>>> NUMA-wise? The Dell website is not really helpful, it talks about 4-16
>>>> cores for the DL165 G7, while you have 24.
>>>
>>> How old is your Dell BIOS firmware? You might need to update it.
>>>
>>> --
>>> Jeff Squyres
>>> jsquyres_at_[hidden]
>>> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>>>
>>>
>>> _______________________________________________
>>> hwloc-users mailing list
>>> hwloc-users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users