Hardware Locality Users' Mailing List Archives

Subject: Re: [hwloc-users] Hwloc error.
From: John Hanks (john.hanks_at_[hidden])
Date: 2012-05-31 13:33:16


Closing this for the curious. Took a walk to the datacenter and pulled
this server and a neighbor to compare it to a known good server and
discovered that two DIMMs were installed in the wrong sockets.
Correcting that resolved the missing NUMA nodes.

Thanks,

jbh
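
For anyone who lands on this thread with the same symptom, the SRAT check Brice describes further down (looking at which "PXM" values appear in `dmesg | grep SRAT`) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `pxm_ids` and `missing_pxm` are made up for this example, and the sample log lines are hypothetical, written in the style the Linux kernel uses for SRAT messages rather than copied from the machines in this thread. Exact kernel message formats vary between versions, so the regular expression may need adjusting.

```python
import re

# Matches the proximity-domain number on a kernel SRAT line, e.g.
# "SRAT: PXM 1 -> APIC 0x10 -> Node 0" or "SRAT: Node 0 PXM 1 [mem ...]".
# The pattern is an assumption about the log format, not a kernel guarantee.
PXM_RE = re.compile(r"SRAT:.*?\bPXM\s+(\d+)")


def pxm_ids(dmesg_text):
    """Return the sorted list of distinct PXM ids found on SRAT lines."""
    ids = set()
    for line in dmesg_text.splitlines():
        match = PXM_RE.search(line)
        if match:
            ids.add(int(match.group(1)))
    return sorted(ids)


def missing_pxm(reference_text, suspect_text):
    """Return PXM ids present on the reference host but absent on the suspect."""
    return sorted(set(pxm_ids(reference_text)) - set(pxm_ids(suspect_text)))


# Hypothetical output from a healthy host: four proximity domains.
SAMPLE_GOOD = """\
SRAT: PXM 0 -> APIC 0x10 -> Node 0
SRAT: PXM 1 -> APIC 0x16 -> Node 1
SRAT: PXM 2 -> APIC 0x20 -> Node 2
SRAT: PXM 3 -> APIC 0x26 -> Node 3
"""

# Hypothetical output from a host like the broken one: only PXM 1 and 2.
SAMPLE_BROKEN = """\
SRAT: PXM 1 -> APIC 0x10 -> Node 0
SRAT: PXM 1 -> APIC 0x16 -> Node 0
SRAT: PXM 2 -> APIC 0x20 -> Node 1
"""

if __name__ == "__main__":
    print("good host PXMs:  ", pxm_ids(SAMPLE_GOOD))
    print("broken host PXMs:", pxm_ids(SAMPLE_BROKEN))
    print("missing on broken host:", missing_pxm(SAMPLE_GOOD, SAMPLE_BROKEN))
```

To use it on real hosts, save the output of `dmesg` from a known good machine and from the suspect one and pass both strings to `missing_pxm`. A non-empty result points at the firmware or hardware (in this case, misplaced DIMMs) rather than at hwloc.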

On Wed, May 30, 2012 at 11:27 PM, John Hanks <john.hanks_at_[hidden]> wrote:
> Brice,
>
> Thanks for the advice, I may have gotten lucky. During POST it clearly
> shows 4 nodes: Node 0, Node 1, Node 2 and Node 3, with nodes 0 and 3
> marked N/A. Have sent a screenshot of that to HP.
>
> jbh
>
> On Wed, May 30, 2012 at 1:26 PM, Brice Goglin <Brice.Goglin_at_[hidden]> wrote:
>> We don't need any other info on the hwloc side. And we thank you for
>> testing the big hwloc warning code :)
>>
>> For HP:
>> * If you're lucky, the BIOS may talk about the number of NUMA nodes
>> (either on the usual messages during boot, or in the BIOS configuration
>> menu). If it says 2 on the broken node instead of 4 as on the other
>> nodes, you have something easy to tell HP.
>> * Otherwise we'll have to dig in the SRAT ACPI info. "dmesg | grep SRAT"
>> should talk about some "PXM" properties, which are basically NUMA
>> localities. You should see PXM 1 and 2 on the broken node, and PXM 0, 1,
>> 2 and 3 on the other ones. SRAT comes from ACPI; if SRAT is broken, the
>> hardware/firmware is buggy.
>>
>> Brice
>>
>>
>>
>>
>> On 30/05/2012 21:06, John Hanks wrote:
>>> I updated the BIOS and still got the error on this host, then I did
>>> what I should have done in the first place and checked another
>>> physically identical host. Of the 4 nodes I have that are the same,
>>> only this one exhibits the error. At this point I'm blaming a hardware
>>> problem. If there's any benefit to hwloc in my sending additional
>>> debugging information, I am happy to; otherwise I'm going to try to
>>> figure out what to say to HP to get this node fixed.
>>>
>>> Thanks,
>>>
>>> jbh
>>>
>>> On Wed, May 30, 2012 at 9:27 AM, John Hanks <john.hanks_at_[hidden]> wrote:
>>>> I recently inherited these machines and would bet small amounts of
>>>> hard currency they have never seen a BIOS update since birth. I'll
>>>> figure out how to update the BIOS and let you know if the error
>>>> persists.
>>>>
>>>> Thanks,
>>>>
>>>> jbh
>>>>
>>>> On Wed, May 30, 2012 at 9:24 AM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>>>>> On May 30, 2012, at 11:22 AM, Samuel Thibault wrote:
>>>>>
>>>>>> i.e. the kernel reports that socket 0 is completely in node 1, while
>>>>>> socket 1 is half in node 1 and half in node 2. Do you have more
>>>>>> information about what the machine actually contains socket- and
>>>>>> NUMA-wise? The dell website is not really helpful, it talks about 4-16
>>>>>> cores for the DL165 G7, while you have 24.
>>>>>
>>>>> How old is your Dell BIOS firmware?  You might need to update it.
>>>>>
>>>>> --
>>>>> Jeff Squyres
>>>>> jsquyres_at_[hidden]
>>>>> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> hwloc-users mailing list
>>>>> hwloc-users_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users