Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-02-28 14:59:29


You might also want to check the BIOS rev level on node14, Gus - as Brice suggested, it could be that the board came with the wrong firmware.
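
In case it helps, one quick way to compare firmware between node14 and a known-good node is to query the DMI tables with dmidecode (assuming it is installed, as it usually is on CentOS); run as root on each node, something along these lines should show the relevant strings:

  dmidecode -s bios-version
  dmidecode -s bios-release-date
  dmidecode -s baseboard-product-name

These are the standard dmidecode "-s" string keywords; the man page lists the full set.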

On Feb 28, 2014, at 11:55 AM, Gus Correa <gus_at_[hidden]> wrote:

> Hi Brice and Ralph
>
> Many thanks for helping out with this!
>
> Yes, you are right about node15 being OK.
> Node15 was a red herring: it was part of the same failed job as node14.
> However, after a closer look, I noticed that the failure reported
> by hwloc was indeed on node14.
>
> I attach both diagnostic files generated by hwloc-gather-topology on
> node14.
>
> I will open the node and see if there is anything unusual with the
> hardware, and perhaps reinstall the OS, as Ralph suggested.
> It is odd that the other node that had its motherboard replaced
> passes the hwloc-gather-topology test.
> After the motherboard replacement I reinstalled the OS on both nodes,
> but it doesn't hurt to do it again.
>
> Gus Correa
>
>
>
>
> On 02/28/2014 03:26 AM, Brice Goglin wrote:
>> Hello Gus,
>> I'll need the tarball generated by gather-topology on node14 to debug
>> this. node15 doesn't have any issue.
>> We've seen issues on AMD machines because of buggy BIOS reporting
>> incompatible Socket and NUMA info. If node14 doesn't have the same BIOS
>> version as other nodes, that could explain things.
>> Brice
>>
>>
>>
>>
>> Le 28/02/2014 01:39, Gus Correa a écrit :
>>> Thank you, Ralph!
>>>
>>> I did a bit more homework and found out that all the jobs that had
>>> the hwloc error involved one specific node (node14).
>>>
>>> The "report bindings" output in those jobs' stderr shows
>>> that node14 systematically failed to bind the processes to the cores,
>>> while the other nodes in the same jobs didn't fail.
>>> Interestingly, the jobs continued to run, although they
>>> eventually failed, much later.
>>> So the hwloc error doesn't seem to stop the job in its tracks.
>>> As a matter of policy, should it perhaps shut down the job instead?
>>>
>>> In addition, when I try the hwloc-gather-topology diagnostic on node14
>>> I get the same error, a bit more verbose (see below).
>>> So, now my guess is that this may be a hardware problem on that node.
>>>
>>> I replaced two nodes' motherboards last week, including node14's,
>>> and something may have gone wrong on that one.
>>> The other node that had the motherboard replaced
>>> doesn't show the hwloc-gather-topology error, though.
>>>
>>> Does the error message below (Socket P#0 ...)
>>> suggest anything that I should be looking for on the hardware side?
>>> (Thermal compound on the heatsink, memory modules, etc.)
>>>
>>> Thank you,
>>> Gus Correa
>>>
>>>
>>>
>>> [root_at_node14 ~]# /usr/bin/hwloc-gather-topology /tmp/$(uname -n)
>>> Hierarchy gathered in /tmp/node14.tar.bz2 and kept in
>>> /tmp/tmp.D46Sdhcnru/node14/
>>> ****************************************************************************
>>> * Hwloc has encountered what looks like an error from the operating system.
>>> *
>>> * object (Socket P#0 cpuset 0x0000ffff) intersection without inclusion!
>>> * Error occurred in topology.c line 718
>>> *
>>> * Please report this error message to the hwloc user's mailing list,
>>> * along with the output from the hwloc-gather-topology.sh script.
>>> ****************************************************************************
>>>
>>> Expected topology output stored in /tmp/node14.output
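>>>
>>> (In case it is useful: after extracting the tarball, the gathered /proc
>>> and /sys data can apparently be replayed offline with lstopo, e.g.
>>> something like
>>>
>>> tar xjf /tmp/node14.tar.bz2 -C /tmp
>>> lstopo --input /tmp/node14
>>>
>>> assuming the local lstopo supports --input with a directory; the
>>> node14.output file keeps the expected lstopo output for comparison.)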
>>>
>>>
>>> On 02/27/2014 06:39 PM, Ralph Castain wrote:
>>>> The hwloc in 1.6.5 is very old (v1.3.2), so it's possible it is having
>>>> trouble with those data/instruction cache breakdowns. I don't know why
>>>> it wouldn't have shown up before, however, as this looks to be happening
>>>> when we first try to assemble the topology. To check that, what happens
>>>> if you just run "mpiexec hostname" on the local node?
>>>>
>>>>
>>>> On Feb 27, 2014, at 3:04 PM, Gus Correa<gus_at_[hidden]> wrote:
>>>>
>>>>> Dear OMPI pros
>>>>>
>>>>> This seems to be a question in the no-man's-land between OMPI and hwloc.
>>>>> However, it appeared as an OMPI error, hence it may be OK to ask the
>>>>> question on this list.
>>>>>
>>>>> ***
>>>>>
>>>>> A user here got this error (or warning?) message today:
>>>>>
>>>>> + mpiexec -np 64 $HOME/echam-aiv_ldeo_6.1.00p1/bin/echam6
>>>>> ****************************************************************************
>>>>> * Hwloc has encountered what looks like an error from the operating system.
>>>>> *
>>>>> * object intersection without inclusion!
>>>>> * Error occurred in topology.c line 594
>>>>> *
>>>>> * Please report this error message to the hwloc user's mailing list,
>>>>> * along with the output from the hwloc-gather-topology.sh script.
>>>>> ****************************************************************************
>>>>>
>>>>>
>>>>> Additional info:
>>>>>
>>>>> 1) We have OMPI 1.6.5. This user is using the one built
>>>>> with Intel compilers 2011.13.367.
>>>>>
>>>>> 2) I set these MCA parameters in $OMPI/etc/openmpi-mca-params.conf
>>>>> (including binding to core):
>>>>>
>>>>> btl = ^tcp
>>>>> orte_tag_output = 1
>>>>> rmaps_base_schedule_policy = core
>>>>> orte_process_binding = core
>>>>> orte_report_bindings = 1
>>>>> opal_paffinity_alone = 1
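>>>>>
>>>>> (These can be double-checked with ompi_info; something along these
>>>>> lines should list the effective values:
>>>>>
>>>>> ompi_info --param all all | grep -E 'binding|schedule|paffinity'
>>>>>
>>>>> though that is just a sketch, grepping for the names I set above.)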
>>>>>
>>>>>
>>>>> 3) The machines are dual-socket, with 16-core AMD Opteron 6376
>>>>> (Abu Dhabi) processors, which have one FPU for each pair of cores,
>>>>> a hierarchy of caches serving sub-groups of cores, etc.
>>>>> The OS is Linux CentOS 6.4 with the stock CentOS OFED.
>>>>> The interconnect is QDR InfiniBand (Mellanox HW).
>>>>>
>>>>> 4) We have Torque 4.2.5, built with cpuset support.
>>>>> OMPI is built with Torque (tm) support.
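>>>>> (As a sanity check, something like "ompi_info | grep ' tm '"
>>>>> should list the tm components, e.g. ras and plm, if the build really
>>>>> picked up the Torque support.)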
>>>>>
>>>>> 5) In case it helps, I attach the output of
>>>>> hwloc-gather-topology, which I ran on the node that threw the error,
>>>>> although not immediately after the job failure.
>>>>> I used the hwloc-gather-topology script that comes with
>>>>> the hwloc (version 1.5) provided by CentOS.
>>>>> As far as I can tell, the hwloc bits built into OMPI
>>>>> do not include the hwloc-gather-topology script (although they may be
>>>>> from a newer hwloc version; 1.8 perhaps?).
>>>>> Hopefully the mail servers won't chop off the attachments.
>>>>>
>>>>> 6) I am a bit surprised by this error message because I haven't
>>>>> seen it before, although we have used OMPI 1.6.5 on
>>>>> this machine with several other programs without problems.
>>>>> Alas, it has happened now.
>>>>>
>>>>> **
>>>>>
>>>>> - Is this a known hwloc problem in this processor architecture?
>>>>>
>>>>> - Is this a known issue in this combination of HW and SW?
>>>>>
>>>>> - Would not binding the MPI processes (to core or socket) perhaps
>>>>> help? (See the sketch right after this list.)
>>>>>
>>>>> - Any workarounds or suggestions?
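>>>>>
>>>>> (By "not binding" I mean, for instance, something along these lines,
>>>>> just as a sketch assuming the 1.6-style options:
>>>>>
>>>>> # one-off, on the command line
>>>>> mpiexec --bind-to-none -np 64 $HOME/echam-aiv_ldeo_6.1.00p1/bin/echam6
>>>>> # or, in openmpi-mca-params.conf
>>>>> orte_process_binding = none
>>>>>
>>>>> instead of the binding-to-core settings above.)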
>>>>>
>>>>> **
>>>>>
>>>>> Thank you,
>>>>> Gus Correa
>>>>> <node15.output><node15.tar.bz2>
>>>>
>>>
>>
>
> <node14.output><node14.tar.bz2>