
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-02-27 20:48:11


On Feb 27, 2014, at 4:39 PM, Gus Correa <gus_at_[hidden]> wrote:

> Thank you, Ralph!
>
> I did a bit more of homework, and found out that all jobs that had
> the hwloc error involved one specific node (node14).
>
> The "report bindings" output in those jobs' stderr show
> that node14 systematically failed to bind the processes to the cores,
> while other nodes on the same jobs didn't fail.
> Interestingly, the jobs continued to run, although they
> eventually failed, but much later.
> So, the hwloc error doesn't seem to stop the job in its tracks.
> As a matter of policy, should it perhaps shut down the job instead?

We've debated that over the years, but settled on not aborting just because binding failed. Instead, we are supposed to emit a warning that the job may not perform as desired/expected, and then continue.

>
> In addition, when I try the hwloc-gather-topology diagnostic on node14 I get the same error, a bit more verbose (see below).
> So, now my guess is that this may be a hardware problem on that node.

Something doesn't look right, but I doubt it is actually a hardware issue. It appears that the /proc data has been compromised, so I would think that the local disk may be suspect, or the OS may have scribbled where it shouldn't.

Remember, hwloc doesn't actually "sense" hardware - it just parses files in the /proc area. So if something is garbled in those files, hwloc will report errors. That doesn't mean anything is wrong with the hardware at all.
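To illustrate the consistency check behind the "intersection without inclusion" message: in a valid topology, any two objects' cpusets must either be disjoint or nest one inside the other; a partial overlap means the parsed data is inconsistent. The sketch below models cpusets as plain integer bitmasks - it illustrates the invariant only, not hwloc's actual code, and the example masks are made up.

```python
# Sketch of the cpuset invariant behind hwloc's "intersection without
# inclusion" error. Cpusets are modeled as plain Python ints used as
# bitmasks; this illustrates the invariant, not hwloc's implementation.

def cpusets_consistent(a: int, b: int) -> bool:
    """Two cpusets are consistent if they are disjoint or one includes the other."""
    overlap = a & b
    if overlap == 0:                # disjoint: fine
        return True
    if overlap in (a, b):           # one fully contains the other: fine
        return True
    return False                    # partial overlap: inconsistent topology

# Hypothetical example: a socket reported as cpuset 0x0000ffff (cores 0-15)
socket0 = 0x0000FFFF
node_ok = 0x000000FF                # an object fully inside the socket
node_bad = 0x00FFF000               # overlaps the socket but also spills outside

print(cpusets_consistent(socket0, node_ok))   # True
print(cpusets_consistent(socket0, node_bad))  # False -> hwloc reports an error
```

When the /proc or /sys snapshot is corrupted, two objects can end up with masks like `socket0` and `node_bad` above, which is exactly the condition topology.c complains about.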

>
> I replaced two nodes' motherboards last week, including node14's,
> and something may have gone wrong on that one.

I'd suspect something happened to the file system on the disk, or perhaps the kernel came up wrong and the /proc area didn't get updated properly. You might try reinstalling the OS and see if it clears up.

> The other node that had the motherboard replaced
> doesn't show the hwloc-gather-topology error, though.
>
> Does the error message below (Socket P#0 ...)
> suggest anything that I should be looking for on the hardware side?
> (Thermal compound on the heatsink, memory modules, etc)
>
> Thank you,
> Gus Correa
>
>
>
> [root_at_node14 ~]# /usr/bin/hwloc-gather-topology /tmp/$(uname -n)
> Hierarchy gathered in /tmp/node14.tar.bz2 and kept in /tmp/tmp.D46Sdhcnru/node14/
> ****************************************************************************
> * Hwloc has encountered what looks like an error from the operating system.
> *
> * object (Socket P#0 cpuset 0x0000ffff) intersection without inclusion!
> * Error occurred in topology.c line 718
> *
> * Please report this error message to the hwloc user's mailing list,
> * along with the output from the hwloc-gather-topology.sh script.
> ****************************************************************************
> Expected topology output stored in /tmp/node14.output
>
>
> On 02/27/2014 06:39 PM, Ralph Castain wrote:
>> The hwloc in 1.6.5 is very old (v1.3.2), so it's possible it is having
>> trouble with those data/instruction cache breakdowns.
>> I don't know why it wouldn't have shown up before,
>> however, as this looks to be happening when we first try to
>> assemble the topology. To check that, what happens if you just run
>> "mpiexec hostname" on the local node?
>>
>>
>> On Feb 27, 2014, at 3:04 PM, Gus Correa<gus_at_[hidden]> wrote:
>>
>>> Dear OMPI pros
>>>
>>> This seems to be a question in the nowhere land between OMPI and hwloc.
>>> However, it appeared as an OMPI error, hence it may be OK to ask the question in this list.
>>>
>>> ***
>>>
>>> A user here got this error (or warning?) message today:
>>>
>>> + mpiexec -np 64 $HOME/echam-aiv_ldeo_6.1.00p1/bin/echam6
>>> ****************************************************************************
>>> * Hwloc has encountered what looks like an error from the operating system.
>>> *
>>> * object intersection without inclusion!
>>> * Error occurred in topology.c line 594
>>> *
>>> * Please report this error message to the hwloc user's mailing list,
>>> * along with the output from the hwloc-gather-topology.sh script.
>>> ****************************************************************************
>>>
>>> Additional info:
>>>
>>> 1) We have OMPI 1.6.5. This user is using the one built
>>> with Intel compilers 2011.13.367.
>>>
>>> 2) I set these MCA parameters in $OMPI/etc/openmpi-mca-params.conf (includes binding to core):
>>>
>>> btl = ^tcp
>>> orte_tag_output = 1
>>> rmaps_base_schedule_policy = core
>>> orte_process_binding = core
>>> orte_report_bindings = 1
>>> opal_paffinity_alone = 1
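For reference, the same settings can also be passed per job on the mpiexec command line instead of through openmpi-mca-params.conf. This is a sketch using the 1.6-series parameter names quoted above; the application path and process count are placeholders.

```shell
# Per-job equivalents of the openmpi-mca-params.conf entries above,
# passed with --mca (Open MPI 1.6-series parameter names).
mpiexec --mca btl ^tcp \
        --mca rmaps_base_schedule_policy core \
        --mca orte_process_binding core \
        --mca orte_report_bindings 1 \
        -np 64 ./my_mpi_app
```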
>>>
>>>
>>> 3) The machines have dual-socket 16-core AMD Opteron 6376 (Abu-Dhabi),
>>> which have one FPU for each pair of cores, a hierarchy of caches serving
>>> sub-groups of cores, etc.
>>> The OS is Linux CentOS 6.4 with stock CentOS OFED.
>>> Interconnect is Infiniband QDR (Mellanox HW).
>>>
>>> 4) We have Torque 4.2.5, built with cpuset support.
>>> OMPI is built with Torque (tm) support.
>>>
>>> 5) In case it helps, I attach the output of
>>> hwloc-gather-topology, which I ran on the node that threw the error,
>>> although not immediately after the job failure.
>>> I used the hwloc-gather-topology script that comes with
>>> the hwloc (version 1.5) provided by CentOS.
>>> As far as I can tell, the hwloc bits built into OMPI
>>> do not include the hwloc-gather-topology script (although it may be a newer hwloc version, 1.8 perhaps?).
>>> Hopefully the mail servers won't chop off the attachments.
>>>
>>> 6) I am a bit surprised by this error message, because I haven't
>>> seen it before, although we have used OMPI 1.6.5 in
>>> this machine with several other programs without problems.
>>> Alas, it happened now.
>>>
>>> **
>>>
>>> - Is this a known hwloc problem in this processor architecture?
>>>
>>> - Is this a known issue in this combination of HW and SW?
>>>
>>> - Would not binding the MPI processes (to core or socket) perhaps help?
>>>
>>> - Any workarounds or suggestions?
>>>
>>> **
>>>
>>> Thank you,
>>> Gus Correa
>>> <node15.output><node15.tar.bz2>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users