
Hardware Locality Users' Mailing List Archives


Subject: Re: [hwloc-users] [WARNING: A/V UNSCANNABLE]Re: [OMPI users] SIGSEGV in opal_hwlock152_hwlock_bitmap_or.A // Bug in 'hwlock" ?
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2013-11-04 16:51:51


You should be able to grab an Open MPI 1.7.x nightly tarball, and it should have the newer hwloc that fixes this issue.
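
One quick way to confirm which hwloc a given Open MPI install embeds, before re-running the failing job, is sketched below. It assumes that `ompi_info` from the build under test is first in `PATH` and that the embedded hwloc appears as an MCA component line in its output (something like `hwloc152` for hwloc 1.5.2); the exact component name in a nightly build may differ.

```shell
#!/bin/sh
# Report the hwloc component of whichever Open MPI is first in PATH.
# Assumption: ompi_info lists the embedded hwloc on a line containing "hwloc".
check_ompi_hwloc() {
    if ! command -v ompi_info >/dev/null 2>&1; then
        echo "ompi_info not found in PATH"
        return 1
    fi
    # Print matching lines, or a fallback message if nothing matches.
    ompi_info | grep -i "hwloc" || echo "no hwloc component reported"
}

check_ompi_hwloc || true
```

If the reported component still corresponds to hwloc 1.5.2, the shell is most likely picking up the older install rather than the nightly.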

Can you give it a whirl and see if it works for you?
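
Until a fixed build is in place, the workaround Brice describes in the quoted message below (setting HWLOC_GROUPING=0) could be applied roughly as follows. This is only a sketch: `run_with_workaround` is a hypothetical helper, `./my_mpi_app` and the process count are placeholders, and `-x` is used on the assumption that the variable should also reach processes launched on other nodes.

```shell
#!/bin/sh
# Workaround from Brice's message: set HWLOC_GROUPING=0 in the environment.
export HWLOC_GROUPING=0

# Hypothetical helper: launch a program under mpirun with the variable
# forwarded to the launched processes. The process count is a placeholder.
run_with_workaround() {
    if command -v mpirun >/dev/null 2>&1; then
        mpirun -x HWLOC_GROUPING -np 2 "$@"
    else
        echo "mpirun not found; HWLOC_GROUPING=$HWLOC_GROUPING is set for this shell"
    fi
}

# Example (placeholder binary name): run_with_workaround ./my_mpi_app
```

Once an Open MPI build with the newer hwloc is in use, the variable can be unset again.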


On Nov 4, 2013, at 1:49 PM, Brice Goglin <Brice.Goglin_at_[hidden]> wrote:

> Thanks. That's indeed the same bug that you got in Open MPI (reuse of a
> hwloc cpuset structure that was freed earlier). It's a nasty bug that
> happens when reloading from XML on big machines like yours (that
> explains why lstopo works while xmlbuffer and OMPI fail). It was fixed
> in hwloc v1.7.1 (hence will be fixed in Open MPI 1.7.4 from what I
> understand) but the fix was too big to be backported to older hwloc/OMPI.
>
> You should be able to work around the problem for now by setting
> HWLOC_GROUPING=0 in your environment.
>
> I re-added hwloc-users to CC so that the bug is officially "closed".
>
> Brice
>
>
>
>
> Le 04/11/2013 22:33, Paul Kapinos a écrit :
>> Hello again,
>> I'm not allowed to post to the Hardware Locality users list, so I am
>> omitting it for now.
>>
>> On 11/04/13 14:19, Brice Goglin wrote:
>>> Le 04/11/2013 11:44, Paul Kapinos a écrit :
>>>> Hello all,
>>>> I.
>>>> Sorry for this paleontological excursion. (The 4-year-old 'lstopo'
>>>> binary was just sitting in my private bin folder and was still runnable.)
>>>>
>>>> Attached is the output of the newer version 1.5 (the Linux default on
>>>> RHEL/6.4 (SL/6.4)).
>>>>
>>>> II.
>>>> I've also tested hwloc-1.5.2 (could not find v.1.5.3) and hwloc-1.7.2
>>>> as Brice suggested, by 'configure' + 'make test' - logs attached.
>>>>
>>>> 1.5.2 fails:
>>>>> /bin/sh: line 5: 20677 Segmentation fault (core dumped) ${dir}$tst
>>>>> FAIL: xmlbuffer
>>>
>>> Can you give more details about this segfault?
>>>
>>> Try (from the build tree):
>>> $ libtool --mode=execute gdb xmlbuffer
>>> then type 'run'
>>> when it crashes, type 'bt full' and send the output.
>>
>> see attached file trace_1.5.2.txt
>>
>>
>>
>>
>>
>>>
>>> Then please also run from hwloc 1.5.2:
>>> * "lstopo foo.xml" and send "foo.xml"
>>> * "hwloc-gather-topology foo" and send "foo.tar.bz2"
>>
>> also attached but with non-empty names :o)
>>
>>
>>
>> Best
>>
>> Paul
>>>
>>>> whereas 1.7.2 seems to be OK.
>>>>
>>>> AFAIK the version of 'hwloc' is going to be updated in Open MPI 1.7.4?
>>>> If so, that should fix the original issue, right?
>>>
>>> Hard to say before we get details about the crash in xmlbuffer above.
>>>
>>> Brice
>>>
>>>
>>>>
>>>> Many thanks for your help!
>>>> Best
>>>>
>>>> Paul
>>>>
>>>> pk224850_at_linuxitvc00:~/SVN/mpifasttest/trunk[511]lstopo 1.5
>>>> $ lstopo lstopo_linuxitvc00_1.5.txt
>>>> $ lstopo lstopo_linuxitvc00_1.5.xml
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 11/01/13 15:37, Brice Goglin wrote:
>>>>> Sorry, I missed the mail on OMPI-users.
>>>>>
>>>>> This hwloc looks veeeeeeeeeeeery old. We haven't used Misc objects in
>>>>> place of Groups since we switched from 0.9 to 1.0. You should regenerate
>>>>> the XML file with a hwloc version that came out after the big bang (or
>>>>> better, after the asteroid killed the dinosaurs). Please resend that XML
>>>>> from a recent hwloc so that we can get a better clue of the problem.
>>>>>
>>>>> Assuming there's a bug in OMPI's hwloc, I would suggest downloading
>>>>> hwloc 1.5.3 and running make check on that machine. And try again with
>>>>> hwloc 1.7.2 in case that's already fixed.
>>>>>
>>>>> thanks
>>>>> Brice
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Le 01/11/2013 15:24, Jeff Squyres (jsquyres) a écrit :
>>>>>> Paul Kapinos originally reported this issue on the OMPI users list.
>>>>>>
>>>>>> He is showing a stack trace from OMPI-1.7.3, which uses hwloc 1.5.2
>>>>>> (note that OMPI 1.7.4 will use hwloc 1.7.2).
>>>>>>
>>>>>> I tried to read the xml file he provided with the git hwloc master
>>>>>> HEAD, and it fails:
>>>>>>
>>>>>> -----
>>>>>> ❯❯❯ ./utils/lstopo -i lstopo_linuxitvc00.xml
>>>>>> ignoring depth attribute for object type without depth
>>>>>> ignoring depth attribute for object type without depth
>>>>>> XML component discovery failed.
>>>>>> hwloc_topology_load() failed (Invalid argument).
>>>>>> -----
>>>>>>
>>>>>> Any idea what's happening here?
>>>>>>
>>>>>> BTW, I can apply the fix to both the OMPI SVN trunk and v1.7 branch
>>>>>> (since OMPI v1.7 is now up to hwloc 1.7.2).
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Oct 31, 2013, at 1:28 PM, Paul Kapinos
>>>>>> <kapinos_at_[hidden]> wrote:
>>>>>>
>>>>>>> Hello all,
>>>>>>>
>>>>>>> using 1.7.x (1.7.2 and 1.7.3 tested), we get a SIGSEGV from somewhere
>>>>>>> deep inside the 'hwloc' library - see the attached screenshot.
>>>>>>>
>>>>>>> Because the error is tied to just one single node, which in turn is
>>>>>>> a rather special one (see the output of 'lstopo -'), it smells like an
>>>>>>> error in the 'hwloc' library.
>>>>>>>
>>>>>>> Is there a way to disable hwloc or to debug it somehow?
>>>>>>> (besides building a debug version of hwloc and Open MPI)
>>>>>>>
>>>>>>> Best
>>>>>>>
>>>>>>> Paul
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Dipl.-Inform. Paul Kapinos - High Performance Computing,
>>>>>>> RWTH Aachen University, Center for Computing and Communication
>>>>>>> Seffenter Weg 23, D 52074 Aachen (Germany)
>>>>>>> Tel: +49 241/80-24915
>>>>>>>
>>>>>>> <lstopo_linuxitvc00.txt><opal_hwlock_SIGSEGV.png><lstopo_linuxitvc00.xml>
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> users_at_[hidden]
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jeff Squyres
>>>>>> jsquyres_at_[hidden]
>>>>>> For corporate legal information go to:
>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> hwloc-users mailing list
>>>>>> hwloc-users_at_[hidden]
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
>>>>>
>>>>
>>>>
>>>
>>
>>
>


--
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/