Hardware Locality Users' Mailing List Archives

Subject: Re: [hwloc-users] [WARNING: A/V UNSCANNABLE]Re: [OMPI users] SIGSEGV in opal_hwlock152_hwlock_bitmap_or.A // Bug in 'hwlock" ?
From: Brice Goglin (Brice.Goglin_at_[hidden])
Date: 2013-11-04 16:49:48


Thanks. That's indeed the same bug that you got in Open MPI (reuse of a
hwloc cpuset structure that was freed earlier). It's a nasty bug that
happens when reloading from XML on big machines like yours (that
explains why lstopo works while xmlbuffer and OMPI fail). It was fixed
in hwloc v1.7.1 (hence will be fixed in Open MPI 1.7.4 from what I
understand) but the fix was too big to be backported to older hwloc/OMPI.

You should be able to work around the problem for now by setting
HWLOC_GROUPING=0 in your environment.
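
As a minimal sketch of that workaround (the application name and process
count below are hypothetical placeholders, not taken from this thread), the
variable can be set in the shell that launches the job. According to the
hwloc documentation, HWLOC_GROUPING=0 disables the distance-based grouping
that adds intermediate Group objects to the topology. The -x option of
Open MPI's mpirun forwards an environment variable to the launched
processes.

```shell
# Disable hwloc's distance-based grouping for everything started from this shell.
export HWLOC_GROUPING=0

# Print the setting so that it shows up in the job log.
echo "HWLOC_GROUPING=${HWLOC_GROUPING}"

# Hypothetical launch line (./my_app and -np 4 are placeholders); -x forwards
# the variable from this environment to the remote ranks:
#   mpirun -x HWLOC_GROUPING -np 4 ./my_app
```

The same setting can be used when rerunning the failing xmlbuffer test from
the hwloc build tree, to check whether the crash goes away.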

I re-added hwloc-users to CC so that the bug is officially "closed".

Brice

On 04/11/2013 22:33, Paul Kapinos wrote:
> Hello again,
> I'm not allowed to post to the hwloc-users list, so I am leaving it
> out for now.
>
> On 11/04/13 14:19, Brice Goglin wrote:
>> On 04/11/2013 11:44, Paul Kapinos wrote:
>>> Hello all,
>>> I.
>>> Sorry for this paleontological excursion. (The 4-year-old 'lstopo'
>>> binary was just sitting in my private bin folder and still ran.)
>>>
>>> Attached is the output of the newer version 1.5 (the Linux default
>>> one on RHEL 6.4 / SL 6.4).
>>>
>>> II.
>>> I've also tested hwloc-1.5.2 (could not find v1.5.3) and hwloc-1.7.2
>>> as Brice suggested, using 'configure' + 'make test' - logs attached.
>>>
>>> 1.5.2 fails:
>>>> /bin/sh: line 5: 20677 Segmentation fault (core dumped) ${dir}$tst
>>>> FAIL: xmlbuffer
>>
>> Can you give more details about this segfault?
>>
>> Try (from the build tree):
>> $ libtool --mode=execute gdb xmlbuffer
>> then type 'run'
>> when it crashes, type 'bt full' and send the output.
>
> see attached file trace_1.5.2.txt
>
>
>
>
>
>>
>> Then please also run from hwloc 1.5.2:
>> * "lstopo foo.xml" and send "foo.xml"
>> * "hwloc-gather-topology foo" and send "foo.tar.bz2"
>
> also attached but with non-empty names :o)
>
>
>
> Best
>
> Paul
>>
>>> whereas 1.7.2 seems to be OK.
>>>
>>> AFAIK the version of hwloc is to be updated in Open MPI 1.7.4?
>>> If so, that should fix the original issue, right?
>>
>> Hard to say before we get details about the crash in xmlbuffer above.
>>
>> Brice
>>
>>
>>>
>>> Many thanks for your help!
>>> Best
>>>
>>> Paul
>>>
>>> pk224850_at_linuxitvc00:~/SVN/mpifasttest/trunk[511]lstopo 1.5
>>> $ lstopo lstopo_linuxitvc00_1.5.txt
>>> $ lstopo lstopo_linuxitvc00_1.5.xml
>>>
>>>
>>>
>>>
>>>
>>> On 11/01/13 15:37, Brice Goglin wrote:
>>>> Sorry, I missed the mail on OMPI-users.
>>>>
>>>> This hwloc looks veeeeeeeeeeeery old. We haven't had Misc objects
>>>> instead of Groups since we switched from 0.9 to 1.0. You should
>>>> regenerate the XML file with a hwloc version that came out after the
>>>> big bang (or better, after the asteroid killed the dinosaurs). Please
>>>> resend that XML from a recent hwloc so that we can get a better clue
>>>> of the problem.
>>>>
>>>> Assuming there's a bug in OMPI's hwloc, I would suggest downloading
>>>> hwloc 1.5.3 and running make check on that machine. And try again
>>>> with hwloc 1.7.2 in case that's already fixed.
>>>>
>>>> thanks
>>>> Brice
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 01/11/2013 15:24, Jeff Squyres (jsquyres) wrote:
>>>>> Paul Kapinos originally reported this issue on the OMPI users list.
>>>>>
>>>>> He is showing a stack trace from OMPI-1.7.3, which uses hwloc 1.5.2
>>>>> (note that OMPI 1.7.4 will use hwloc 1.7.2).
>>>>>
>>>>> I tried to read the xml file he provided with the git hwloc master
>>>>> HEAD, and it fails:
>>>>>
>>>>> -----
>>>>> ❯❯❯ ./utils/lstopo -i lstopo_linuxitvc00.xml
>>>>> ignoring depth attribute for object type without depth
>>>>> ignoring depth attribute for object type without depth
>>>>> XML component discovery failed.
>>>>> hwloc_topology_load() failed (Invalid argument).
>>>>> -----
>>>>>
>>>>> Any idea what's happening here?
>>>>>
>>>>> BTW, I can apply the fix to both the OMPI SVN trunk and v1.7 branch
>>>>> (since OMPI v1.7 is now up to hwloc 1.7.2).
>>>>>
>>>>>
>>>>>
>>>>> On Oct 31, 2013, at 1:28 PM, Paul Kapinos
>>>>> <kapinos_at_[hidden]> wrote:
>>>>>
>>>>>> Hello all,
>>>>>>
>>>>>> using 1.7.x (1.7.2 and 1.7.3 tested), we get a SIGSEGV from somewhere
>>>>>> deep inside the hwloc library - see the attached screenshot.
>>>>>>
>>>>>> Because the error is tied to just one single node, which in turn is a
>>>>>> rather special one (see the output of 'lstopo -'), it smells like an
>>>>>> error in the hwloc library.
>>>>>>
>>>>>> Is there a way to disable hwloc or to debug it somehow
>>>>>> (besides building a debug version of hwloc and Open MPI)?
>>>>>>
>>>>>> Best
>>>>>>
>>>>>> Paul
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Dipl.-Inform. Paul Kapinos - High Performance Computing,
>>>>>> RWTH Aachen University, Center for Computing and Communication
>>>>>> Seffenter Weg 23, D 52074 Aachen (Germany)
>>>>>> Tel: +49 241/80-24915
>>>>>>
>>>>> <lstopo_linuxitvc00.txt><opal_hwlock_SIGSEGV.png><lstopo_linuxitvc00.xml>
>>>>> _______________________________________________
>>>>>
>>>>>
>>>>>> users mailing list
>>>>>> users_at_[hidden]
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>
>>>>>
>>>>> --
>>>>> Jeff Squyres
>>>>> jsquyres_at_[hidden]
>>>>> For corporate legal information go to:
>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> hwloc-users mailing list
>>>>> hwloc-users_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
>>>>
>>>
>>>
>>
>
>