Hardware Locality Users' Mailing List Archives

Subject: Re: [hwloc-users] [OMPI users] SIGSEGV in opal_hwlock152_hwlock_bitmap_or.A // Bug in 'hwlock" ?
From: Brice Goglin (Brice.Goglin_at_[hidden])
Date: 2013-11-04 08:19:28


On 04/11/2013 11:44, Paul Kapinos wrote:
> Hello all,
> I.
> sorry for this paleontological excursion. (The 4-year-old 'lstopo'
> binary was just sitting in my private bin folder and was still runnable.)
>
> Attached is the output of the newer version 1.5 (the Linux default on
> RHEL/6.4 (SL/6.4)).
>
> II.
> I've also tested hwloc-1.5.2 (could not find v1.5.3) and hwloc-1.7.2,
> as Brice suggested, with 'configure' + 'make test' - logs attached.
>
> 1.5.2 fails:
> >/bin/sh: line 5: 20677 Segmentation fault (core dumped) ${dir}$tst
> >FAIL: xmlbuffer

Can you give more details about this segfault?

Try (from the build tree):
$ libtool --mode=execute gdb xmlbuffer
then type 'run'
when it crashes, type 'bt full' and send the output.
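
The same steps can also be run non-interactively, so that the backtrace lands in a file that is easy to attach. This is only a minimal sketch: it assumes a gdb that accepts the standard `-batch` and `-ex` options, and `bt_command` is a hypothetical helper name used for illustration, not something shipped with hwloc or libtool.

```shell
#!/bin/sh
# Sketch: build the non-interactive equivalent of the gdb session above.
# bt_command is a hypothetical helper, not part of hwloc or libtool.
bt_command() {
    # $1: test program inside the build tree, e.g. xmlbuffer
    # -batch makes gdb exit when the commands are done;
    # each -ex runs one gdb command in order ('run', then 'bt full').
    printf '%s\n' "libtool --mode=execute gdb -batch -ex run -ex 'bt full' $1"
}

# Print the command for inspection. To execute it and capture the output:
#   bt_command xmlbuffer | sh > xmlbuffer-bt.txt 2>&1
bt_command xmlbuffer
```

The helper only prints the command, so nothing is run until it is piped to `sh` from the directory that contains the test program.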

Then please also run from hwloc 1.5.2:
* "lstopo foo.xml" and send "foo.xml"
* "hwloc-gather-topology foo" and send "foo.tar.bz2"

> whereas 1.7.2 seems to be OK.
>
> AFAIK the 'hwloc' version is going to be updated in OpenMPI 1.7.4?
> If so, that should fix the original issue, right?

Hard to say before we get details about the crash in xmlbuffer above.

Brice

>
> Many thanks for your help!
> Best
>
> Paul
>
> pk224850_at_linuxitvc00:~/SVN/mpifasttest/trunk[511]lstopo 1.5
> $ lstopo lstopo_linuxitvc00_1.5.txt
> $ lstopo lstopo_linuxitvc00_1.5.xml
>
>
>
>
>
> On 11/01/13 15:37, Brice Goglin wrote:
>> Sorry, I missed the mail on OMPI-users.
>>
>> This hwloc looks veeeeeeeeeeeery old. We haven't had Misc objects
>> instead of Groups since we switched from 0.9 to 1.0. You should
>> regenerate the XML file with a hwloc version that came out after the
>> big bang (or better, after the asteroid killed the dinosaurs). Please
>> resend that XML from a recent hwloc so that we can get a better clue
>> of the problem.
>>
>> Assuming there's a bug in OMPI's hwloc, I would suggest downloading
>> hwloc 1.5.3 and running make check on that machine. And try again
>> with hwloc 1.7.2 in case that's already fixed.
>>
>> thanks
>> Brice
>>
>>
>>
>>
>>
>>
>> On 01/11/2013 15:24, Jeff Squyres (jsquyres) wrote:
>>> Paul Kapinos originally reported this issue on the OMPI users list.
>>>
>>> He is showing a stack trace from OMPI-1.7.3, which uses hwloc 1.5.2
>>> (note that OMPI 1.7.4 will use hwloc 1.7.2).
>>>
>>> I tried to read the xml file he provided with the git hwloc master
>>> HEAD, and it fails:
>>>
>>> -----
>>> ❯❯❯ ./utils/lstopo -i lstopo_linuxitvc00.xml
>>> ignoring depth attribute for object type without depth
>>> ignoring depth attribute for object type without depth
>>> XML component discovery failed.
>>> hwloc_topology_load() failed (Invalid argument).
>>> -----
>>>
>>> Any idea what's happening here?
>>>
>>> BTW, I can apply the fix to both the OMPI SVN trunk and v1.7 branch
>>> (since OMPI v1.7 is now up to hwloc 1.7.2).
>>>
>>>
>>>
>>> On Oct 31, 2013, at 1:28 PM, Paul Kapinos
>>> <kapinos_at_[hidden]> wrote:
>>>
>>> > Hello all,
>>> >
>>> > using 1.7.x (1.7.2 and 1.7.3 tested), we get a SIGSEGV from somewhere
>>> > deep inside the 'hwloc' library - see the attached screenshot.
>>> >
>>> > Because the error is tied to just one single node, which in turn is
>>> > a rather special one (see output of 'lstopo -'), it smells like an
>>> > error in the 'hwloc' library.
>>> >
>>> > Is there a way to disable hwloc or to debug it somehow?
>>> > (other than building a debug version of hwloc and OpenMPI)
>>> >
>>> > Best
>>> >
>>> > Paul
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > Dipl.-Inform. Paul Kapinos - High Performance Computing,
>>> > RWTH Aachen University, Center for Computing and Communication
>>> > Seffenter Weg 23, D 52074 Aachen (Germany)
>>> > Tel: +49 241/80-24915
>>> >
>>> > <lstopo_linuxitvc00.txt> <opal_hwlock_SIGSEGV.png> <lstopo_linuxitvc00.xml>
>>> > _______________________________________________
>>> > users mailing list
>>> > users_at_[hidden]
>>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>>
>>> --
>>> Jeff Squyres
>>> jsquyres_at_[hidden]
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>
>>>
>>>
>>> _______________________________________________
>>> hwloc-users mailing list
>>> hwloc-users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
>>
>
>