Subject: Re: [hwloc-devel] xml file load incompatibilities
From: Brice Goglin (Brice.Goglin_at_[hidden])
Date: 2013-09-20 19:06:15


Try adding HWLOC_DEBUG_CHECK=1 to your environment; it enables many
assertions at the end of hwloc_topology_load().

Brice
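
A minimal sketch of the suggestion above as a shell helper. The function name with_hwloc_check is hypothetical and is not part of hwloc or Open MPI; it only sets HWLOC_DEBUG_CHECK=1 for the single command it runs, so the rest of the shell session is left unchanged.

```shell
# Hypothetical helper: run one command with hwloc's debug checks enabled.
# The variable is passed only to that command's environment.
with_hwloc_check() {
    HWLOC_DEBUG_CHECK=1 "$@"
}

# Example usage (file name taken from this thread; adjust the path):
#   with_hwloc_check lstopo -i topo.xml
```

For example, `with_hwloc_check lstopo -i topo.xml` would load the failing file from this thread with the checks enabled, assuming lstopo is on the PATH and topo.xml is in the current directory.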

On 21/09/2013 01:03, Ralph Castain wrote:
> I didn't try loading it with lstopo - just tried the OMPI trunk. It
> loads okay, but segfaults when you try to find an object by depth:
>
> #0 0x00000001005fe5dc in opal_hwloc172_hwloc_get_obj_by_depth
> (topology=Cannot access memory at address 0xfffffffffffffff7
> ) at traversal.c:623
> #1 0x0000000100b6dfaa in opal_hwloc172_hwloc_get_root_obj
> (topology=Cannot access memory at address 0xfffffffffffffff7
> ) at rmaps_rr_mappers.c:747
> #2 0x0000000100b6e139 in orte_rmaps_rr_byslot (jdata=Cannot access
> memory at address 0xffffffffffffff77
> ) at rmaps_rr_mappers.c:774
> #3 0x0000000100b6d6da in orte_rmaps_rr_map (jdata=Cannot access
> memory at address 0xffffffffffffff17
> ) at rmaps_rr.c:211
> #4 0x0000000100353098 in orte_rmaps_base_map_job (fd=Cannot access
> memory at address 0xfffffffffffffe7b
> ) at base/rmaps_base_map_job.c:320
> #5 0x00000001005ce28c in event_process_active_single_queue
> (base=Cannot access memory at address 0xffffffffffffffe7
> ) at event.c:1367
> #6 0x00000001005ce500 in event_process_active (base=Cannot access
> memory at address 0xffffffffffffffe7
> ) at event.c:1437
> #7 0x00000001005ceb71 in opal_libevent2021_event_base_loop
> (base=Cannot access memory at address 0xffffffffffffffb7
> ) at event.c:1645
> #8 0x00000001002c5158 in orterun (argc=Cannot access memory at
> address 0xfffffffffffffd1b
> ) at orterun.c:3039
> #9 0x00000001002c32a4 in main (argc=Cannot access memory at address
> 0xfffffffffffffffb
> ) at main.c:14
>
> Looks to me like memory may be getting hosed
>
>
> On Sep 20, 2013, at 2:59 PM, Brice Goglin <Brice.Goglin_at_[hidden]> wrote:
>
>> I can't see any segfault. Where does the segfault occur for you? In
>> OMPI only (or lstopo too)? When loading or when using the topology?
>>
>> I tried lstopo on that file with and without HWLOC_NO_LIBXML_IMPORT=1
>> (in case the bug is in one of the XML backends), and it looks OK.
>>
>> Brice
>>
>>
>>
>>
>>
>> On 20/09/2013 23:53, Ralph Castain wrote:
>>> Here are the two files I tried - not from the same machine. foo.xml works; topo.xml segfaults.
>>>
>>>
>>>
>>>
>>> One of our users reported it from their machine, but I don't have their topo file.
>>>
>>> On Sep 20, 2013, at 2:41 PM, Brice Goglin <Brice.Goglin_at_[hidden]> wrote:
>>>
>>>> Hello,
>>>> I don't see any reason for such an incompatibility. But there are
>>>> many combinations; we can't test everything.
>>>> I can't reproduce that on my machines. Can you send the XML output of
>>>> both versions on one of your machines?
>>>> Brice
>>>>
>>>>
>>>>
>>>> On 20/09/2013 23:32, Ralph Castain wrote:
>>>>> Hi folks
>>>>>
>>>>> I've run across a rather strange behavior. We have two branches in OMPI - the devel trunk (using hwloc v1.7.2) and our feature release series (using hwloc 1.5.2). I have found the following:
>>>>>
>>>>> * the feature series can correctly load an xml file generated by lstopo of versions 1.5 or greater
>>>>>
>>>>> * the devel series can correctly load an xml file generated by lstopo of versions 1.7 or greater, but not files generated by prior versions. In the latter case, I segfault as soon as I try to use the loaded topology.
>>>>>
>>>>> Any ideas why the discrepancy? Can I at least detect the version used to create a file when loading it so I can error out instead of segfaulting?
>>>>>
>>>>> Ralph
>>>>>
>>>>> _______________________________________________
>>>>> hwloc-devel mailing list
>>>>> hwloc-devel_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel