Hardware Locality Development Mailing List Archives

Subject: Re: [hwloc-devel] xml file load incompatibilities
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-09-21 17:07:43


Okay, I found it - it was a sequencing problem in OMPI itself (we "set" the new topology too late in the setup sequence). Sorry for the false alarm.

Thanks for the help!
Ralph
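
The failure mode described above (a topology that is "set" too late, so earlier code reads a pointer that is not yet valid) can be guarded against with a checked accessor. The following is only a minimal sketch under stated assumptions: `topo_stub`, `set_topology` and `get_topology_depth` are hypothetical names invented for illustration, and they are neither OMPI nor hwloc code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a loaded topology; not an hwloc type. */
struct topo_stub {
    int nlevels;
};

/* Stays NULL until the setup sequence has "set" the topology. */
static struct topo_stub *current_topology = NULL;

/* Hypothetical setter, called once the topology has been loaded. */
static void set_topology(struct topo_stub *t)
{
    current_topology = t;
}

/* Returns 0 and stores the depth in *depth_out, or returns -1 when the
 * topology has not been set yet, so the caller can error out cleanly
 * instead of dereferencing an invalid pointer. */
static int get_topology_depth(int *depth_out)
{
    assert(depth_out != NULL);
    if (current_topology == NULL)
        return -1;
    *depth_out = current_topology->nlevels;
    return 0;
}
```

A caller that runs before `set_topology()` gets -1 back and can report a setup-ordering error rather than crash somewhere deeper in the mapping code.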

On Sep 20, 2013, at 11:36 PM, Brice Goglin <Brice.Goglin_at_[hidden]> wrote:

> Strange, the backtrace below looks totally crazy. I don't see how the debug checks could still pass in that case.
> Any chance you could run that thing under valgrind?
>
> Brice
>
>
>
> On 21/09/2013 01:09, Ralph Castain wrote:
>> Hmmm...nope, not a peep (no extra output at all). Just segfaulted like before.
>>
>> On Sep 20, 2013, at 4:06 PM, Brice Goglin <Brice.Goglin_at_[hidden]> wrote:
>>
>>> Try adding HWLOC_DEBUG_CHECK=1 to your environment; it will enable many assertions at the end of hwloc_topology_load().
>>>
>>> Brice
>>>
>>>
>>>
>>> On 21/09/2013 01:03, Ralph Castain wrote:
>>>> I didn't try loading it with lstopo - just tried the OMPI trunk. It loads okay, but segfaults when you try to find an object by depth:
>>>>
>>>> #0 0x00000001005fe5dc in opal_hwloc172_hwloc_get_obj_by_depth (topology=Cannot access memory at address 0xfffffffffffffff7
>>>> ) at traversal.c:623
>>>> #1 0x0000000100b6dfaa in opal_hwloc172_hwloc_get_root_obj (topology=Cannot access memory at address 0xfffffffffffffff7
>>>> ) at rmaps_rr_mappers.c:747
>>>> #2 0x0000000100b6e139 in orte_rmaps_rr_byslot (jdata=Cannot access memory at address 0xffffffffffffff77
>>>> ) at rmaps_rr_mappers.c:774
>>>> #3 0x0000000100b6d6da in orte_rmaps_rr_map (jdata=Cannot access memory at address 0xffffffffffffff17
>>>> ) at rmaps_rr.c:211
>>>> #4 0x0000000100353098 in orte_rmaps_base_map_job (fd=Cannot access memory at address 0xfffffffffffffe7b
>>>> ) at base/rmaps_base_map_job.c:320
>>>> #5 0x00000001005ce28c in event_process_active_single_queue (base=Cannot access memory at address 0xffffffffffffffe7
>>>> ) at event.c:1367
>>>> #6 0x00000001005ce500 in event_process_active (base=Cannot access memory at address 0xffffffffffffffe7
>>>> ) at event.c:1437
>>>> #7 0x00000001005ceb71 in opal_libevent2021_event_base_loop (base=Cannot access memory at address 0xffffffffffffffb7
>>>> ) at event.c:1645
>>>> #8 0x00000001002c5158 in orterun (argc=Cannot access memory at address 0xfffffffffffffd1b
>>>> ) at orterun.c:3039
>>>> #9 0x00000001002c32a4 in main (argc=Cannot access memory at address 0xfffffffffffffffb
>>>> ) at main.c:14
>>>>
>>>> Looks to me like memory may be getting hosed
>>>>
>>>>
>>>> On Sep 20, 2013, at 2:59 PM, Brice Goglin <Brice.Goglin_at_[hidden]> wrote:
>>>>
>>>>> I can't see any segfault. Where does the segfault occur for you? In OMPI only (or in lstopo too)? When loading or when using the topology?
>>>>>
>>>>> I tried lstopo on that file with and without HWLOC_NO_LIBXML_IMPORT=1 (in case the bug is in one of the XML backends), and it looks OK.
>>>>>
>>>>> Brice
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 20/09/2013 23:53, Ralph Castain wrote:
>>>>>> Here are the two files I tried - not from the same machine. The foo.xml works, the topo.xml segfaults
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> One of our users reported it from their machine, but I don't have their topo file.
>>>>>>
>>>>>> On Sep 20, 2013, at 2:41 PM, Brice Goglin <Brice.Goglin_at_[hidden]> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>> I don't see any reason for such an incompatibility. But there are
>>>>>>> many combinations, we can't test everything.
>>>>>>> I can't reproduce that on my machines. Can you send the XML output of
>>>>>>> both versions on one of your machines?
>>>>>>> Brice
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 20/09/2013 23:32, Ralph Castain wrote:
>>>>>>>> Hi folks
>>>>>>>>
>>>>>>>> I've run across a rather strange behavior. We have two branches in OMPI - the devel trunk (using hwloc v1.7.2) and our feature release series (using hwloc 1.5.2). I have found the following:
>>>>>>>>
>>>>>>>> * the feature series can correctly load an xml file generated by lstopo of versions 1.5 or greater
>>>>>>>>
>>>>>>>> * the devel series can correctly load an xml file generated by lstopo of versions 1.7 or greater, but not files generated by prior versions. In the latter case, I segfault as soon as I try to use the loaded topology.
>>>>>>>>
>>>>>>>> Any ideas why the discrepancy? Can I at least detect the version used to create a file when loading it so I can error out instead of segfaulting?
>>>>>>>>
>>>>>>>> Ralph
>>>>>>>>
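
On the question of detecting the generating version so a loader can error out instead of segfaulting, one possible pre-check is to look at the XML text before handing it to hwloc. The sketch below assumes that a file may carry a `version` attribute on its `<topology>` root element. The thread does not say whether files written by the 1.5 or 1.7 series carry one, so a missing attribute is reported as "unknown" rather than treated as an error. `topo_xml_version` is a hypothetical helper, not an hwloc function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the value of the version attribute of the
 * <topology ...> root element into out.
 * Returns  1 if a version attribute was found,
 *          0 if the root element has no version attribute (unknown),
 *         -1 if no well-formed <topology ...> start tag is present. */
static int topo_xml_version(const char *xml, char *out, size_t outlen)
{
    assert(xml != NULL && out != NULL && outlen > 0);

    /* "<topology" does not match the "<!DOCTYPE topology" line. */
    const char *root = strstr(xml, "<topology");
    if (root == NULL)
        return -1;

    const char *end = strchr(root, '>');   /* end of the start tag */
    if (end == NULL)
        return -1;

    /* Only accept an attribute that sits inside the root start tag. */
    const char *attr = strstr(root, "version=\"");
    if (attr == NULL || attr > end)
        return 0;

    attr += strlen("version=\"");
    const char *quote = strchr(attr, '"');
    if (quote == NULL || quote > end)
        return -1;

    size_t len = (size_t)(quote - attr);
    if (len >= outlen)
        len = outlen - 1;                  /* truncate to fit the buffer */
    memcpy(out, attr, len);
    out[len] = '\0';
    return 1;
}
```

A caller could read the start of the file into a buffer, call `topo_xml_version()`, and refuse to load when the result is -1. How to treat the "unknown" case is a policy decision left to the caller.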
>>>>>>>> _______________________________________________
>>>>>>>> hwloc-devel mailing list
>>>>>>>> hwloc-devel_at_[hidden]
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel