Hardware Locality Development Mailing List Archives

Subject: Re: [hwloc-devel] Something lighter-weight than XML?
From: Brice Goglin (Brice.Goglin_at_[hidden])
Date: 2011-09-24 11:45:19


Indeed, this object contains invalid pointers.

Can you try to run tests/xmlbuffer.c from hwloc's tree? It does
export+import+export+compare on the same machine. It would be good to
know if it fails on one of the machines you're using here.

https://svn.open-mpi.org/trac/hwloc/browser/branches/v1.2-ompi/tests/xmlbuffer.c?rev=3837&format=txt
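
For anyone following along, the export+import+export+compare idea can be sketched without hwloc at all. The block below is only a toy under stated assumptions: `struct toy_topo`, `toy_export`, `toy_import` and `toy_roundtrip_ok` are hypothetical names invented for illustration, not hwloc functions, and the one-line format is not hwloc's XML. It only shows the shape of the check, which should be similar in spirit to what the test does with the real export/import calls.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy stand-in for a topology. This is NOT an hwloc type. */
struct toy_topo {
    unsigned os_index;
    unsigned long long total_memory;
};

/* Export to a text buffer. Returns 0 on success, -1 if buf is too small. */
int toy_export(const struct toy_topo *t, char *buf, size_t buflen)
{
    int n = snprintf(buf, buflen,
                     "<object os_index=\"%u\" total_memory=\"%llu\"/>",
                     t->os_index, t->total_memory);
    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}

/* Import from text. Returns 0 only if both fields were parsed. */
int toy_import(struct toy_topo *t, const char *buf)
{
    memset(t, 0, sizeof(*t));
    return sscanf(buf, "<object os_index=\"%u\" total_memory=\"%llu\"/>",
                  &t->os_index, &t->total_memory) == 2 ? 0 : -1;
}

/* export + import + export + compare. Returns 1 if both exports match. */
int toy_roundtrip_ok(const struct toy_topo *orig)
{
    char first[128], second[128];
    struct toy_topo copy;

    assert(orig != NULL);
    if (toy_export(orig, first, sizeof(first)) != 0)
        return 0;
    if (toy_import(&copy, first) != 0)
        return 0;
    if (toy_export(&copy, second, sizeof(second)) != 0)
        return 0;
    /* If import lost or corrupted anything, the second export differs. */
    return strcmp(first, second) == 0;
}
```

Comparing the two exported strings, rather than the in-memory objects, means a broken import shows up as a mismatch (or an import error) instead of requiring a field-by-field walk of the tree.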

thanks
Brice

On 24/09/2011 17:07, Ralph Castain wrote:
> FWIW: I tried printing out the contents of that root object immediately after importing the XML, and it clearly has a problem:
>
> (gdb) print *obj
> $2 = {type = OPAL_HWLOC122_hwloc_OBJ_SYSTEM, os_index = 0, name = 0x101 <Address 0x101 out of bounds>, memory = {
> total_memory = 46912502995240, local_memory = 46912502995240, page_types_len = 0, page_types = 0x0}, attr = 0x2,
> depth = 6900112, logical_index = 0, os_level = 6571424, next_cousin = 0x0, prev_cousin = 0xffffffff, parent = 0x0,
> sibling_rank = 0, next_sibling = 0x0, prev_sibling = 0x0, arity = 145, children = 0x2aaaab139738,
> first_child = 0x2aaaab139738, last_child = 0x0, userdata = 0x0, cpuset = 0x0, complete_cpuset = 0x0,
> online_cpuset = 0x644700, allowed_cpuset = 0x691970, nodeset = 0x6919e0, complete_nodeset = 0x644c90,
> allowed_nodeset = 0x644cb0, distances = 0x6948b0, distances_count = 6900000, infos = 0x0, infos_count = 0}
>
>
> On Sep 24, 2011, at 9:02 AM, Ralph Castain wrote:
>
>> Here's the trace:
>>
>> #0 0x00002aaaaae61737 in hwloc__xml_export_object (output=0x7fffffffd890, topology=0x695f10, obj=0x2aaaab139b28)
>> at topology-xml.c:1094
>> #1 0x00002aaaaae61b69 in hwloc___nolibxml_prepare_export (topology=0x695f10,
>> xmlbuffer=0x698a70 "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE topology SYSTEM \"hwloc.dtd\">\n<topology>\n <object type=\"Unknown\" os_level=\"-1424778408\" os_index=\"10922\" cpuset=\"0xf...f\" complete_cpuset=\"0xf...f\" onl"...,
>> buflen=16384) at topology-xml.c:1193
>> #2 0x00002aaaaae61be0 in hwloc__nolibxml_prepare_export (topology=0x695f10, bufferp=0x7fffffffd988, buflenp=0x7fffffffd97c)
>> at topology-xml.c:1207
>> #3 0x00002aaaaae61d02 in opal_hwloc122_hwloc_topology_export_xmlbuffer (topology=0x695f10, xmlbuffer=0x7fffffffd988,
>> buflen=0x7fffffffd97c) at topology-xml.c:1281
>> #4 0x00002aaaaae529f4 in opal_hwloc_compare (topo1=0x695f10, topo2=0x6915c0, type=22 '\026') at base/hwloc_base_dt.c:183
>> #5 0x00002aaaaadf348c in opal_dss_compare (value1=0x695f10, value2=0x6915c0, type=22 '\026') at dss/dss_compare.c:39
>> #6 0x00002aaaaad9b5f7 in process_orted_launch_report (fd=-1, event=1, data=0x6444d0) at base/plm_base_launch_support.c:564
>> #7 0x00002aaaaae3881f in event_process_active_single_queue (base=0x60dd60, activeq=0x6111e0) at event.c:1329
>> #8 0x00002aaaaae38c71 in event_process_active (base=0x60dd60) at event.c:1396
>> #9 0x00002aaaaae3902b in opal_libevent2012_event_base_loop (base=0x60dd60, flags=1) at event.c:1598
>> #10 0x00002aaaaadf080d in opal_progress () at runtime/opal_progress.c:189
>> #11 0x00002aaaaad9bbfa in orte_plm_base_daemon_callback (num_daemons=2) at base/plm_base_launch_support.c:666
>> #12 0x00002aaaaada49e1 in plm_slurm_launch_job (jdata=0x67a500) at plm_slurm_module.c:404
>> #13 0x0000000000403822 in orterun (argc=4, argv=0x7fffffffe1d8) at orterun.c:817
>> #14 0x0000000000402aa3 in main (argc=4, argv=0x7fffffffe1d8) at main.c:13
>>
>> And the error report:
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> 0x00002aaaaae61737 in hwloc__xml_export_object (output=0x7fffffffd890, topology=0x695f10, obj=0x2aaaab139b28)
>> at topology-xml.c:1094
>> 1094 sprintf(tmp, "%llu", (unsigned long long) obj->memory.page_types[i].count);
>> (gdb) print obj
>> $1 = (opal_hwloc122_hwloc_obj_t) 0x2aaaab139b28
>> (gdb) print *obj
>> $2 = {type = 2870188824, os_index = 10922, name = 0x2aaaab139b18 "\b\233\023\253\252*", memory = {total_memory = 6579376,
>> local_memory = 6579376, page_types_len = 2870188856, page_types = 0x2aaaab139b38}, attr = 0x2aaaab139b48,
>> depth = 2870188872, logical_index = 10922, os_level = -1424778408, next_cousin = 0x2aaaab139b58,
>> prev_cousin = 0x2aaaab139b68, parent = 0x2aaaab139b68, sibling_rank = 2870188920, next_sibling = 0x2aaaab139b78,
>> prev_sibling = 0x2aaaab139b88, arity = 2870188936, children = 0x2aaaab139b98, first_child = 0x2aaaab139b98,
>> last_child = 0x2aaaab139ba8, userdata = 0x2aaaab139ba8, cpuset = 0x2aaaab139bb8, complete_cpuset = 0x2aaaab139bb8,
>> online_cpuset = 0x2aaaab139bc8, allowed_cpuset = 0x2aaaab139bc8, nodeset = 0x2aaaab139bd8,
>> complete_nodeset = 0x2aaaab139bd8, allowed_nodeset = 0x2aaaab139be8, distances = 0x2aaaab139be8,
>> distances_count = 2870189048, infos = 0x2aaaab139bf8, infos_count = 2870189064}
>> (gdb) print obj->memory
>> $3 = {total_memory = 6579376, local_memory = 6579376, page_types_len = 2870188856, page_types = 0x2aaaab139b38}
>> (gdb) print obj->memory.page_types
>> $4 = (struct opal_hwloc122_hwloc_obj_memory_page_type_s *) 0x2aaaab139b38
>> (gdb) print i
>> $5 = 1612
>> (gdb) print obj->memory.page_types[1600]
>> $6 = {size = 0, count = 0}
>> (gdb) print obj->memory.page_types[1612]
>> Cannot access memory at address 0x2aaaab13fff8
>> (gdb) print obj->memory.page_types[1611]
>> $7 = {size = 0, count = 0}
>> (gdb)
>>
>>
>> The whole obj looks like trash to me. I looked a little more - the object referenced is the root object:
>>
>> 1193 hwloc__xml_export_object (&output, topology, hwloc_get_root_obj(topology));
>>
>> I'm continuing to look in case I'm doing something stupid, but the code is pretty linear here - unpack, import, export for compare.
>>
>>
>> On Sep 24, 2011, at 8:59 AM, Jeff Squyres wrote:
>>
>>> Here's some feedback from Ralph -- any idea what's going wrong here?
>>>
>>> -----
>>>
>>> 1. I export a topology into XML using
>>>
>>> hwloc_topology_export_xmlbuffer(t, &xmlbuffer, &len);
>>>
>>> I then pack and send the string.
>>>
>>> 2. I unpack the string on the other end and import it into a topology:
>>>
>>> hwloc_topology_init(&t);
>>> if (0 != (rc = hwloc_topology_set_xmlbuffer(t, xmlbuffer, strlen(xmlbuffer)))) {
>>>     hwloc_topology_destroy(t);
>>>     goto cleanup;
>>> }
>>> hwloc_topology_load(t);
>>>
>>> 3. I then need to compare two topologies, so I export the topology I received into another XML string:
>>>
>>> hwloc_topology_export_xmlbuffer(t1, &x1, &l1);
>>>
>>> It is this export that fails, which implies to me that somehow the import didn't work right. Note that this code worked fine with libxml2, so this is a regression.
>>>
>>>
>>> On Sep 22, 2011, at 9:39 AM, Jeff Squyres wrote:
>>>
>>>> Yes, I can get some testing of the ompi branch pretty quickly. I can bring in a new copy of this later today and see what we can see.
>>>>
>>>> Many thanks!
>>>>
>>>>
>>>> On Sep 19, 2011, at 9:05 AM, Brice Goglin wrote:
>>>>
>>>>> I pushed the new minimalistic XML import/export implementation without
>>>>> libxml2 to the nolibxml branch. If libxml2 is available, it's still used
>>>>> by default. --disable-libxml2 or some env variables can be used to
>>>>> force the minimalistic implementation if needed. The minimalistic
>>>>> implementation is only guaranteed to import XML files that were
>>>>> generated by hwloc (even if libxml2 was enabled there).
>>>>>
>>>>> I also backported most of this to the new v1.2-ompi branch (required to
>>>>> backport some other XML cleanups from trunk). This branch will now serve
>>>>> as a base for Open MPI's embedded hwloc. The idea is to have a complete
>>>>> v1.2 + nolibxml somewhere so that we can at least run make check (Open
>>>>> MPI does not embed enough to run hwloc's make check).
>>>>>
>>>>> How do we proceed now? Can we have the OMPI guys test the new code soon?
>>>>> Should I wait for their feedback before merging the nolibxml branch into
>>>>> the trunk? I'd like to merge this in v1.3 too (and basically release rc2
>>>>> as the actual first feature-complete RC), so getting feedback early
>>>>> might be appreciated.
>>>>>
>>>>> Brice
>>>>>
>>>>> _______________________________________________
>>>>> hwloc-devel mailing list
>>>>> hwloc-devel_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
>>>>
>>>> --
>>>> Jeff Squyres
>>>> jsquyres_at_[hidden]
>>>> For corporate legal information go to:
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>
>>> --
>>> Jeff Squyres
>>> jsquyres_at_[hidden]
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>