Hardware Locality Development Mailing List Archives

Subject: Re: [hwloc-devel] Something lighter-weight than XML?
From: Brice Goglin (Brice.Goglin_at_[hidden])
Date: 2011-09-24 16:18:59


I fixed one parsing bug in commit 3660 on the v1.2-ompi branch. Things
should work better now.

Parsing XML distance matrices was broken when the XML file came from the
no-libxml exporter. That's why you had problems on your dual-AMD machine
(those have distance matrices) and not on your Mac (single processor, no
distances, no bug).

The v1.2 branch doesn't report parsing failure well, so it just crashed.
Trunk exits with an error instead of crashing.

Brice

On 24/09/2011 20:37, Ralph Castain wrote:
> Yep, it fails. Runs on my Mac, but not under Linux.
>
> Program terminated with signal 11, Segmentation fault.
> #0 0x00002aaaaacdbedd in hwloc_bitmap_snprintf () from /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
> (gdb) where
> #0 0x00002aaaaacdbedd in hwloc_bitmap_snprintf () from /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
> #1 0x00002aaaaacdc060 in hwloc_bitmap_asprintf () from /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
> #2 0x00002aaaaacd9b34 in hwloc__xml_export_object () from /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
> #3 0x00002aaaaacda325 in hwloc___nolibxml_prepare_export () from /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
> #4 0x00002aaaaacda39c in hwloc__nolibxml_prepare_export () from /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
> #5 0x00002aaaaacda4be in hwloc_topology_export_xmlbuffer () from /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
> #6 0x00000000004009b8 in main () at xmlbuffer.c:31
>
> On Sep 24, 2011, at 9:45 AM, Brice Goglin wrote:
>
>> Indeed, this object contains invalid pointers.
>>
>> Can you try to run tests/xmlbuffer.c from hwloc's tree? It does
>> export+import+export+compare on the same machine. It would be good to
>> know if it fails on one of the machines you're using here.
>>
>> https://svn.open-mpi.org/trac/hwloc/browser/branches/v1.2-ompi/tests/xmlbuffer.c?rev=3837&format=txt
>>
>> thanks
>> Brice
>>
>>
>>
>> On 24/09/2011 17:07, Ralph Castain wrote:
>>> FWIW: I tried just printing out the contents of that root object immediately after importing the xml, and it clearly has a problem:
>>>
>>> (gdb) print *obj
>>> $2 = {type = OPAL_HWLOC122_hwloc_OBJ_SYSTEM, os_index = 0, name = 0x101 <Address 0x101 out of bounds>, memory = {
>>> total_memory = 46912502995240, local_memory = 46912502995240, page_types_len = 0, page_types = 0x0}, attr = 0x2,
>>> depth = 6900112, logical_index = 0, os_level = 6571424, next_cousin = 0x0, prev_cousin = 0xffffffff, parent = 0x0,
>>> sibling_rank = 0, next_sibling = 0x0, prev_sibling = 0x0, arity = 145, children = 0x2aaaab139738,
>>> first_child = 0x2aaaab139738, last_child = 0x0, userdata = 0x0, cpuset = 0x0, complete_cpuset = 0x0,
>>> online_cpuset = 0x644700, allowed_cpuset = 0x691970, nodeset = 0x6919e0, complete_nodeset = 0x644c90,
>>> allowed_nodeset = 0x644cb0, distances = 0x6948b0, distances_count = 6900000, infos = 0x0, infos_count = 0}
>>>
>>>
>>> On Sep 24, 2011, at 9:02 AM, Ralph Castain wrote:
>>>
>>>> Here's the trace:
>>>>
>>>> #0 0x00002aaaaae61737 in hwloc__xml_export_object (output=0x7fffffffd890, topology=0x695f10, obj=0x2aaaab139b28)
>>>> at topology-xml.c:1094
>>>> #1 0x00002aaaaae61b69 in hwloc___nolibxml_prepare_export (topology=0x695f10,
>>>> xmlbuffer=0x698a70 "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE topology SYSTEM \"hwloc.dtd\">\n<topology>\n <object type=\"Unknown\" os_level=\"-1424778408\" os_index=\"10922\" cpuset=\"0xf...f\" complete_cpuset=\"0xf...f\" onl"...,
>>>> buflen=16384) at topology-xml.c:1193
>>>> #2 0x00002aaaaae61be0 in hwloc__nolibxml_prepare_export (topology=0x695f10, bufferp=0x7fffffffd988, buflenp=0x7fffffffd97c)
>>>> at topology-xml.c:1207
>>>> #3 0x00002aaaaae61d02 in opal_hwloc122_hwloc_topology_export_xmlbuffer (topology=0x695f10, xmlbuffer=0x7fffffffd988,
>>>> buflen=0x7fffffffd97c) at topology-xml.c:1281
>>>> #4 0x00002aaaaae529f4 in opal_hwloc_compare (topo1=0x695f10, topo2=0x6915c0, type=22 '\026') at base/hwloc_base_dt.c:183
>>>> #5 0x00002aaaaadf348c in opal_dss_compare (value1=0x695f10, value2=0x6915c0, type=22 '\026') at dss/dss_compare.c:39
>>>> #6 0x00002aaaaad9b5f7 in process_orted_launch_report (fd=-1, event=1, data=0x6444d0) at base/plm_base_launch_support.c:564
>>>> #7 0x00002aaaaae3881f in event_process_active_single_queue (base=0x60dd60, activeq=0x6111e0) at event.c:1329
>>>> #8 0x00002aaaaae38c71 in event_process_active (base=0x60dd60) at event.c:1396
>>>> #9 0x00002aaaaae3902b in opal_libevent2012_event_base_loop (base=0x60dd60, flags=1) at event.c:1598
>>>> #10 0x00002aaaaadf080d in opal_progress () at runtime/opal_progress.c:189
>>>> #11 0x00002aaaaad9bbfa in orte_plm_base_daemon_callback (num_daemons=2) at base/plm_base_launch_support.c:666
>>>> #12 0x00002aaaaada49e1 in plm_slurm_launch_job (jdata=0x67a500) at plm_slurm_module.c:404
>>>> #13 0x0000000000403822 in orterun (argc=4, argv=0x7fffffffe1d8) at orterun.c:817
>>>> #14 0x0000000000402aa3 in main (argc=4, argv=0x7fffffffe1d8) at main.c:13
>>>>
>>>> And the error report
>>>>
>>>> Program received signal SIGSEGV, Segmentation fault.
>>>> 0x00002aaaaae61737 in hwloc__xml_export_object (output=0x7fffffffd890, topology=0x695f10, obj=0x2aaaab139b28)
>>>> at topology-xml.c:1094
>>>> 1094 sprintf(tmp, "%llu", (unsigned long long) obj->memory.page_types[i].count);
>>>> (gdb) print obj
>>>> $1 = (opal_hwloc122_hwloc_obj_t) 0x2aaaab139b28
>>>> (gdb) print *obj
>>>> $2 = {type = 2870188824, os_index = 10922, name = 0x2aaaab139b18 "\b\233\023\253\252*", memory = {total_memory = 6579376,
>>>> local_memory = 6579376, page_types_len = 2870188856, page_types = 0x2aaaab139b38}, attr = 0x2aaaab139b48,
>>>> depth = 2870188872, logical_index = 10922, os_level = -1424778408, next_cousin = 0x2aaaab139b58,
>>>> prev_cousin = 0x2aaaab139b68, parent = 0x2aaaab139b68, sibling_rank = 2870188920, next_sibling = 0x2aaaab139b78,
>>>> prev_sibling = 0x2aaaab139b88, arity = 2870188936, children = 0x2aaaab139b98, first_child = 0x2aaaab139b98,
>>>> last_child = 0x2aaaab139ba8, userdata = 0x2aaaab139ba8, cpuset = 0x2aaaab139bb8, complete_cpuset = 0x2aaaab139bb8,
>>>> online_cpuset = 0x2aaaab139bc8, allowed_cpuset = 0x2aaaab139bc8, nodeset = 0x2aaaab139bd8,
>>>> complete_nodeset = 0x2aaaab139bd8, allowed_nodeset = 0x2aaaab139be8, distances = 0x2aaaab139be8,
>>>> distances_count = 2870189048, infos = 0x2aaaab139bf8, infos_count = 2870189064}
>>>> (gdb) print obj->memory
>>>> $3 = {total_memory = 6579376, local_memory = 6579376, page_types_len = 2870188856, page_types = 0x2aaaab139b38}
>>>> (gdb) print obj->memory.page_types
>>>> $4 = (struct opal_hwloc122_hwloc_obj_memory_page_type_s *) 0x2aaaab139b38
>>>> (gdb) print i
>>>> $5 = 1612
>>>> (gdb) print obj->memory.page_types[1600]
>>>> $6 = {size = 0, count = 0}
>>>> (gdb) print obj->memory.page_types[1612]
>>>> Cannot access memory at address 0x2aaaab13fff8
>>>> (gdb) print obj->memory.page_types[1611]
>>>> $7 = {size = 0, count = 0}
>>>> (gdb)
>>>>
>>>>
>>>> The whole obj looks like trash to me. I looked a little more - the object referenced is the root object:
>>>>
>>>> 1193 hwloc__xml_export_object (&output, topology, hwloc_get_root_obj(topology));
>>>>
>>>> I'm continuing to look in case I'm doing something stupid, but the code is pretty linear here - unpack, import, export for compare.
>>>>
>>>>
>>>> On Sep 24, 2011, at 8:59 AM, Jeff Squyres wrote:
>>>>
>>>>> Here's some feedback from Ralph -- any idea what's going wrong here?
>>>>>
>>>>> -----
>>>>>
>>>>> 1. I export a topology into xml using
>>>>>
>>>>> hwloc_topology_export_xmlbuffer(t, &xmlbuffer, &len);
>>>>>
>>>>> I then pack and send the string.
>>>>>
>>>>> 2. I unpack the string on the other end and import it into a topology
>>>>>
>>>>> hwloc_topology_init(&t);
>>>>> if (0 != (rc = hwloc_topology_set_xmlbuffer(t, xmlbuffer, strlen(xmlbuffer)))) {
>>>>>     hwloc_topology_destroy(t);
>>>>>     goto cleanup;
>>>>> }
>>>>> hwloc_topology_load(t);
>>>>>
>>>>> 3. I then need to compare two topologies, so I export the topology I received into another xml string
>>>>>
>>>>> hwloc_topology_export_xmlbuffer(t1, &x1, &l1);
>>>>>
>>>>> It is this export that fails, which implies to me that somehow the import didn't work right. Note that this code worked fine with libxml2, so this is a regression.
>>>>>
>>>>>
>>>>> On Sep 22, 2011, at 9:39 AM, Jeff Squyres wrote:
>>>>>
>>>>>> Yes, I can get some testing of the ompi branch pretty quickly. I can bring in a new copy of this later today and see what we can see.
>>>>>>
>>>>>> Many thanks!
>>>>>>
>>>>>>
>>>>>> On Sep 19, 2011, at 9:05 AM, Brice Goglin wrote:
>>>>>>
>>>>>>> I pushed the new minimalistic XML import/export implementation without
>>>>>>> libxml2 to the nolibxml branch. If libxml2 is available, it's still used
>>>>>>> by default. --disable-libxml2 or some env variables can be used to
>>>>>>> force the minimalistic implementation if needed. The minimalistic
>>>>>>> implementation is only guaranteed to import XML files that were generated
>>>>>>> by hwloc (even if libxml was enabled there).
>>>>>>>
>>>>>>> I also backported most of this to the new v1.2-ompi branch (required to
>>>>>>> backport some other XML cleanups from trunk). This branch will now serve
>>>>>>> as a base for Open MPI's embedded hwloc. The idea is to have a complete
>>>>>>> v1.2 + nolibxml somewhere so that we can at least run make check (Open
>>>>>>> MPI does not embed enough to run hwloc's make check).
>>>>>>>
>>>>>>> How do we proceed now? Can we have the OMPI guys test the new code soon?
>>>>>>> Should I wait for their feedback before merging the nolibxml branch into
>>>>>>> the trunk? I'd like to merge this in v1.3 too (and basically release rc2
>>>>>>> as the actual first feature-complete RC), so getting feedback early
>>>>>>> might be appreciated.
>>>>>>>
>>>>>>> Brice
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> hwloc-devel mailing list
>>>>>>> hwloc-devel_at_[hidden]
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
>>>>>> --
>>>>>> Jeff Squyres
>>>>>> jsquyres_at_[hidden]
>>>>>> For corporate legal information go to:
>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>
>>>>>>