Hardware Locality Development Mailing List Archives

Subject: [hwloc-devel] Fwd: hwloc problem
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2011-11-03 10:50:29


Somehow Chris' mail didn't make it back to the list (perhaps it got rejected if he's not subscribed).

Begin forwarded message:

> From: Christopher Yeoh <cyeoh_at_[hidden]>
> Date: November 3, 2011 2:59:34 AM EDT
> To: Jeff Squyres <jsquyres_at_[hidden]>
> Cc: Hardware locality development list <hwloc-devel_at_[hidden]>, Brad Benton <brad.benton_at_[hidden]>
> Subject: Re: [hwloc-devel] hwloc problem
>
> Hi Jeff,
>
> The patch fixes the crash for me. Thanks Brice!
>
> Regards,
>
> Chris
>
> On Wed, 2 Nov 2011 10:23:32 -0400
> Jeff Squyres <jsquyres_at_[hidden]> wrote:
>
>> Chris --
>>
>> Can you verify the attached patch? If so, I'll commit it to the SVN
>> trunk and the pending OMPI v1.5 patch.
>>
>>
>> On Nov 2, 2011, at 10:05 AM, Brice Goglin wrote:
>>
>>> If we can't find any other way, filtering (during export) would be
>>> an easy solution.
>>>
>>> For the v1.2 branch, the attached patch seems to help. It just
>>> prevents the creation of internal matrices with an invalid relative
>>> depth. No internal matrix means no XML export, which means you
>>> don't break your import.
>>>
>>> Brice
>>>
>>>
>>>
>>>
>>> On 02/11/2011 14:59, Jeff Squyres wrote:
>>>> Should we just filter out the "distance" attribute in the XML on
>>>> the v1.2ompi branch? We're not using it (yet) in OMPI.
>>>>
>>>> On Nov 2, 2011, at 9:32 AM, Brice Goglin wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> The v1.2 branch has known problems with distance matrices when
>>>>> the topology is asymmetric (especially when Linux cpusets make
>>>>> some NUMA nodes CPU-less). This is what causes the wrong
>>>>> relative_depth here. It can even be negative in some cases, which
>>>>> is obviously wrong.
>>>>>
>>>>> This should be fixed in v1.3 but it's NOT easy to backport to
>>>>> v1.2. Can you check that you can export and reimport with v1.3
>>>>> properly? I will see if I can find a workaround for v1.2, but it
>>>>> will likely be something like ignoring distance matrices if
>>>>> reldepth is <= 0.
>>>>>
>>>>> In the meantime, you can remove "&& reldepth" from the "if" line
>>>>> below. It may help.
>>>>>
>>>>> Brice
>>>>>
>>>>>
>>>>>
>>>>> On 02/11/2011 13:42, Jeff Squyres (jsquyres) wrote:
>>>>>>>> Hi Jeff,
>>>>>>>>
>>>>>>>> Brad mentioned you might be able to help me with an OMPI hwloc
>>>>>>>> issue I'm having.
>>>>>>>>
>>>>>>>> It's occurring on a Power 5 RHEL 6.0 machine and is related to
>>>>>>>> the XML representation of the topology. I've attached the XML
>>>>>>>> to this email. The problem only occurs on the trunk code.
>>>>>>>>
>>>>>>>> The part which appears to be the problem is this:
>>>>>>>>
>>>>>>>> <distances nbobjs="4" relative_depth="0" latency_base="10.000000">
>>>>>>>> <latency value="1.000000"/>
>>>>>>>> <latency value="1.000000"/>
>>>>>>>> <latency value="1.000000"/>
>>>>>>>> <latency value="1.000000"/>
>>>>>>>> <latency value="1.000000"/>
>>>>>>>> <latency value="1.000000"/>
>>>>>>>> <latency value="1.000000"/>
>>>>>>>> <latency value="1.000000"/>
>>>>>>>> <latency value="1.000000"/>
>>>>>>>> <latency value="1.000000"/>
>>>>>>>> <latency value="1.000000"/>
>>>>>>>> <latency value="1.000000"/>
>>>>>>>> <latency value="1.000000"/>
>>>>>>>> <latency value="1.000000"/>
>>>>>>>> <latency value="1.000000"/>
>>>>>>>> <latency value="1.000000"/>
>>>>>>>> </distances>
>>>>>>>>
>>>>>>>> specifically, relative_depth has a value of 0 but the element
>>>>>>>> still has latency children. In hwloc__xml_import_distances in
>>>>>>>> topology-xml.c there's a check that assumes there is no
>>>>>>>> latency information in that case.
>>>>>>>>
>>>>>>>> Around line 634 in topology-xml.c:
>>>>>>>>
>>>>>>>> if (nbobjs && reldepth && latbase) {
>>>>>>>>   /* ... process latency xml nodes ... */
>>>>>>>> }
>>>>>>>>
>>>>>>>> return hwloc__xml_import_close_tag(state);
>>>>>>>>
>>>>>>>> The hwloc__xml_import_close_tag function returns a failure
>>>>>>>> because the latency nodes have not been processed yet.
>>>>>>>>
>>>>>>>> I had a look in orted where the xml is created and it does
>>>>>>>> look like the xml is being assembled correctly as per the
>>>>>>>> topology information it has retrieved (though I don't know if
>>>>>>>> that itself is correct). The hwloc__xml_export_object function
>>>>>>>> will quite happily create distance information if the relative
>>>>>>>> depth is 0 even though hwloc__xml_import_distances will not be
>>>>>>>> able to parse it.
>>>>>>>>
>>>>>>>> So there is at least a problem that the topology code will
>>>>>>>> create xml that it can't parse, but I don't know enough about
>>>>>>>> the hwloc library to know if relative depth should always be
>>>>>>>> positive. I suspect it's the former that is the problem, not
>>>>>>>> the latter, but I don't know for sure...
>>>>>>>>
>>>>>>>> If it helps, this is the output of lstopo on the machine:
>>>>>>>>
>>>>>>>> cyeoh_at_p5-40-P4-E0:~$ /home/OpenHPC/hwloc/build/bin/lstopo
>>>>>>>> Machine (2048MB)
>>>>>>>>   NUMANode L#0 (P#0 512MB)
>>>>>>>>     Socket L#0 + L1 L#0 (32KB) + Core L#0
>>>>>>>>       PU L#0 (P#0)
>>>>>>>>       PU L#1 (P#1)
>>>>>>>>     Socket L#1 + L1 L#1 (32KB) + Core L#1
>>>>>>>>       PU L#2 (P#2)
>>>>>>>>       PU L#3 (P#3)
>>>>>>>>   NUMANode L#1 (P#1 640MB)
>>>>>>>>   NUMANode L#2 (P#2 512MB)
>>>>>>>>   NUMANode L#3 (P#3 384MB)
>>>>> _______________________________________________
>>>>> hwloc-devel mailing list
>>>>> hwloc-devel_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
>>>>
>>>
>>> <ignore_invalid_reldepth.patch>
>>
>>
>
>
>
> --
> cyeoh_at_[hidden]

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/