Hello,

The v1.2 branch has known problems with distance matrices when the topology is asymmetric (especially when Linux cpusets make some NUMA nodes CPU-less). This is what causes the wrong relative_depth here. It can even be negative in some cases, which is obviously wrong.

This should be fixed in v1.3, but it's NOT easy to backport to v1.2. Can you check that you can export and reimport properly with v1.3? I will see if I can find a workaround for v1.2, but it will likely be something like ignoring distance matrices if reldepth is <= 0.

In the meantime, you can remove "&& reldepth" from the "if" line below. It may help.
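
To illustrate the kind of workaround I have in mind, here is a minimal standalone sketch. It is NOT the actual hwloc code: the struct, the enum and import_distances() are hypothetical names made up for this example, and the latency values are passed as a plain array instead of being read from XML nodes. The point is only that the latency children are always consumed when nbobjs and latency_base are set, and the matrix is simply dropped afterwards when reldepth is <= 0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the attributes of one <distances> element. */
struct distances_attrs {
    unsigned nbobjs;  /* number of objects covered by the matrix */
    int reldepth;     /* relative_depth; can be <= 0 on asymmetric v1.2 topologies */
    float latbase;    /* latency_base */
};

enum import_result { IMPORT_ERROR = -1, IMPORT_IGNORED = 0, IMPORT_KEPT = 1 };

/*
 * Hypothetical import routine. 'latencies' holds the nlat values of the
 * <latency> children. They are always walked, so nothing is left
 * unprocessed before the closing tag, but they are only copied into 'out'
 * when reldepth is strictly positive and 'out' is non-NULL.
 */
enum import_result
import_distances(const struct distances_attrs *a,
                 const float *latencies, size_t nlat, float *out)
{
    size_t expected, i;
    int keep;

    assert(a != NULL);
    expected = (size_t)a->nbobjs * a->nbobjs;
    keep = a->reldepth > 0 && out != NULL;

    /* Without nbobjs or latency_base, no latency child is expected. */
    if (a->nbobjs == 0 || a->latbase == 0.0f)
        return nlat == 0 ? IMPORT_IGNORED : IMPORT_ERROR;

    /* A square nbobjs x nbobjs matrix is required. */
    if (latencies == NULL || nlat != expected)
        return IMPORT_ERROR;

    /* Walk every child, whatever the value of reldepth. */
    for (i = 0; i < nlat; i++) {
        if (keep)
            out[i] = latencies[i];
    }

    return keep ? IMPORT_KEPT : IMPORT_IGNORED;
}
```

With the attributes from your XML (nbobjs=4, relative_depth=0, latency_base=10 and 16 latency values), this sketch returns IMPORT_IGNORED instead of an error, so the rest of the import could continue without the matrix.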

Brice



On 02/11/2011 13:42, Jeff Squyres (jsquyres) wrote:

> Hi Jeff,
>
> Brad mentioned you might be able to help me with an OMPI hwloc issue
> I'm having.
>
> It's occurring on a Power 5 RHEL 6.0 machine and is related to the XML
> representation of the topology. I've attached the XML to this email.
> The problem only occurs on the trunk code.
>
> The part which appears to be the problem is this:
>
>      <distances nbobjs="4" relative_depth="0" latency_base="10.000000">
>        <latency value="1.000000"/>
>        <latency value="1.000000"/>
>        <latency value="1.000000"/>
>        <latency value="1.000000"/>
>        <latency value="1.000000"/>
>        <latency value="1.000000"/>
>        <latency value="1.000000"/>
>        <latency value="1.000000"/>
>        <latency value="1.000000"/>
>        <latency value="1.000000"/>
>        <latency value="1.000000"/>
>        <latency value="1.000000"/>
>        <latency value="1.000000"/>
>        <latency value="1.000000"/>
>        <latency value="1.000000"/>
>        <latency value="1.000000"/>
>      </distances>
>
> specifically with relative_depth having a value of 0, but still having
> latency children information. In hwloc__xml_import_distances in
> topology-xml.c there's a check that assumes there is no latency
> information.
>
> Around line 634 in topology-xml.c:
>
> if (nbobjs && reldepth && latbase) {
>    ... process latency xml nodes
> }
>
> return hwloc__xml_import_close_tag(state);
>
> The hwloc__xml_import_close_tag function returns a failure because the
> latency nodes have not been processed yet.
>
> I had a look in orted where the xml is created and it does look like
> the xml is being assembled correctly as per the topology information it
> has retrieved (though I don't know if that itself is correct). The
> hwloc__xml_export_object function will quite happily create distance
> information if the relative depth is 0 even though
> hwloc__xml_import_distances will not be able to parse it.
>
> So there is at least a problem that the topology code will create xml
> that it can't parse, but I don't know enough about the hwloc library to
> know if relative depth should always be positive. I suspect it's the
> former which is the problem not the latter, but I don't know for sure...
>
> If it helps, this is the output of lstopo on the machine:
>
> cyeoh@p5-40-P4-E0:~$ /home/OpenHPC/hwloc/build/bin/lstopo
> Machine (2048MB)
>  NUMANode L#0 (P#0 512MB)
>    Socket L#0 + L1 L#0 (32KB) + Core L#0
>      PU L#0 (P#0)
>      PU L#1 (P#1)
>    Socket L#1 + L1 L#1 (32KB) + Core L#1
>      PU L#2 (P#2)
>      PU L#3 (P#3)
>  NUMANode L#1 (P#1 640MB)
>  NUMANode L#2 (P#2 512MB)
>  NUMANode L#3 (P#3 384MB)