Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [RFC] Hierarchical Topology
From: Sylvain Jeaugey (sylvain.jeaugey_at_[hidden])
Date: 2010-11-15 10:53:04


On Mon, 15 Nov 2010, Terry Dontje wrote:

> A few comments:
>
> 1. Have you guys considered using hwloc for level 4-7 detection?
Yes, and I agree there may be something to improve on level 4-7 detection.
But note that hitopo differs from hwloc because it is not discovering the
whole machine, only where MPI processes have been spawned. More on this
after.

> 2. Is L2 related to L2 cache? If no then is there some other term you could
> use?
It is not L2 cache. However, claiming that L2 always refers to L2
cache is a bit exaggerated in my opinion. The term in hitopo is "L2NUMA",
which seems clear to me. And there are L2 Infiniband switches, L2
support, ... :-)

> 3. What do you see if the process is bound to multiple cores/hyperthreads?
> 4. What do you see if the process is not bound to any level 4-7 items?
Currently (and this is not optimal), as soon as the process is not bound
to exactly one core, the cpuid component returns nothing (no socket, no
core). We could improve this by returning only the socket when we are
bound to a socket.
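
As a rough illustration of that rule (not the actual cpuid component
code), here is a small Python sketch. The function name, the cpu_map
table and the socket_fallback flag are all hypothetical; the flag only
switches between the current behaviour and the suggested improvement.

```python
def locate(bound_cpus, cpu_map, socket_fallback=False):
    """Return the (socket, core) pair to report for one process.

    bound_cpus: logical CPU ids the process is bound to.
    cpu_map: hypothetical table mapping a logical CPU id to (socket, core).
    socket_fallback: False mimics the current behaviour described above,
    True mimics the suggested improvement.
    """
    places = {cpu_map[cpu] for cpu in bound_cpus}
    if len(places) == 1:
        # Bound to exactly one core: report socket and core.
        return next(iter(places))
    sockets = {socket for socket, _core in places}
    if socket_fallback and len(sockets) == 1:
        # Bound to several cores of one socket: report the socket only.
        return (sockets.pop(), None)
    # Unbound, or spread over several sockets: report nothing.
    return (None, None)


# Example machine (made up): 2 sockets x 2 cores, one logical CPU per core.
cpu_map = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}
print(locate({1}, cpu_map))                           # (0, 1)
print(locate({0, 1}, cpu_map))                        # (None, None)
print(locate({0, 1}, cpu_map, socket_fallback=True))  # (0, None)
```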

When placement is not per-core, the socket number will therefore be 0 and
the core numbers will be assigned by the "renumber" phase from 0 to N (N
being the number of MPI processes on the node).

Hyperthreads are only used if two processes are bound to the same core
(the renumber phase will mark them as 0, 1, ...).
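
To make the renumber phase more concrete, here is a minimal Python sketch
of the idea. It is not the hitopo code: the function names are
hypothetical, and numbering in first-seen order is an assumption of mine
for the example.

```python
def renumber(addresses):
    """Map arbitrary addresses to dense integers, in first-seen order."""
    ids = {}
    return [ids.setdefault(addr, len(ids)) for addr in addresses]


def renumber_within(parents, keys):
    """Renumber keys from 0 independently inside each parent group."""
    tables = {}
    result = []
    for parent, key in zip(parents, keys):
        table = tables.setdefault(parent, {})
        result.append(table.setdefault(key, len(table)))
    return result


# One entry per MPI rank: the node each rank runs on.
nodes = ["foo", "foo", "bar", "foo"]
print(renumber(nodes))                # [0, 0, 1, 0]

# No core detected: use the rank itself as key, so that every process
# on a node gets its own core number, counted per node.
ranks = list(range(len(nodes)))
print(renumber_within(nodes, ranks))  # [0, 1, 0, 2]
```

The same per-group numbering, with (node, core) as the parent, would give
0, 1, ... to processes sharing a core, which is the hyperthread case.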

> 5. What about L1 and L2 cache locality as some levels? (hwloc exposes these
> but these are also at different depths depending on the platform).
This is something hitopo doesn't [want to] show. But we could imagine
calling hwloc to find out the properties shared by MPI processes on the
same core/socket/...

> Note I am working with Jeff Squyres and Josh Hursey on some new paffinity
> code that uses hwloc. Though the paffinity code may not have direct
> relationship to hitopo the use of hwloc and standardization of what you call
> level 4-7 might help avoid some user confusions.
I agree there is a big potential for confusion between hwloc, carto,
hitopo, ... One could think we should share code between them, which is
often not possible or not what we want.

My (maybe incorrect) vision is that hwloc and carto discover the hardware
topology, i.e. what exists on the node (not what will be used). This is
used by placement modules or btls to know what resources to use when
launching processes.

HiTopo reports where (inside this discovered topology) MPI processes end
up being spawned [btw, not only intra-node but also inter-node]. We could
get this information from the Open MPI components that do the spawning,
but that is not enough (the resource manager may do part of the binding),
so we redo the discovery at the end.
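
For example, once every process has an address made of one number per
level, a consumer such as a collective component can find the closest
level two processes share. The sketch below is only an illustration in
Python: the level names, their order and the function names are
hypothetical, not the HiTopo API.

```python
# Hypothetical level names, outermost first.
LEVELS = ("island", "switch", "node", "socket", "core")


def shared_depth(addr_a, addr_b):
    """Count the leading levels on which two addresses agree."""
    depth = 0
    for a, b in zip(addr_a, addr_b):
        if a != b:
            break
        depth += 1
    return depth


def closest_common_level(addr_a, addr_b):
    """Name the innermost level shared by both processes, or None."""
    depth = shared_depth(addr_a, addr_b)
    return LEVELS[depth - 1] if depth else None


# Addresses are tuples of renumbered ids, one per entry in LEVELS.
rank0 = (0, 0, 1, 0, 2)
rank1 = (0, 0, 1, 0, 3)   # same socket as rank0, different core
rank2 = (0, 1, 4, 0, 2)   # same island as rank0, different switch
print(closest_common_level(rank0, rank1))  # socket
print(closest_common_level(rank0, rank2))  # island
```

Comparing from the outermost level matters because inner numbers (for
instance core 2) are only meaningful inside the same node and socket.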

To sum up, here is the complete picture as I see it:

[ 0. Resource manager restricts node/cpu/io/mem sets ]
   1. Hwloc discovers what's available for intra-node
   2. Spawning/placement is done by a combination of RMs, paffinity, ...
   3. HiTopo discovers what is used from intra- to inter- node.

Sylvain

> On 11/15/2010 06:56 AM, Sylvain Jeaugey wrote:
>> As a follow-up to the Stuttgart developers' meeting, here is an RFC for
>> our topology detection framework.
>>
>> WHAT: Add a framework for hardware topology detection to be used by any
>> other part of Open MPI to help optimization.
>>
>> WHY: Collective operations or shared memory algorithms among others may
>> have optimizations depending on the hardware relationship between two MPI
>> processes. HiTopo is an attempt to provide it in a unified manner.
>>
>> WHERE: ompi/mca/hitopo/
>>
>> WHEN: When wanted.
>>
>> ==========================================================================
>> We developed the HiTopo framework for our collective operation component,
>> but it may be useful for other parts of Open MPI, so we'd like to
>> contribute it.
>>
>> A wiki page has been set up :
>> https://svn.open-mpi.org/trac/ompi/wiki/HiTopo
>>
>> and a bitbucket repository :
>> http://bitbucket.org/jeaugeys/hitopo/
>>
>> In a few words, we have 3 steps in HiTopo :
>>
>> - Detection : each MPI process detects its topology at various levels :
>>    - core/socket : through the cpuid component
>>    - node : through gethostname
>>    - switch/island : through openib (mad) or slurm
>>    [ Other topology detection components may be added for other
>>    resource managers, specific hardware or whatever we want ...]
>>
>> - Collection : an allgather is performed to have all other processes'
>> addresses
>>
>> - Renumbering : "string" addresses are converted to numbers starting at 0
>> (Example : nodenames "foo" and "bar" are renamed 0 and 1).
>>
>> Any comment welcome,
>> Sylvain
>
>
> --
> Oracle
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.781.442.2631
> Oracle *- Performance Technologies*
> 95 Network Drive, Burlington, MA 01803
> Email terry.dontje_at_[hidden]