Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [RFC] Hierarchical Topology
From: Sylvain Jeaugey (sylvain.jeaugey_at_[hidden])
Date: 2010-11-16 03:23:35

On Mon, 15 Nov 2010, Ralph Castain wrote:

> Guess I am a little confused. Every MPI process already has full knowledge
> of what node all other processes are located on - this has been true for
> quite a long time.
Ok, I didn't see that.

> Once my work is complete, mpirun will have full knowledge of each node's
> hardware resources. Terry will then use that in mpirun's mappers. The
> resulting launch message will contain a full mapping of procs to cores -
> i.e., every daemon will know the core placement of every process in the job.
> That info will be passed down to each MPI proc. Thus, upon launch, every MPI
> process will know not only the node for each process, but also the hardware
> resources of that node, and the bindings of every process in the job to that
> hardware.

Some things bug me, however:
  1. What if the placement has been done by a wrapper script or by the
resource manager? I.e., how do you know where MPI procs are located?
  2. How scalable is it? I would think there is an allgather with 1 process
per node; am I right?
  3. How is that information represented? As a graph?

> So the only thing missing is the switch topology of the cluster (the
> inter-node topology). We modified carto a while back to support input of
> switch topology information, though I'm not sure how many people ever used
> that capability - not much value in it so far. We just set it up so that
> people could describe the topology, and then let carto compute hop distance.
Ok. I didn't know we also had some work on switches in carto.

This helps!

So I'm now wondering whether these two efforts, which seem similar, are
really redundant. We thought about this before starting hitopo, and since a
graph didn't fit our needs, we started working towards computing an address.
Perhaps hitopo addresses could be computed from hwloc's graph.

I understand that for sm optimization, hwloc is richer. The only thing
that bugs me is how much time it takes to figure out what capability exists
between processes A and B. The great thing about hitopo is that a single
comparison can give you a property of two processes (e.g. that they are on
the same socket).
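To illustrate the single-comparison idea, here is a minimal sketch under an assumed encoding (this is a hypothetical packed format, not hitopo's actual representation): if each topology level occupies a fixed-width field of an integer address, then "are these two processes in the same socket/node/island?" reduces to shifting away the lower levels and doing one integer comparison.

```c
#include <stdint.h>

/* Hypothetical packed address: 8 bits per level, island in the high
 * bits down to core in the low bits. Level 0 = core, 1 = socket,
 * 2 = node, 3 = switch/island. Field widths are an assumption. */
static uint32_t hitopo_pack(uint8_t island, uint8_t node,
                            uint8_t socket, uint8_t core)
{
    return ((uint32_t)island << 24) | ((uint32_t)node << 16) |
           ((uint32_t)socket << 8) | (uint32_t)core;
}

/* Two processes share everything from 'level' upward iff their
 * addresses agree once the lower levels are shifted away:
 * a single integer comparison. */
static int hitopo_same_at(uint32_t a, uint32_t b, int level)
{
    return (a >> (8 * level)) == (b >> (8 * level));
}
```

With this encoding, grouping processes by socket or island for a collective becomes a shift and a compare per pair, with no graph traversal.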

Anyway, I just wanted to present hitopo in case someone needs it. And
I think hitopo's preferred domain remains collectives, where you do not
really need distances, but groups which share a certain locality.
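For reference, the renumbering step from the hitopo RFC quoted below (string addresses like nodenames converted to integers starting at 0) amounts to assigning each distinct string a dense number in order of first appearance. A minimal sketch, with a hypothetical helper that is not hitopo's actual code:

```c
#include <string.h>

/* Renumbering sketch: map each distinct string address to a dense
 * integer starting at 0, in order of first appearance. 'seen' and
 * 'nseen' are caller-provided table state; names must outlive the
 * table. A real implementation would hash rather than scan. */
static int renumber(const char *name, const char **seen, int *nseen)
{
    for (int i = 0; i < *nseen; i++)
        if (strcmp(seen[i], name) == 0)
            return i;            /* address already numbered */
    seen[*nseen] = name;         /* new address gets the next number */
    return (*nseen)++;
}
```

So after the allgather of string addresses, nodenames "foo" and "bar" come out as 0 and 1, which is what makes the packed-address comparison above possible.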


> On Mon, Nov 15, 2010 at 9:00 AM, Sylvain Jeaugey
> <sylvain.jeaugey_at_[hidden]> wrote:
>> I already mentioned it when answering Terry's e-mail, but to be sure I'm
>> clear: don't confuse full node topology with MPI job topology. It _is_ different.
>> And every process does not get the whole topology in hitopo, only its own,
>> which should not cause storms.
>> On Mon, 15 Nov 2010, Ralph Castain wrote:
>>> I think the two efforts (the paffinity and this one) do overlap somewhat.
>>> I've been writing the local topology discovery code for Jeff, Terry, and
>>> Josh - uses hwloc (or any other method - it's a framework) to discover
>>> what
>>> hardware resources are available on each node in the job so that the info
>>> can be used in mapping the procs.
>>> As part of that work, we are passing down to the mpi processes the local
>>> hardware topology. This is done because of prior complaints when we had
>>> each
>>> mpi process discover that info for itself - it creates a bit of a "storm"
>>> on
>>> the node of large smp's.
>>> Note that what I've written (still to be completed before coming over)
>>> doesn't tell the proc what cores/HT's it is bound to - that's the part
>>> Terry
>>> et al are adding. Nor were we discovering the switch topology of the
>>> cluster.
>>> So a little overlap that could be resolved. And a concern on my part: we
>>> have previously introduced capabilities that had every mpi process read
>>> local system files to get node topology, and gotten user complaints about
>>> it. We probably shouldn't go back to that practice.
>>> Ralph
>>> On Mon, Nov 15, 2010 at 8:15 AM, Terry Dontje <terry.dontje_at_[hidden]>
>>> wrote:
>>> A few comments:
>>>> 1. Have you guys considered using hwloc for level 4-7 detection?
>>>> 2. Is L2 related to L2 cache? If no then is there some other term you
>>>> could use?
>>>> 3. What do you see if the process is bound to multiple
>>>> cores/hyperthreads?
>>>> 4. What do you see if the process is not bound to any level 4-7 items?
>>>> 5. What about L1 and L2 cache locality as some levels? (hwloc exposes
>>>> these but these are also at different depths depending on the platform).
>>>> Note I am working with Jeff Squyres and Josh Hursey on some new paffinity
>>>> code that uses hwloc. Though the paffinity code may not have a direct
>>>> relationship to hitopo, the use of hwloc and standardization of what you
>>>> call level 4-7 might help avoid some user confusion.
>>>> --td
>>>> On 11/15/2010 06:56 AM, Sylvain Jeaugey wrote:
>>>> As a follow-up to Stuttgart's developer's meeting, here is an RFC for our
>>>> topology detection framework.
>>>> WHAT: Add a framework for hardware topology detection to be used by any
>>>> other part of Open MPI to help optimization.
>>>> WHY: Collective operations or shared memory algorithms among others may
>>>> have optimizations depending on the hardware relationship between two MPI
>>>> processes. HiTopo is an attempt to provide it in a unified manner.
>>>> WHERE: ompi/mca/hitopo/
>>>> WHEN: When wanted.
>>>> ==========================================================================
>>>> We developed the HiTopo framework for our collective operation component,
>>>> but it may be useful for other parts of Open MPI, so we'd like to
>>>> contribute it.
>>>> A wiki page has been set up:
>>>> and a bitbucket repository:
>>>> In a few words, we have 3 steps in HiTopo:
>>>> - Detection: each MPI process detects its topology at various levels:
>>>> - core/socket: through the cpuid component
>>>> - node: through gethostname
>>>> - switch/island: through openib (mad) or slurm
>>>> [ Other topology detection components may be added for other
>>>> resource managers, specific hardware or whatever we want ...]
>>>> - Collection: an allgather is performed to have all other processes'
>>>> addresses
>>>> - Renumbering: "string" addresses are converted to numbers starting at 0
>>>> (Example: nodenames "foo" and "bar" are renamed 0 and 1).
>>>> Any comment welcome,
>>>> Sylvain
>>>> _______________________________________________
>>>> devel mailing list
>>>> devel_at_[hidden]
>>>> --
>>>> [image: Oracle]
>>>> Terry D. Dontje | Principal Software Engineer
>>>> Developer Tools Engineering | +1.781.442.2631
>>>> Oracle * - Performance Technologies*
>>>> 95 Network Drive, Burlington, MA 01803
>>>> Email terry.dontje_at_[hidden]