Network Locality Devel Mailing List Archives

Subject: [netloc-devel] Fwd: Netloc For Collectives
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2013-12-06 08:26:28

(on the private netloc-devel group for now)

We've known about this from the beginning, and others keep bringing it up, too: that the full network map may be GIANT.

Let's consider that we have (at least) 3 or 4 areas where the entire map may be held:

1. a netloc daemon
2. mpirun
3. on each server-specific daemon (e.g., orteds or slurmds)
4. in each MPI process (limited, perhaps, to 1 MPI process per server)

#4 clearly sounds like a bad idea from a scalability perspective. But if -- for example -- only 1 MPI process has the entire map, how would you optimize collectives for communicators that do not involve that MPI process? It gets problematic from several directions.

#3 could be workable, but getting SLURM to expose that data to MPI -- or torqued or lsfd or ...<whatever>... -- could be challenging. orted could hold the data, but not every MPI has such a concept. orted's copy of the map on each server may also be redundant with SLURM/Torque/LSF/etc.

#2 could be workable, especially with ORTE's new asynchronous progress behavior. When creating a new communicator, MPI procs could make specific queries of mpirun -- effectively offloading both the storage and computation on the map to mpirun. This would probably *dramatically* slow down communicator creation, but this behavior could also be disabled via MCA param or something.

#1 is effectively the same as #2, but it would be centralized for the whole HPC cluster/environment rather than on a per-MPI-job basis. It would allow sysadmins to dedicate resources to these network computations (especially if they get computationally expensive, which could make #2 icky if mpirun is on the same server as MPI processes).

Perhaps a blend of #1 and #2 is a good idea. ...I'm leaping ahead to imagine a scenario where you might want to split the load between a cluster-wide resource and a per-job resource... hmmm...

This is something we should discuss.

At a minimum, #1 and/or #2 assume that we have network-queryable operations for clients that do not have a copy of the map, which would be a new concept for netloc.
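
A minimal sketch of what such a queryable operation might look like, assuming a hypothetical `MapService` object held by mpirun or a netloc daemon. None of these names exist in netloc, and the transport is left out entirely (the "query" here is a plain method call). The only point is that a client sends a hostlist and gets back a small answer, never the map itself.

```python
from collections import deque


class MapService:
    """Hypothetical holder of the full network map (mpirun or a netloc daemon)."""

    def __init__(self, links):
        # links: iterable of (endpoint_a, endpoint_b) pairs; the map is undirected.
        self._adj = {}
        for a, b in links:
            self._adj.setdefault(a, set()).add(b)
            self._adj.setdefault(b, set()).add(a)

    def hops(self, src, dst):
        """Number of links on a shortest path, or None if unknown/unreachable."""
        if src not in self._adj or dst not in self._adj:
            return None
        dist = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            if node == dst:
                return dist[node]
            for nxt in self._adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        return None

    def handle_query(self, hosts):
        """Answer one client request: pairwise hop counts for these hosts only."""
        hosts = sorted(hosts)
        return {
            (a, b): self.hops(a, b)
            for i, a in enumerate(hosts)
            for b in hosts[i + 1:]
        }


# Toy map: four nodes, two leaf switches, one root switch.
service = MapService([
    ("n0", "s0"), ("n1", "s0"),
    ("n2", "s1"), ("n3", "s1"),
    ("s0", "r"), ("s1", "r"),
])
# What an MPI process might ask for when creating a communicator.
answer = service.handle_query(["n0", "n1", "n2"])
```

In this sketch the reply grows with the square of the hostlist length, not with the size of the map, which is the property that options #1 and #2 would rely on.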

Begin forwarded message:

From: Brice Goglin <Brice.Goglin_at_[hidden]>
Subject: Re: Netloc For Collectives
Date: December 6, 2013 4:15:39 AM EST
To: Joshua Hursey <jjhursey_at_[hidden]>
Cc: <miked_at_[hidden]>, Jeff Squyres <jsquyres_at_[hidden]>, <joshual_at_[hidden]>, <yosefe_at_[hidden]>, <richardg_at_[hidden]>

Yes, sorry, I am in South America until next Monday and the Internet access isn't as good as expected.
My general feeling about this is that we may end up having the generic netloc API for "random network graphs" and then add some specific APIs for "regular topologies" such as fat-trees, torus, etc. I don't know yet if we'll be able to auto-detect these regular topologies or just let the user help us.


On 05/12/2013 16:22, Joshua Hursey wrote:

I think Brice is traveling, so he asked if I could jump in on this thread.

netloc is not currently part of Open MPI, but we hope that it will be one day when it is ready. At this stage we are still refining the interface based upon feedback from folks like you.

Once in Open MPI, it may be exposed in a similar way as the hwloc data, but there are some additional design complexities that would need to be considered. But nothing that we cannot figure out.

Regarding the interface you proposed, the idea of exposing a sub-tree topology from a list of hosts is interesting both for collectives and for schedulers. My only concern is that if there are multiple paths between switches (at any given level for a given set of nodes), how do we choose which one to represent in the resulting graph? I suppose that we could use the shortest physical path, or the logical path information that we cache between all of the nodes (if we have it, which for IB we do). It might be a bit computationally expensive, depending upon the size of the network and the size of the hostlist.
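
To make the "shortest physical path" option concrete, here is a rough sketch that is not tied to any netloc data structure. The function name, the edge-list input, and the sorted-neighbor tie-break are all assumptions made for illustration. It runs a breadth-first search from the first host and keeps only the links on the BFS paths to the other hosts, so when several equal-length paths exist exactly one is represented.

```python
from collections import deque


def subtree_for_hosts(links, hosts):
    """Return the set of links joining `hosts` along BFS shortest paths.

    links: iterable of (a, b) pairs describing an undirected network map.
    hosts: non-empty list of host names; hosts[0] is used as the BFS root.
    """
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    root = hosts[0]
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        # Sorting the neighbors makes the choice among equal-length paths
        # deterministic: the first path discovered wins.
        for nxt in sorted(adj.get(node, ())):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)

    edges = set()
    for host in hosts[1:]:
        if host not in parent:
            raise ValueError(f"{host} is not reachable from {root}")
        node = host
        while parent[node] is not None:
            edges.add(tuple(sorted((node, parent[node]))))
            node = parent[node]
    return edges


# Two leaf switches, each wired to two root switches (a tiny fat-tree).
fat_tree = [
    ("n0", "s0"), ("n1", "s0"), ("n2", "s1"), ("n3", "s1"),
    ("s0", "r0"), ("s0", "r1"), ("s1", "r0"), ("s1", "r1"),
]
chosen = subtree_for_hosts(fat_tree, ["n0", "n2"])
```

This costs one BFS over the whole map per call, which is where the expense mentioned above would show up on large networks. The result is a union of shortest paths from one host, not a minimum Steiner tree.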

I think we could prototype this interface as a library above netloc to explore the complexity of the subgraph partitioning algorithm (which this seems to be related to). Then, if it is not too network-specific, we could consider bringing it into the netloc base. This development path is beneficial in two ways. First, it would give us a separate space to play with the interface to the tree sub-graph routines and explore ways to maybe use existing libraries (boost, parmetis, etc.) to improve performance. Second, it would help exercise the current netloc interfaces to see where they could be improved to support such a library.

What are your thoughts on that?


-------- Original Message --------
Subject: RE: configure error when using knem from git
Date: Mon, 2 Dec 2013 20:25:12 +0000
From: Mike Dubman <miked_at_[hidden]>
To: Brice Goglin <Brice.Goglin_at_[hidden]>
Cc: Joshua Ladd <joshual_at_[hidden]>, Yossi Etigin <yosefe_at_[hidden]>, Richard Graham <richardg_at_[hidden]>

Hi Brice,
It was nice to meet you at SC.

Thanks for the help.

I wonder if we can discuss using netloc for topology-aware collective packages.
Is netloc now part of the OMPI tree? Do you plan to put it there?
We would like to have a general access API in netloc, something like this:

Get the topology sub-tree containing the hosts specified by hostlist.
The resulting tree can contain switches, for example two nodes connected to a switch.
If the tree has many roots (fat-tree topology), pick one at random.

      / \
     S   S
    / \ / \

NodeTree *get_physical_tree(hostlist)

Reload tree from topology source

int refresh_tree()
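
One possible reading of the two calls above, sketched in Python rather than C. Everything here is an assumption made for illustration: the `TreeCache` class, the two-level topology format (`node_to_switch`, `uplinks`), the nested-dict return value standing in for `NodeTree`, and the choice of returning 0 from `refresh_tree()` on success.

```python
import random


class TreeCache:
    """Hypothetical wrapper offering get_physical_tree() and refresh_tree()."""

    def __init__(self, load_topology, rng=None):
        # load_topology: callable returning a dict with two keys (assumed format):
        #   "node_to_switch": {node name: leaf switch name}
        #   "uplinks":        {leaf switch name: [root switch names]}
        self._load = load_topology
        self._rng = rng or random.Random()
        self._topo = None
        self.refresh_tree()

    def refresh_tree(self):
        """Reload the tree from the topology source; 0 means success (assumed)."""
        self._topo = self._load()
        return 0

    def get_physical_tree(self, hostlist):
        """Return a nested dict: {root: {leaf switch: [hosts]}} or {switch: [hosts]}."""
        node_to_switch = self._topo["node_to_switch"]
        uplinks = self._topo["uplinks"]

        # Group the requested hosts under their leaf switches.
        leaves = {}
        for host in hostlist:
            leaves.setdefault(node_to_switch[host], []).append(host)

        # All hosts under one switch: that switch is the whole tree.
        if len(leaves) == 1:
            switch, hosts = next(iter(leaves.items()))
            return {switch: sorted(hosts)}

        # Otherwise pick, at random, one root that every leaf switch can reach.
        common = set.intersection(*(set(uplinks[s]) for s in leaves))
        if not common:
            raise ValueError("no single root switch reaches every host")
        root = self._rng.choice(sorted(common))
        return {root: {s: sorted(h) for s, h in leaves.items()}}


def example_source():
    # Stand-in for a real topology source; a tiny fat-tree with two roots.
    return {
        "node_to_switch": {"n0": "s0", "n1": "s0", "n2": "s1", "n3": "s1"},
        "uplinks": {"s0": ["r0", "r1"], "s1": ["r0", "r1"]},
    }


cache = TreeCache(example_source)
same_switch = cache.get_physical_tree(["n0", "n1"])
across_switches = cache.get_physical_tree(["n0", "n2"])
```

Because the root is picked at random, two callers asking for the same hostlist may get trees with different roots unless they share a seed. A real interface would need to decide whether that is acceptable.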

Please comment.

Kind Regards,

Mike Dubman | R&D Director, HPC
Tel: +972 (74) 712 9214 | Fax: +972 (74) 712 9111
Mellanox Ltd. 13 Zarchin St., Bldg B, Raanana 43662, Israel

Joshua Hursey
Assistant Professor of Computer Science
University of Wisconsin-La Crosse

Jeff Squyres