Subject: Re: [netloc-devel] Netloc For Collectives
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2013-12-06 09:14:12


On Dec 6, 2013, at 8:26 AM, Jeff Squyres <jsquyres_at_[hidden]> wrote:

> (on the private netloc-devel group for now)

Hah. Well, too bad I screwed up in my email client and sent this to the public list (the intent was to talk this over as a small group first). Oh well! :-)

So here's a glimpse of the things that we're still talking about. Comments welcome on the text / ideas below...

> We've known about this from the beginning, and others keep bringing it up, too: that the full network map may be GIANT.
>
> Let's consider that we have (at least) 3 or 4 places where the entire map might be held:
>
> 1. a netloc daemon
> 2. mpirun
> 3. on each server-specific daemon (e.g., orteds or slurmds)
> 4. in each MPI process (limited, perhaps, to 1 MPI process per server)
>
> #4 clearly sounds like a bad idea from a scalability perspective. But if -- for example -- only 1 MPI process has the entire map, how would you optimize collectives for communicators that do not involve that MPI process? It gets problematic from several directions.
>
> #3 could be workable, but getting SLURM to expose that data to MPI -- or torqued or lsfd or ...<whatever>... -- could be challenging. orted could hold the data, but not every MPI implementation has such a concept. orted's copy of the map on each server may also be redundant with what SLURM/Torque/LSF/etc. already hold.
>
> #2 could be workable, especially with ORTE's new asynchronous progress behavior. When creating a new communicator, MPI procs could make specific queries of mpirun -- effectively offloading both the storage and computation on the map to mpirun. This would probably *dramatically* slow down communicator creation, but this behavior could also be disabled via MCA param or something.
>
> #1 is effectively the same as #2, but it would be centralized for the whole HPC cluster/environment rather than on a per-MPI-job basis. It would allow sysadmins to dedicate resources to these network computations (especially if they get computationally expensive, which could make #2 icky if mpirun is on the same server as MPI processes).
>
> Perhaps a blend of #1 and #2 is a good idea. ...I'm leaping ahead to imagine a scenario where you might want to split the load between a cluster-wide resource and a per-job resource... hmmm...
>
> This is something we should discuss.
>
> At a minimum, #1 and/or #2 assume that we have network-queryable operations for clients that do not have a copy of the map, which would be a new concept for netloc.
>
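
To make #2 a little more concrete, here's a rough -- and purely hypothetical -- sketch of what the MPI-process side of "ask mpirun instead of holding the map" could look like at communicator-creation time. None of these function names exist in netloc or ORTE today; the query is stubbed out with a made-up answer so the example stands alone, but in reality it would be an OOB round trip to mpirun (and the whole thing could sit behind the MCA param mentioned above):

#include <stdio.h>
#include <string.h>

/* Stand-in for the real query, which would be a round trip to mpirun (or a
   netloc daemon).  It fakes an answer so this sketch compiles and runs. */
static int query_mpirun_num_hops(const char *host_a, const char *host_b,
                                 int *num_hops)
{
    *num_hops = (0 == strcmp(host_a, host_b)) ? 0 : 3;  /* made-up answer */
    return 0;
}

int main(void)
{
    const char *my_host = "node001";
    const char *peers[] = { "node001", "node042", "node317" };
    const int npeers = 3;

    /* At communicator creation, ask for the distance to each peer; a real
       coll component would feed these into its tree/ring construction.  If
       any query fails, fall back to the non-topology-aware algorithms. */
    for (int i = 0; i < npeers; ++i) {
        int hops = 0;
        if (0 != query_mpirun_num_hops(my_host, peers[i], &hops)) {
            return 1;  /* fall back */
        }
        printf("%s <-> %s: %d hops\n", my_host, peers[i], hops);
    }
    return 0;
}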

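And here's an equally hypothetical sketch of the "network-queryable operations for clients that do not have a copy of the map" idea as a request/reply exchange -- the kind of thing a cluster-wide netloc daemon (#1) or mpirun (#2) would service. Again, none of these types or operation names exist; it's just one possible shape to argue about:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Operations a map-less client might ask the map-holder to run on its behalf */
typedef enum {
    NETLOC_Q_NUM_HOPS      = 1,  /* switch hops between two hosts          */
    NETLOC_Q_COMMON_SWITCH = 2,  /* lowest common switch of two hosts      */
    NETLOC_Q_SUBTREE_HOSTS = 3   /* hosts hanging off a given switch       */
} netloc_query_op_t;

typedef struct {
    uint32_t op;                 /* one of netloc_query_op_t               */
    char     arg_a[64];          /* e.g., hostname A                       */
    char     arg_b[64];          /* e.g., hostname B (if the op needs one) */
} netloc_query_req_t;

typedef struct {
    int32_t status;              /* 0 == success                           */
    int32_t value;               /* e.g., hop count                        */
} netloc_query_rep_t;

/* Server-side dispatch: the full map lives only where this function runs. */
static void handle_query(const netloc_query_req_t *req,
                         netloc_query_rep_t *rep)
{
    memset(rep, 0, sizeof(*rep));
    switch (req->op) {
    case NETLOC_Q_NUM_HOPS:
        /* ...would walk the server's copy of the map; faked here... */
        rep->value = strcmp(req->arg_a, req->arg_b) ? 3 : 0;
        break;
    default:
        rep->status = -1;        /* unknown / unimplemented operation      */
        break;
    }
}

int main(void)
{
    netloc_query_req_t req = { .op = NETLOC_Q_NUM_HOPS };
    netloc_query_rep_t rep;

    snprintf(req.arg_a, sizeof(req.arg_a), "node001");
    snprintf(req.arg_b, sizeof(req.arg_b), "node042");

    handle_query(&req, &rep);    /* in real life: ship req over the network */
    printf("status=%d  hops=%d\n", (int)rep.status, (int)rep.value);
    return 0;
}
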
-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/