Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] is there an equiv of iprove for bcast?
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2011-05-11 09:12:21

I'm not so much worried about the "load" that N pending ibcasts would cause; the "load" will be zero until the broadcast actually fires. But I'm concerned about the pending resource usage (i.e., how many internal network and collective resources will be slurped up by hundreds or thousands of pending broadcasts).

You might want to have a tiered system instead. Have a tree-based communication pattern where each worker has a "parent" who does the actual broadcasting; each broadcaster can have tens of children (for example). You could even have an N-level tree, perhaps grouping your children by server rack and/or network topology.
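One way to picture the tiered layout is the minimal sketch below. It assumes ranks are numbered 0 to size-1 and arranged as a simple k-ary tree rooted at rank 0; the names `parent_of`, `children_of`, `tree_depth` and the fanout of 10 are hypothetical, chosen only for illustration. It ignores rack and network topology, and it only computes who talks to whom; in a real MPI program the rank and size would come from the communicator, and each parent would forward to its children with non-blocking sends.

```python
def parent_of(rank, fanout):
    """Parent of `rank` in a fanout-ary tree rooted at rank 0 (None for the root)."""
    if rank == 0:
        return None
    return (rank - 1) // fanout


def children_of(rank, size, fanout):
    """Ranks that `rank` forwards to; empty for leaves."""
    first = rank * fanout + 1
    return list(range(first, min(first + fanout, size)))


def tree_depth(size, fanout):
    """Forwarding hops from the root to the deepest rank.

    Ranks are laid out level by level, so the highest rank is on the deepest level.
    """
    depth = 0
    rank = size - 1
    while rank != 0:
        rank = parent_of(rank, fanout)
        depth += 1
    return depth


if __name__ == "__main__":
    size, fanout = 1000, 10  # hypothetical cluster size and children per parent
    print("children of root:", children_of(0, size, fanout))
    print("parent of last rank:", parent_of(size - 1, fanout))
    print("hops to deepest rank:", tree_depth(size, fanout))
```

With tens of children per parent, even a thousand processes are only a few hops from the root, and each process handles a number of sends bounded by the fanout.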

That way, you can have a small number of processes at the top of the tree that do an actual broadcast. The rest can use a (relatively) small number of non-blocking sends and receives. Or, when non-blocking collectives become available, you can have everyone in pending ibcasts with the small number of broadcasters (i.e., N broadcasters for M processes, where N << M), which wouldn't be nearly as resource-heavy as M pending ibcasts.
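As rough arithmetic for the N << M point, the sketch below counts pending requests under one assumption: every process keeps exactly one pending non-blocking broadcast per potential broadcaster. The function name `pending_requests` and the example sizes are hypothetical, and the real cost of each pending request depends on the MPI implementation, so this only compares counts.

```python
def pending_requests(num_processes, num_broadcasters):
    """Return (pending per process, pending across the whole job).

    Assumes each process posts one pending broadcast per potential broadcaster.
    """
    per_process = num_broadcasters
    total = per_process * num_processes
    return per_process, total


if __name__ == "__main__":
    m = 1000  # hypothetical number of processes (M)
    n = 10    # hypothetical number of broadcasters (N)
    print("any process may broadcast:", pending_requests(m, m))
    print("N broadcasters only:      ", pending_requests(m, n))
```

Under that assumption, the per-process count drops from M to N, and the job-wide count from M*M to M*N.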

Or something like that... just throwing some ideas out there for you...

On May 10, 2011, at 7:14 PM, Randolph Pullen wrote:

> Thanks,
> The messages are small and frequent (they flash metadata across the cluster). The current approach works fine for small to medium clusters but I want it to be able to go big. Maybe up to several hundred or even a thousand nodes.
> It's these larger deployments that concern me. The current scheme may see the clearinghouse become overloaded in a very large cluster.
> From what you have said, a possible strategy may be to combine the listener and worker into a single process, using the non-blocking bcast just for that group, while each worker scanned its own port for an incoming request, which it would in turn bcast to its peers.
> As you have indicated though, this would depend on the load the non-blocking bcast would cause. At least the load would be fairly even over the cluster.
> --- On Mon, 9/5/11, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> From: Jeff Squyres <jsquyres_at_[hidden]>
> Subject: Re: [OMPI users] is there an equiv of iprove for bcast?
> To: randolph_pullen_at_[hidden]
> Cc: "Open MPI Users" <users_at_[hidden]>
> Received: Monday, 9 May, 2011, 11:27 PM
> On May 3, 2011, at 8:20 PM, Randolph Pullen wrote:
> > Sorry, I meant to say:
> > - on each node there is 1 listener and 1 worker.
> > - all workers act together when any of the listeners send them a request.
> > - currently I must use an extra clearinghouse process to receive from any of the listeners and bcast to workers. This is unfortunate because of the potential scaling issues.
> >
> > I think you have answered this in that I must wait for MPI-3's non-blocking collectives.
> Yes and no. If each worker starts N non-blocking broadcasts just to be able to test for completion of any of them, you might end up consuming a bunch of resources for them (I'm *anticipating* that pending non-blocking collective requests may be more heavyweight than pending non-blocking point-to-point requests).
> But then again, if N is small, it might not matter.
> > Can anyone suggest another way? I don't like the serial clearinghouse approach.
> If you only have a few workers and/or the broadcast message is small and/or the broadcasts aren't frequent, then MPI's built-in broadcast algorithms might not offer much more optimization than doing your own with point-to-point mechanisms. I don't usually recommend this, but it may be possible for your case.
> --
> Jeff Squyres
> jsquyres_at_[hidden]

Jeff Squyres