Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI_Bcast issue
From: Randolph Pullen (randolph_pullen_at_[hidden])
Date: 2010-08-12 21:54:02

Sorry for the late replies but with work, time zones etc…

This thread has been going on for a while, and in an attempt to bring it to a close I’m going to try to collapse it down to some core issues and answer all the questions in one place.

Richard: yes, your last statement is correct. I am using PVM solely as a launcher; the MPI worlds are semantically independent.

Jeff’s suggestion that it may be a network congestion issue rings a bell somewhere.
Jeff, although it is possible to make a small example program, this would require PVM or some other method of launching MPI simultaneously on each node. I would agree that this is a bit off topic for this forum and so I won’t mention it further. 

In finalizing this issue, I would like to discuss the characteristics of the other options available.  If I understood what to expect from alltoall on a large cluster, given the scenario outlined below, it would help me greatly in deciding how (or if) to rewrite this.

BTW:  Jeff, sorry if I misquoted you; I must have misunderstood.
From your post, reproduced in part here:
>- All of Open MPI's network-based collectives use point-to-point
>communications underneath (shared memory may not, but that's not the issue
>here).  One of the implementations is linear, meaning that the root sends the
>message to comm rank 1, then comm rank 2, ..etc.  But this algorithm is only
>used when the message is small, the number of peers is small, etc.  All the other
>algorithms are parallel in nature, meaning that after an iteration or two,
>multiple processes have the data and can start pipelining sends to other
>processes, etc.

What I meant when I said b-tree is nearly right, I think. I should have said ‘in an N-tree manner’, but both would produce O(log N) solutions, and I agree that these are all perfectly fine for almost everything.
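
To make the linear versus tree distinction concrete, here is a minimal round-counting sketch. It is only a model, not Open MPI's implementation: the function names are hypothetical, and it assumes every process that already holds the data can send one full message per round.

```python
def linear_bcast_rounds(n):
    """Root sends to each of the other n-1 ranks in turn: O(N) rounds."""
    return max(n - 1, 0)


def tree_bcast_rounds(n):
    """Every rank that holds the data forwards it each round,
    so the number of holders doubles: O(log N) rounds."""
    rounds = 0
    holders = 1  # only the root has the data at the start
    while holders < n:
        holders *= 2
        rounds += 1
    return rounds


if __name__ == "__main__":
    # Cluster sizes in the range discussed in this thread (assumed values).
    for n in (2, 10, 100, 300):
        print(n, linear_bcast_rounds(n), tree_bcast_rounds(n))
```

Under these assumptions the tree variant needs far fewer rounds as the node count grows, but each round still moves the whole message, so the model says nothing about bandwidth limits.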

This assumes that you have ‘adequate’ network bandwidth, as you correctly pointed out in your recent post.  This may not be the case for my problem (see below).

The Problem:
-    A large all to all (N to N transmission or N broadcasts) of possibly hundreds of GB in total.
-    The cluster size my clients will use is unknown at this time but is probably on the order of 10 to a few hundred nodes.
-    The number of nodes is likely to increase with the data size but the ratio of data/node is unknown and variable.

My design Goals:
1.    Speed and accuracy are everything.  Accuracy is paramount but the system would become unusable if this algorithm became exponential.
2.    I love the flexibility OMPI brings to fabric deployment.  I want to pass on the richness of these choices to my clients/customers. However, if an IB (or some other) plugin solution moved the alltoall algorithm from, say, O(N log N) to just O(log N) transmission, its mandatory use may be an acceptable solution on larger clusters.

My Assumptions
1.    I can concentrate on providing the best near linear solutions and ignore site implementation peculiarities
2.    Tuning each installation can accommodate all site specific idiosyncrasies
3.    The solution will probably be network bound.  No matter how fast the network is, 100GB may well be too much for concurrent p2p transmissions to run in O(log N) time
[please feel free to trash my assumptions]
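
A rough back-of-the-envelope check of the network-bound assumption can be sketched as follows. The assumptions are loud ones: the data is split evenly across nodes, every node must end up with everyone else's share (the "N broadcasts" reading of the problem), and the 1 GB/s per-node link speed is a hypothetical figure, not a measurement. The helper names are illustrative only.

```python
def per_node_receive_bytes(total_bytes, n_nodes):
    """Bytes one node must receive: everything except its own share."""
    own_share = total_bytes / n_nodes
    return total_bytes - own_share


def min_transfer_seconds(total_bytes, n_nodes, link_bytes_per_sec):
    """Lower bound on wall time imposed by a single node's inbound link.
    Ignores latency, contention and protocol overhead."""
    return per_node_receive_bytes(total_bytes, n_nodes) / link_bytes_per_sec


if __name__ == "__main__":
    GB = 10**9
    total = 100 * GB  # the 100GB figure from assumption 3
    link = 1 * GB     # hypothetical per-node link speed in bytes/second
    for n in (10, 100, 300):
        print(n, min_transfer_seconds(total, n, link))
```

In this model each node's inbound traffic approaches the total data size as N grows, so adding nodes does not shrink the per-link lower bound; only a faster link or less data per node does.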

This is a difficult problem. I have written 3 solutions for it using different technologies, and I have been unsatisfied with each so far.

Theoretically the problem can be solved in N broadcasts but, as you [Jeff] point out, in practice data loss is likely on nodes that are not ready, etc.  However, a near O(N) solution should be possible.

It appears that OMPI’s Bcast is O(log N) for N > a trivial number of nodes.
So AlltoAll is probably at least O(N log N), unless it utilises something other than p2p transmissions, and it’s only O(N log N) if there is adequate bandwidth on the network fabric.

Do I have it correct?
Is alltoall going to work for me?
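
The step counts behind these estimates can be sketched with a small model. It compares N tree broadcasts run one after another against a ring exchange, a standard textbook alternative that is not proposed anywhere in this thread and is not claimed to be what Open MPI does. The function names are hypothetical, and the model counts communication steps only, ignoring bandwidth and message size.

```python
def sequential_bcast_rounds(n):
    """N tree broadcasts run back to back: N * ceil(log2 N) rounds."""
    if n <= 1:
        return 0
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 2
    return n * (n - 1).bit_length()


def ring_exchange_steps(n):
    """Simulate a ring in which node i sends to node (i + 1) % n.
    Each step, every node forwards the block it received most recently."""
    have = [{i} for i in range(n)]  # blocks held by each node
    in_flight = list(range(n))      # block each node sends next
    steps = 0
    while any(len(h) < n for h in have):
        # all sends in a step happen concurrently
        received = [in_flight[(i - 1) % n] for i in range(n)]
        for i, block in enumerate(received):
            have[i].add(block)
        in_flight = received
        steps += 1
    return steps


if __name__ == "__main__":
    for n in (10, 100, 300):
        print(n, sequential_bcast_rounds(n), ring_exchange_steps(n))
```

In this model the ring finishes in N - 1 steps, which is the kind of near O(N) behaviour mentioned above, while back-to-back tree broadcasts take on the order of N log N rounds.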


--- On Fri, 13/8/10, Jeff Squyres <jsquyres_at_[hidden]> wrote:

From: Jeff Squyres <jsquyres_at_[hidden]>
Subject: Re: [OMPI users] MPI_Bcast issue
To: "Open MPI Users" <users_at_[hidden]>
Received: Friday, 13 August, 2010, 3:03 AM

Dick / all --

I just had a phone call with Ralph Castain who has had some additional off-list mails with Randolph.  Apparently, none of us understand the model that is being used here.  There are also apparently some confidentiality issues involved such that it might be difficult to publicly state enough information to allow the open community to understand, diagnose, and fix the issue.  So I'm not quite sure how to proceed here -- I'm afraid that I don't have the time or resources for private problem resolution in an unorthodox situation like this.

For example, I was under the impression that PVM was solely being used as a launcher.  This is apparently not the case -- the original code is a PVM job that has been modified to eventually call MPI_INIT.  I don't know how much more I can say on this open list.

Hence, I'm thoroughly confused as to the model that is being used at this point.  I don't think I can offer any further help unless a small [non-PVM] example is provided to the community that can show the problem.

I also asked a bunch of questions in a prior post that would be helpful to have answered before going further.

Sorry!  :-(

On Aug 12, 2010, at 9:32 AM, Richard Treumann wrote:

> You said "separate MPI applications doing 1 to N broadcasts over PVM".  You do not mean you are using pvm_bcast though - right?
> If these N MPI applications are so independent that you could run one at a time or run them on N different clusters and still get the result you want (not the time to solution) then I cannot imagine how there could be cross talk.   
> I have been assuming that when you describe this as an NxN problem, you mean there is some desired interaction among the N MPI worlds.   
> If I have misunderstood and the N MPI worlds started with N mpirun operations under PVM are each semantically independent of the other (N-1) then I am totally at a loss for an explanation.
> Dick Treumann  -  MPI Team           
> IBM Systems & Technology Group
> Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
> Tele (845) 433-7846         Fax (845) 433-8363
> users-bounces_at_[hidden] wrote on 08/11/2010 08:59:16 PM:
> > Re: [OMPI users] MPI_Bcast issue
> > From: Randolph Pullen
> > To: Open MPI Users
> > Date: 08/11/2010 09:01 PM
> > Sent by: users-bounces_at_[hidden]
> > Please respond to Open MPI Users
> >
> > I (a single user) am running N separate MPI  applications doing 1 to
> > N broadcasts over PVM, each MPI application is started on each
> > machine simultaneously by PVM - the reasons are back in the post history.
> >
> > The problem is that they somehow collide - yes I know this should
> > not happen, the question is why.
> >
> > --- On Wed, 11/8/10, Richard Treumann <treumann_at_[hidden]> wrote:
> >
> > From: Richard Treumann <treumann_at_[hidden]>
> > Subject: Re: [OMPI users] MPI_Bcast issue
> > To: "Open MPI Users" <users_at_[hidden]>
> > Received: Wednesday, 11 August, 2010, 11:34 PM
> >
> > Randolph
> >
> > I am confused about using multiple, concurrent mpirun operations. 
> > If there are M uses of mpirun and each starts N tasks (carried out
> > under pvm or any other way) I would expect you to have M completely
> > independent MPI jobs with N tasks (processes) each.  You could have
> > some root in each of the M MPI jobs do an MPI_Bcast to the other
> > (N-1) in that job but there is no way in MPI (without using
> > accept/connect) to get tasks of job 0 to give data to tasks of jobs 1 through M-1.
> >
> > With M uses of mpirun, you have M worlds that are forever isolated
> > from the other M-1 worlds (again, unless you do accept/connect)
> >
> > In what sense are you treating this as a single MxN application?
> > ( I use M & N to keep them distinct. I assume if M == N, we have your case)
> >
> >
> > Dick Treumann  -  MPI Team           
> > IBM Systems & Technology Group
> > Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
> > Tele (845) 433-7846         Fax (845) 433-8363
> >
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]

Jeff Squyres