I understand the problem with descriptor flooding can be serious in an
application with unidirectional data dependency. Perhaps we have a
different perception of how common that is.
It seems to me that such programs would be very rare, but if they are more
common than I imagine, then discussion of how to modulate them is
worthwhile. In many cases, I think that adding some flow control to the
application is a better solution than a semantically redundant barrier. (By
"semantically redundant" I mean a barrier that is there only to affect
performance, not correctness.)
For example, a Master/Worker application could have each worker break
after every 4th send to the master and post an MPI_Recv for an
OK_to_continue token. If the token had already been sent, this would delay
the worker only a few microseconds. If it had not been sent, the worker
would block until the token arrived.
The Master would keep track of how many messages it had absorbed from each
worker and, on message 3 from a particular worker, send an OK_to_continue
token to that worker. The master would keep sending OK_to_continue tokens
every 4th recv from then on (7, 11, 15, ...). The descriptor queues would
all remain short and only a worker that the master could not keep up with
would ever lose a chance to keep working. By sending the OK_to_continue
token a bit early, the application would ensure that when there was no
backlog, every worker would find a token when it looked for it and there
would be no significant loss of compute time.
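For what it's worth, the token scheme can be sketched without MPI at all. Below is a minimal Python model in which queues stand in for MPI_Send/MPI_Recv; the worker count, message count, and batch size are illustrative assumptions, not a definitive implementation:

```python
import threading
import queue

NWORKERS = 4    # worker count (illustrative)
NSENDS = 20     # messages each worker produces (illustrative)
BATCH = 4       # worker pauses after every 4th send, as in the scheme above

results = queue.Queue()                              # stands in for sends to the master
tokens = [queue.Queue() for _ in range(NWORKERS)]    # per-worker OK_to_continue channel
final_counts = [0] * NWORKERS                        # messages the master has absorbed

def worker(rank):
    for i in range(1, NSENDS + 1):
        results.put((rank, i))          # "MPI_Send" of one result to the master
        if i % BATCH == 0:
            tokens[rank].get()          # "MPI_Recv" for the OK_to_continue token

def master():
    for _ in range(NWORKERS * NSENDS):
        rank, _ = results.get()         # absorb one result
        final_counts[rank] += 1
        # Send the token one message early (3, 7, 11, ...) so a worker
        # with no backlog finds it already waiting and loses no time.
        if final_counts[rank] % BATCH == BATCH - 1:
            tokens[rank].put("OK_to_continue")

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NWORKERS)]
threads.append(threading.Thread(target=master))
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because a worker can never be more than one batch ahead of its last token, the results queue stays short no matter how far ahead the fast workers would otherwise run.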
Even with a non-blocking barrier and a 10-step lag between the MPI_Ibarrier
and the matching MPI_Wait, if some worker is slow for 12 steps, the fast
workers still end up stuck in a non-productive MPI_Wait.
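The arithmetic behind that last point can be checked with a toy timing model (plain Python, no MPI; the 4 ranks, 40 steps, and 2x slowdown are illustrative assumptions):

```python
def finish_times(costs, lag):
    """Toy model: at step i each rank posts an Ibarrier and waits on the
    Ibarrier posted at step i - lag, so no rank may run more than `lag`
    steps ahead of the slowest rank.  Returns per-rank blocked time."""
    n_ranks, n_steps = len(costs), len(costs[0])
    done = [[0.0] * n_steps for _ in range(n_ranks)]
    idle = [0.0] * n_ranks
    for i in range(n_steps):
        # the Ibarrier from step i - lag completes when the last rank finishes it
        barrier = max(done[r][i - lag] for r in range(n_ranks)) if i >= lag else 0.0
        for r in range(n_ranks):
            prev = done[r][i - 1] if i else 0.0
            start = max(prev, barrier)
            idle[r] += start - prev          # time stuck in MPI_Wait
            done[r][i] = start + costs[r][i]
    return idle

even = [[1.0] * 40 for _ in range(4)]        # 4 ranks, 40 steps, equal cost
slow12 = [row[:] for row in even]
slow12[3][:12] = [2.0] * 12                  # rank 3 at half speed for 12 steps
slow9 = [row[:] for row in even]
slow9[3][:9] = [2.0] * 9                     # rank 3 at half speed for only 9 steps

idle_even = finish_times(even, lag=10)
idle_slow12 = finish_times(slow12, lag=10)
idle_slow9 = finish_times(slow9, lag=10)
```

In this model the fast ranks accumulate zero blocked time when the slow stretch fits inside the 10-step lag window (the 9-step case), but once the accumulated slowness exceeds the window (the 12-step case) they sit idle in MPI_Wait, which is the behavior described above.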
Dick Treumann - MPI Team
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363
users-bounces_at_[hidden] wrote on 09/09/2010 05:34:15 PM:
> Re: [OMPI users] MPI_Reduce performance
> Ashley Pittman, 09/09/2010 05:37 PM
> On 9 Sep 2010, at 21:40, Richard Treumann wrote:
> > Ashley
> > Can you provide an example of a situation in which these
> > semantically redundant barriers help?
> I'm not making the case for semantically redundant barriers; I'm
> making a case for implicit synchronisation in every iteration of an
> application. Many applications have this already by nature of the
> data-flow required; anything that calls mpi_allgather or
> mpi_allreduce is the easiest to verify, but there are many other
> ways of achieving the same thing. My point is about the subset of
> programs which don't have this attribute and are therefore
> susceptible to synchronisation problems. It's my experience that
> for low iteration counts these codes can run fine but once they hit
> a problem they go over a cliff edge performance-wise and there is no
> way back from there until the end of the job. The email from
> Gabriele would appear to be a case that demonstrates this problem
> but I've seen it many times before.
> Using your previous email as an example, I would describe adding
> barriers to a problem as a way of artificially reducing the
> "elasticity" of the program to ensure balanced use of resources.
> > I may be missing something but my statement for the text book would be
> > "If adding a barrier to your MPI program makes it run faster,
> > there is almost certainly a flaw in it that is better solved another way."
> > The only exception I can think of is some sort of one-directional
> > data dependency with messages small enough to go eagerly. A program
> > that calls MPI_Reduce with a small message and the same root every
> > iteration and calls no other collective would be an example.
> > In that case, fast tasks at leaf positions would run free and a
> > slow task near the root could pile up early arrivals and end up with
> > some additional slowing. Unless it was driven into paging I cannot
> > imagine the slowdown would be significant though.
> I've diagnosed problems where the cause was a receive queue of tens
> of thousands of messages; in this case each and every receive
> performs slowly unless the descriptor is near the front of the queue,
> so the concern is not purely about memory usage at individual
> processes, although that can also be a factor.
> Ashley Pittman, Bath, UK.
> Padb - A parallel job inspection tool for cluster computing