Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI_Reduce performance
From: Richard Treumann (treumann_at_[hidden])
Date: 2010-09-10 10:08:26


Hi Ashley

I understand that descriptor flooding can be a serious problem in an
application with a unidirectional data dependency. Perhaps we have a
different perception of how common that is.

It seems to me that such programs would be very rare, but if they are more
common than I imagine, then discussion of how to modulate them is
worthwhile. In many cases, I think that adding some flow control to the
application is a better solution than a semantically redundant barrier. (A
barrier that is there only to affect performance, not correctness, is what
I mean by semantically redundant.)

For example, a Master/Worker application could have each worker pause
after every 4th send to the master and post an MPI_Recv for an
OK_to_continue token. If the token had already been sent, this would delay
the worker only a few microseconds. If it had not been sent, the worker
would be kept waiting until it arrived.

The Master would keep track of how many messages from each worker it had
absorbed and on message 3 from a particular worker, send an OK_to_continue
token to that worker. The master would keep sending OK_to_continue tokens
every 4th recv from then on (7, 11, 15, ...). The descriptor queues would
all remain short and only a worker that the master could not keep up with
would ever lose a chance to keep working. By sending the OK_to_continue
token a bit early, the application would ensure that when there was no
backlog, every worker would find a token when it looked for it and there
would be no significant loss of compute time.
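
The token schedule described above can be checked without running MPI at
all. What follows is only a minimal sketch in plain C, not code from this
thread: the names TOKEN_INTERVAL, TOKEN_LEAD, worker_must_wait,
master_sends_token and max_backlog are all hypothetical, and the model
assumes a single worker that sends as fast as it is allowed to while the
master absorbs messages only when the worker is blocked (the worst case
for queue length). In a real program the worker would post MPI_Recv where
worker_must_wait returns true, and the master would MPI_Send the token
where master_sends_token returns true.

```c
#include <stdbool.h>

/* Hypothetical parameters, taken from the numbers in the example above. */
enum {
    TOKEN_INTERVAL = 4, /* worker looks for a token after every 4th send */
    TOKEN_LEAD     = 1  /* master sends each token one message early     */
};

/* Worker side: true if, right after send number sends_done (1-based),
 * the worker should stop and wait for an OK_to_continue token. */
bool worker_must_wait(int sends_done)
{
    return sends_done > 0 && sends_done % TOKEN_INTERVAL == 0;
}

/* Master side: true if, right after absorbing message number recvs_done
 * (1-based) from one worker, the master should send that worker a token.
 * With the values above this is true for 3, 7, 11, 15, ... */
bool master_sends_token(int recvs_done)
{
    return recvs_done > 0 && (recvs_done + TOKEN_LEAD) % TOKEN_INTERVAL == 0;
}

/* Worst-case model for one worker: the worker sends whenever it is not
 * blocked, and the master absorbs a message only when the worker cannot
 * send. Returns the largest number of messages that were sent but not
 * yet absorbed at any point. */
int max_backlog(int total_msgs)
{
    int sent = 0, absorbed = 0, tokens = 0, worst = 0;
    bool blocked = false;

    while (absorbed < total_msgs) {
        if (blocked && tokens > 0) {   /* token found: worker resumes */
            tokens--;
            blocked = false;
        }
        if (!blocked && sent < total_msgs) {
            sent++;
            if (sent - absorbed > worst)
                worst = sent - absorbed;
            blocked = worker_must_wait(sent);
        } else {
            absorbed++;                /* master catches up by one message */
            if (master_sends_token(absorbed))
                tokens++;
        }
    }
    return worst;
}
```

Under these assumptions max_backlog(100) evaluates to 5: a worker can
never be more than five messages ahead of the master, however slow the
master is, which is the sense in which the descriptor queues stay short.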

Even with a non-blocking barrier and a 10 step lag between Ibarrier and
Wait, if some worker is slow for 12 steps, the fast workers end up being
kept in a non-productive MPI_Wait.

                  Dick

Dick Treumann - MPI Team
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363

users-bounces_at_[hidden] wrote on 09/09/2010 05:34:15 PM:

> From: Ashley Pittman
> To: Open MPI Users
> Date: 09/09/2010 05:37 PM
> Subject: Re: [OMPI users] MPI_Reduce performance
> Sent by: users-bounces_at_[hidden]
> Please respond to Open MPI Users
>
> On 9 Sep 2010, at 21:40, Richard Treumann wrote:
>
> >
> > Ashley
> >
> > Can you provide an example of a situation in which these
> > semantically redundant barriers help?
>
> I'm not making the case for semantically redundant barriers, I'm
> making a case for implicit synchronisation in every iteration of an
> application. Many applications have this already by nature of the
> data-flow required; anything that calls mpi_allgather or
> mpi_allreduce is the easiest to verify, but there are many other
> ways of achieving the same thing. My point is about the subset of
> programs which don't have this attribute and are therefore
> susceptible to synchronisation problems. It's my experience that
> for low iteration counts these codes can run fine, but once they hit
> a problem they go over a cliff edge performance-wise and there is no
> way back from there until the end of the job. The email from
> Gabriele would appear to be a case that demonstrates this problem,
> but I've seen it many times before.
>
> Using your previous email as an example, I would describe adding
> barriers to a problem as a way of artificially reducing the
> "elasticity" of the program to ensure balanced use of resources.
>
> > I may be missing something but my statement for the text book would be
> >
> > "If adding a barrier to your MPI program makes it run faster,
> > there is almost certainly a flaw in it that is better solved another
> > way."
> >
> > The only exception I can think of is some sort of one direction
> > data dependency with messages small enough to go eagerly. A program
> > that calls MPI_Reduce with a small message and the same root every
> > iteration and calls no other collective would be an example.
> >
> > In that case, fast tasks at leaf positions would run free and a
> > slow task near the root could pile up early arrivals and end up with
> > some additional slowing. Unless it was driven into paging I cannot
> > imagine the slowdown would be significant though.
>
> I've diagnosed problems where the cause was a receive queue of tens
> of thousands of messages. In this case each and every receive
> performs slowly unless the descriptor is near the front of the queue,
> so the concern is not purely about memory usage at individual
> processes, although that can also be a factor.
>
> Ashley,
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
>