Ashley's observation may apply to an application that iterates on
many-to-one communication patterns. If the only collective used is
MPI_Reduce, some non-root tasks can get ahead and keep pushing iteration
results at tasks that are nearer the root. This could overload those
tasks and cause some extra slowdown.

In most parallel applications, there is some web of interdependency
across tasks between iterations that keeps them roughly in step. I find
it hard to believe that there are many programs that need semantically
redundant MPI_Barriers.

For example:

In a program that does neighbor communication, no task can get very far
ahead of its neighbors. It is possible for a task at one corner to be a
few steps ahead of one at the opposite corner, but only a few steps. In
this case, though, the distant task is not being affected by the task
that is out ahead anyway. It is only affected by its immediate neighbors.

In a program that does an MPI_Bcast from root and an MPI_Reduce to root
in each iteration, no task gets far ahead, because a task that finishes
the Bcast early just waits longer at the Reduce.

A program that makes a call to a non-rooted collective every iteration
will stay in pretty tight synch.
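
The neighbor-communication case can be checked with a toy model. The sketch below is plain Python, not MPI code, and its names (`simulate_chain`, `periods`) are made up for illustration. It assumes a 1-D chain of tasks in which a task may begin step k+1 only once each immediate neighbor has finished step k, and it reports the largest lead seen between adjacent tasks and between the two ends of the chain.

```python
def simulate_chain(periods, ticks):
    """Toy model of neighbor exchange on a 1-D chain (an assumption, not MPI).

    periods[r] is the number of clock ticks task r needs per step.
    Returns (largest lead between adjacent tasks,
             largest lead between the two ends of the chain).
    """
    n = len(periods)
    done = [0] * n              # steps completed by each task
    max_adjacent = 0
    max_end_to_end = 0
    for t in range(1, ticks + 1):
        before = list(done)     # progress visible at the start of this tick
        for r, p in enumerate(periods):
            if t % p != 0:
                continue        # task r is still computing its current step
            neighbors = [q for q in (r - 1, r + 1) if 0 <= q < n]
            # The next step needs every neighbor's result from the step
            # this task has just completed.
            if all(before[q] >= before[r] for q in neighbors):
                done[r] += 1
        adjacent = max(abs(done[r] - done[r + 1]) for r in range(n - 1))
        max_adjacent = max(max_adjacent, adjacent)
        max_end_to_end = max(max_end_to_end, abs(done[0] - done[-1]))
    return max_adjacent, max_end_to_end


# Five fast tasks and one task at the far end that is five times slower.
adjacent_lead, end_to_end_lead = simulate_chain([1, 1, 1, 1, 1, 5], 500)
print("largest adjacent lead:", adjacent_lead)
print("largest end-to-end lead:", end_to_end_lead)
```

Under these assumptions adjacent tasks should never differ by more than one step, and the two ends by no more than the number of hops between them, however slow the last task is.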

Think carefully before tossing in either MPI_Barrier or some non-blocking
barrier. Unless MPI_Bcast or MPI_Reduce is the only collective you call,
your problem is likely not progress skew.

Dick Treumann - MPI Team
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363
From: Ashley Pittman <ashley@pittman.co.uk>
To: Open MPI Users <users@open-mpi.org>
Date: 09/09/2010 03:53 AM
Subject: Re: [OMPI users] MPI_Reduce performance
Sent by: users-bounces@open-mpi.org

On 9 Sep 2010, at 08:31, Terry Frankcombe wrote:
> On Thu, 2010-09-09 at 01:24 -0600, Ralph Castain wrote:
>> As people have said, these time values are to be expected. All they
>> reflect is the time difference spent in reduce waiting for the slowest
>> process to catch up to everyone else. The barrier removes that factor
>> by forcing all processes to start from the same place.
>>
>>
>> No mystery here - just a reflection of the fact that your processes
>> arrive at the MPI_Reduce calls at different times.
>
>
> Yes, however, it seems Gabriele is saying the total execution time
> *drops* by ~500 s when the barrier is put *in*. (Is that the right way
> around, Gabriele?)
>
> That's harder to explain as a sync issue.

Not really. You need some way of keeping processes in sync, or else the
slow ones get slower and the fast ones stay fast. If you have an
unbalanced algorithm then you can end up swamping certain ranks, and when
they get behind they get even slower and performance goes off a cliff edge.

Adding sporadic barriers keeps everything in sync and running nicely. If
things are performing well then the barrier only slows things down, but
if there is a problem it'll bring all processes back together and destroy
the positive feedback cycle. This is why you often only need a
synchronisation point every so often.

I'm also a huge fan of asynchronous barriers, as a full sync is a blunt
and slow operation. Using asynchronous barriers you can allow small
differences in timing but prevent them from getting too large, with very
little overhead in the common case where processes are synced already.
I'm thinking specifically of starting a sync-barrier on iteration N,
waiting for it on N+25 and immediately starting another one, again
waiting for it 25 steps later.
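
The N / N+25 scheme can be explored with a toy model. This is a plain-Python sketch rather than MPI code, and `simulate_skew`, `periods` and `interval` are hypothetical names. It assumes that a rank reaching an iteration that is a multiple of `interval` must wait until every rank has reached the iteration where the previous barrier was started; between those points the ranks run freely.

```python
def simulate_skew(periods, interval, ticks):
    """Toy model of a periodic asynchronous barrier (an assumption, not MPI).

    periods[r] is the number of clock ticks rank r needs per iteration.
    interval is the barrier spacing in iterations; None disables barriers.
    Returns the largest lead of the fastest rank over the slowest rank.
    """
    done = [0] * len(periods)   # iterations completed by each rank
    worst = 0
    for t in range(1, ticks + 1):
        slowest = min(done)     # progress of the slowest rank at tick start
        for r, p in enumerate(periods):
            if t % p != 0:
                continue        # rank r is still computing
            i = done[r]
            at_wait_point = bool(interval) and i >= interval and i % interval == 0
            # The barrier started at iteration i - interval completes only
            # once every rank has reached that iteration.
            if at_wait_point and slowest < i - interval:
                continue        # barrier not complete yet, rank r stalls
            done[r] += 1
        worst = max(worst, max(done) - min(done))
    return worst


periods = [1, 1, 1, 4]          # three fast ranks, one rank four times slower
print("largest lead, no barrier:", simulate_skew(periods, None, 2000))
print("largest lead, barrier every 25:", simulate_skew(periods, 25, 2000))
```

In this model the lead of the fastest rank stays within two barrier intervals, while without the barrier it keeps growing with the length of the run. In MPI implementations that provide MPI_Ibarrier (MPI-3), the start and wait steps would presumably map onto MPI_Ibarrier and MPI_Wait.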