Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI_Reduce performance
From: Ashley Pittman (ashley_at_[hidden])
Date: 2010-09-09 17:34:15


On 9 Sep 2010, at 21:40, Richard Treumann wrote:

>
> Ashley
>
> Can you provide an example of a situation in which these semantically redundant barriers help?

I'm not making the case for semantically redundant barriers; I'm making the case for implicit synchronisation in every iteration of an application. Many applications already have this by nature of the data flow they require: anything that calls MPI_Allgather or MPI_Allreduce is the easiest to verify, but there are many other ways of achieving the same thing. My point is about the subset of programs that don't have this attribute and are therefore susceptible to synchronisation problems. In my experience these codes can run fine at low iteration counts, but once they hit a problem they go over a cliff edge performance-wise and there is no way back until the end of the job. The email from Gabriele appears to be a case that demonstrates this problem, and I've seen it many times before.

Using your previous email as an example, I would describe adding barriers to a program as a way of artificially reducing the "elasticity" of the program to ensure balanced use of resources.
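To make the "elasticity" idea concrete, here is a minimal sketch in plain C (no MPI calls) of a one-directional data flow: a fast sender posts one eager message per iteration and a slower receiver drains them. The function name `peak_queue_depth`, the tick-based cost model, and the treatment of a barrier as "the sender waits until the receiver has caught up" are all assumptions made for illustration; this is not how any particular MPI library behaves internally.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical model: returns the largest number of messages that were
 * sent but not yet fully received at any send time.
 *   iters      - number of iterations (one message each)
 *   send_cost  - ticks the sender spends per iteration
 *   recv_cost  - ticks the receiver spends per message
 *   sync_every - synchronise every N iterations, 0 for never
 * Returns -1 on allocation failure.
 */
long peak_queue_depth(long iters, long send_cost, long recv_cost, long sync_every)
{
    long *recv_done = malloc((size_t)iters * sizeof *recv_done);
    if (recv_done == NULL)
        return -1;

    long send_clock = 0; /* sender's local time */
    long recv_clock = 0; /* time at which the receiver is next free */
    long consumed = 0;   /* messages fully received by the current send time */
    long peak = 0;

    for (long i = 0; i < iters; i++) {
        /* Message i leaves the sender (and arrives eagerly) at send_clock. */
        send_clock += send_cost;

        /* The receiver handles it once it is free and the message has arrived. */
        long start = recv_clock > send_clock ? recv_clock : send_clock;
        recv_clock = start + recv_cost;
        recv_done[i] = recv_clock;

        /* Count the messages already drained at this send time. */
        while (consumed <= i && recv_done[consumed] <= send_clock)
            consumed++;

        long depth = (i + 1) - consumed;
        if (depth > peak)
            peak = depth;

        /* Modelled barrier: the sender waits for the receiver to catch up. */
        if (sync_every > 0 && (i + 1) % sync_every == 0 && recv_clock > send_clock)
            send_clock = recv_clock;
    }

    free(recv_done);
    return peak;
}
```

With a sender twice as fast as the receiver, calling `peak_queue_depth(10000, 1, 2, 0)` lets the backlog grow in proportion to the iteration count, whereas `peak_queue_depth(10000, 1, 2, 10)` keeps it bounded by the synchronisation interval. The total run time is set by the slow receiver either way; what the synchronisation changes in this model is how far ahead the fast side is allowed to run.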

> I may be missing something but my statement for the text book would be
>
> "If adding a barrier to your MPI program makes it run faster, there is almost certainly a flaw in it that is better solved another way."
>
> The only exception I can think of is some sort of one-directional data dependency with messages small enough to go eagerly. A program that calls MPI_Reduce with a small message and the same root every iteration and calls no other collective would be an example.
>
> In that case, fast tasks at leaf positions would run free and a slow task near the root could pile up early arrivals and end up with some additional slowing. Unless it was driven into paging I cannot imagine the slowdown would be significant though.

I've diagnosed problems where the cause was a receive queue of tens of thousands of messages. In that case each and every receive performs slowly unless its descriptor is near the front of the queue, so the concern is not purely about memory usage at individual processes, although that can also be a factor.
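The cost described above can be sketched with a toy matching routine, assuming the queue is searched linearly from the front. The `envelope` type and `match_cost` function are hypothetical names invented for this illustration, and real MPI implementations may organise their queues differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified message envelope used for matching. */
typedef struct {
    int source;
    int tag;
} envelope;

/*
 * Scan the queue from the front and return how many entries were
 * examined to find a match (1 means it was at the head), or 0 if
 * nothing in the queue matches.
 */
size_t match_cost(const envelope *queue, size_t len, int source, int tag)
{
    for (size_t i = 0; i < len; i++) {
        if (queue[i].source == source && queue[i].tag == tag)
            return i + 1;
    }
    return 0;
}
```

For a queue of 20000 entries, a receive that matches the head entry examines one envelope, while a receive that matches the last entry examines all 20000. Under this assumption every receive pays a cost proportional to the backlog in front of it, which is why a deep queue slows the task down even when memory is not under pressure.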

Ashley,

-- 
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk