Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI_Reduce performance
From: Ashley Pittman (ashley_at_[hidden])
Date: 2010-09-09 15:34:38

On 9 Sep 2010, at 17:00, Gus Correa wrote:

> Hello All
> Gabrielle's question, Ashley's recipe, and Dick Treutmann's cautionary words, may be part of a larger context of load balance, or not?
> Would Ashley's recipe of sporadic barriers be a silver bullet to
> improve load imbalance problems, regardless of which collectives or
> even point-to-point calls are in use?

No, it only helps where there is no data dependency between some of the ranks. In particular, if there are any non-rooted collectives in an iteration of your code then it cannot make any difference at all. Likewise, if you have a reduce followed by a barrier using the same root, for example, then you already have global synchronisation each iteration and it won't help. My feeling is that it applies to a significant minority of problems; certainly the phrase "adding barriers can make codes faster" should be textbook stuff if it isn't already.
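
For reference, the sporadic barrier being discussed could be sketched roughly as below. This is only an illustration, not code from the thread: the function and callback names are hypothetical, and the barrier is passed in as a callback so that the sketch compiles without an MPI installation. In a real MPI program the step callback would hold one iteration of compute plus rooted collectives such as MPI_Reduce, and the barrier callback would simply call MPI_Barrier(MPI_COMM_WORLD).

```c
#include <assert.h>

/* Hypothetical callback types, introduced for illustration only.
 * step:    one iteration of the application (compute + rooted collectives).
 * barrier: in an MPI code this would wrap MPI_Barrier(MPI_COMM_WORLD). */
typedef void (*step_fn)(int iter, void *ctx);
typedef void (*barrier_fn)(void *ctx);

/* Run n_iters iterations and call `barrier` after every `interval`-th one.
 * Returns the number of barriers issued. */
int run_with_sporadic_barrier(int n_iters, int interval,
                              step_fn step, barrier_fn barrier, void *ctx)
{
    int barriers = 0;

    assert(interval > 0);
    for (int iter = 1; iter <= n_iters; ++iter) {
        step(iter, ctx);
        if (iter % interval == 0) {
            /* Resynchronise all ranks so that delays cannot accumulate
             * across more than `interval` iterations. */
            barrier(ctx);
            ++barriers;
        }
    }
    return barriers;
}

/* Stand-ins that exercise the loop without MPI. */
void noop_step(int iter, void *ctx) { (void)iter; (void)ctx; }
void count_barrier(void *ctx) { ++*(int *)ctx; }
```

With the interval of 25 mentioned in this thread, 100 iterations would issue four barriers. The interval is a tuning parameter, not a fixed rule.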

> Would sporadic barriers in the flux coupler "shake up" these delays?

I don't fully understand your description, but it sounds like it might set the program back to a clean slate, which would give you per-iteration delays only rather than cumulative or worse delays.

> Ashley: How did you get to the magic number of 25 iterations for the
> sporadic barriers?

Experience and finger in the air. The major factors in picking this number are the likelihood of a positive feedback cycle of delays happening, the extra delay those delays add, and the cost of a barrier itself. Having too low a value will slightly reduce performance; having too high a value can drastically reduce performance.

As a further item (because I like them), the asynchronous barrier is even better again if used properly. In the good case it never causes any process to block, so the cost is only the CPU cycles the barrier code itself takes. In the bad case, where it has to delay a rank, this tends to have a positive impact on performance.

> Would it be application/communicator pattern dependent?



Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing