Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI_Reduce performance
From: Ashley Pittman (ashley_at_[hidden])
Date: 2010-09-09 15:34:38

On 9 Sep 2010, at 17:00, Gus Correa wrote:

> Hello All
> Gabrielle's question, Ashley's recipe, and Dick Treutmann's cautionary words, may be part of a larger context of load balance, or not?
> Would Ashley's recipe of sporadic barriers be a silver bullet to
> improve load imbalance problems, regardless of which collectives or
> even point-to-point calls are in use?

No, it only holds where there is no data dependency between some of the ranks. In particular, if there are any non-rooted collectives in an iteration of your code then it cannot make any difference at all. Likewise, if you have, for example, a reduce followed by a barrier using the same root, then you already have global synchronisation each iteration and it won't help. My feeling is that it applies to a significant minority of problems; certainly the phrase "adding barriers can make codes faster" should be textbook stuff if it isn't already.

> Would sporadic barriers in the flux coupler "shake up" these delays?

I don't fully understand your description, but it sounds like it might set the program back to a clean slate, which would give you per-iteration delays only rather than cumulative or worse delays.

> Ashley: How did you get to the magic number of 25 iterations for the
> sporadic barriers?

Experience and finger in the air. The major factors in picking this number are the likelihood of a positive feedback cycle of delays happening, the delays these delays add, and the cost of a barrier itself. Having too low a value will slightly reduce performance; having too high a value can drastically reduce performance.

As a further item (because I like them), the asynchronous barrier is even better again if used properly. In the good case it doesn't cause any process to block ever, so the cost is only that of the CPU cycles the code takes itself. In the bad case, where it has to delay a rank, this tends to have a positive impact on performance.

> Would it be application/communicator pattern dependent?



Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing