Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [RFC] Low pressure OPAL progress
From: Terry Dontje (Terry.Dontje_at_[hidden])
Date: 2009-06-09 07:28:34

Sylvain Jeaugey wrote:
> Hi Ralph,
> I'm entirely convinced that MPI doesn't have to save power in a normal
> scenario. The idea is just that if an MPI process is blocked (i.e. has
> not made progress for, say, 5 minutes, the default in my
> implementation), we stop busy polling and have the process drop from
> 100% CPU usage to 0%.
> I do not call sleep() but usleep(). The result is much the same, but
> it hurts performance less in case of an (unexpected) restart.
> However, the goal of my RFC was also to find out whether there is a
> cleaner way to achieve this, and from what I read, I guess I should
> look at the "tick" rate instead of trying to do my own delaying.
One way around this is to make all blocked communications (even SM)
use poll to block for incoming messages. Jeff and I have discussed this
and had many false starts on it. The biggest issue is coming up with a
way to have blocks on the SM btl converted to the system poll call
without requiring a socket write for every packet.
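
A minimal sketch of the kind of scheme meant here, not anything that
exists in the SM btl: the receiver announces that it is about to block,
and the sender pays for a write() only in that case. The names
(wakeup_t, wakeup_notify, wakeup_wait) and the use of a pipe as the
wakeup descriptor are assumptions made purely for illustration.

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical wakeup channel: a pipe plus a flag telling the sender
 * whether the receiver is currently blocked in poll(). */
typedef struct {
    int fds[2];             /* fds[0]: read end, fds[1]: write end */
    volatile bool sleeping; /* set by the receiver before it blocks */
} wakeup_t;

static int wakeup_init(wakeup_t *w)
{
    assert(w != NULL);
    w->sleeping = false;
    return pipe(w->fds);    /* 0 on success, -1 on error */
}

/* Sender side: write a wakeup byte only if the receiver is asleep.
 * Returns 1 if a byte was written, 0 if none was needed, -1 on error. */
static int wakeup_notify(wakeup_t *w)
{
    assert(w != NULL);
    if (!w->sleeping)
        return 0;           /* receiver is busy polling: no syscall */
    char c = 1;
    return write(w->fds[1], &c, 1) == 1 ? 1 : -1;
}

/* Receiver side: block for at most timeout_ms milliseconds.
 * Returns 1 if woken by the sender, 0 on timeout, -1 on error. */
static int wakeup_wait(wakeup_t *w, int timeout_ms)
{
    assert(w != NULL);
    struct pollfd pfd = { .fd = w->fds[0], .events = POLLIN };
    w->sleeping = true;
    int rc = poll(&pfd, 1, timeout_ms);
    w->sleeping = false;
    if (rc > 0) {
        char c;
        if (read(w->fds[0], &c, 1) < 0)  /* drain the wakeup byte */
            return -1;
        return 1;
    }
    return rc;
}
```

In a real shared-memory transport the flag would have to live in the
shared segment and be accessed with proper atomics and memory barriers,
and the window between setting the flag and entering poll() would need
care. The sketch ignores both.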

The usleep solution works but is kind of ugly IMO. I think when I
looked at doing that, the overhead increased significantly for certain
communications. Maybe not for toy benchmarks, but for less synchronized
processes I saw the usleep adding overhead where I didn't want it to.

> Don't worry, I was quite expecting the configure-in requirement.
> However, I don't think my patch is good for inclusion, it is only an
> example to describe what I want to achieve.
> Thanks a lot for your comments,
> Sylvain
> On Mon, 8 Jun 2009, Ralph Castain wrote:
>> I'm not entirely convinced this actually achieves your goals, but I
>> can see some potential benefits. I'm also not sure that power
>> consumption is that big of an issue that MPI needs to begin chasing
>> "power saver" modes of operation, but that can be a separate debate
>> some day.
>> I'm assuming you don't mean that you actually call "sleep()" as this
>> would be very bad - I'm assuming you just change the opal_progress
>> "tick" rate instead. True? If not, and you really call "sleep", then
>> I would have to oppose adding this to the code base pending
>> discussion with others who can corroborate that this won't cause
>> problems.
>> Either way, I could live with this so long as it was done as a
>> "configure-in" capability. Just having the params default to a value
>> that causes the system to behave similarly to today isn't enough - we
>> still wind up adding logic into a very critical timing loop for no
>> reason. A simple configure option of --enable-mpi-progress-monitoring
>> would be sufficient to protect the code.
>> HTH
>> Ralph
>> On Jun 8, 2009, at 9:50 AM, Sylvain Jeaugey wrote:
>>> What : when nothing has been received for a very long time - e.g. 5
>>> minutes, stop busy polling in opal_progress and switch to a
>>> usleep-based one.
>>> Why : when we have long waits, and especially when an application is
>>> deadlocked, detecting it is not easy and a lot of power is wasted
>>> until the end of the time slice (if there is one).
>>> Where : an example of how it could be implemented is available at
>>> Principle
>>> =========
>>> opal_progress() ensures the progression of MPI communication. The
>>> current algorithm is a loop calling progress on all registered
>>> components. If the program is blocked, the loop will busy-poll
>>> indefinitely.
>>> Going to sleep after a certain amount of time with nothing received
>>> is useful for two reasons :
>>> - An administrator can easily detect whether a job is deadlocked :
>>> all the processes are in sleep(). Currently, all processors are
>>> using 100% cpu and it is very hard to know if progression is still
>>> happening or not.
>>> - When there is nothing to receive, power usage is highly reduced.
>>> However, it could hurt performance in some cases, typically if we go
>>> to sleep just before the message arrives. This will highly depend on
>>> the parameters you give to the sleep mechanism.
>>> At first, we can start with the following assumption : if the sleep
>>> takes T usec, then sleeping only after 10000 x T of idle time should
>>> slow down receives by less than 0.01% (T / (10000 x T)).
>>> However, other processes may suffer from you being late, and be
>>> delayed by T usec (which may represent more than 0.01% for them).
>>> So, the goal of this mechanism is mainly to detect
>>> far-too-long waits, and it should almost never be triggered in
>>> normal MPI jobs. It could also emit a warning message when starting
>>> to sleep, or at least a trace in the notifier.
>>> Details of Implementation
>>> =========================
>>> Three parameters fully control the behaviour of this mechanism :
>>> * opal_progress_sleep_count : number of unsuccessful opal_progress()
>>> calls before we start the timer (to prevent latency impact). It
>>> defaults to -1, which completely deactivates the sleep (and is
>>> therefore equivalent to the former code). A value of 1000 can be
>>> thought of as a starting point to enable this mechanism.
>>> * opal_progress_sleep_trigger : time to wait before going to
>>> low-pressure-powersave mode. Default : 600 (in seconds) = 10 minutes.
>>> * opal_progress_sleep_duration : time we sleep at each further
>>> unsuccessful call to opal_progress(). Default : 1000 (in us) = 1 ms.
>>> The duration is big enough to make the process show 0% CPU in top,
>>> but low enough to preserve a good trigger/duration ratio.
>>> The trigger is deliberately high to keep a good trigger/duration
>>> ratio. Indeed, to prevent delays from causing chain reactions, the
>>> trigger should be higher than duration * numprocs.
>>> Possible Improvements & Pitfalls
>>> ================================
>>> * Trigger could be set automatically at max(trigger, duration *
>>> numprocs * 2).
>>> * poll_start and poll_count could be fields of the opal_condition_t
>>> struct.
>>> * The sleep section could be moved into a #define and reused in all
>>> the progress paths (I'm not sure my patch is good for progress
>>> threads, for example).
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]