
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] [RFC] Low pressure OPAL progress
From: Terry Dontje (Terry.Dontje_at_[hidden])
Date: 2009-06-09 07:28:34


Sylvain Jeaugey wrote:
> Hi Ralph,
>
> I'm entirely convinced that MPI doesn't have to save power in a normal
> scenario. The idea is just that if an MPI process is blocked (i.e. has
> not made progress for, say, 5 minutes, the default in my
> implementation), we stop busy polling and have the process drop from
> 100% CPU usage to 0%.
>
> I do not call sleep() but usleep(). The result is much the same, but
> it hurts performance less in case of an (unexpected) restart.
>
> However, the goal of my RFC was also to find out whether there is a
> cleaner way to achieve this, and from what I read, I guess I should look
> at the "tick" rate instead of trying to do my own delaying.
>
One way around this is to make all blocking communications (even SM)
use poll to block for incoming messages. Jeff and I have discussed this
and had many false starts on it. The biggest issue is coming up with a
way to convert blocking on the SM btl into the system poll call
without requiring a socket write for every packet.
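
To make the "no socket write for every packet" constraint concrete, here is a minimal sketch of one common pattern: the receiver raises a flag before it blocks in poll(), and a sender writes a wakeup byte only when it finds that flag raised. Everything here (wakeup_t, sender_post, receiver_wait, the pending counter standing in for the shared-memory queue) is hypothetical and not taken from the SM btl. It also assumes the flag and counter live in memory both sides can see and that the pipe is shared, which is trivial in one process but needs real plumbing between processes.

```c
#include <poll.h>
#include <stdatomic.h>
#include <unistd.h>

/* Hypothetical wakeup channel for one receiver; not an Open MPI structure. */
typedef struct {
    int pipe_fd[2];      /* [0] is polled by the receiver, [1] is written by senders */
    atomic_int waiting;  /* 1 while the receiver is blocked (or about to block) in poll() */
    atomic_int pending;  /* stand-in for "packets queued in shared memory" */
} wakeup_t;

static int wakeup_init(wakeup_t *w) {
    atomic_init(&w->waiting, 0);
    atomic_init(&w->pending, 0);
    return pipe(w->pipe_fd);
}

/* Sender: publish the packet first, then pay for a write() only if the
 * receiver has declared itself blocked. Returns 1 if a wakeup byte was sent. */
static int sender_post(wakeup_t *w) {
    atomic_fetch_add(&w->pending, 1);
    if (atomic_exchange(&w->waiting, 0) == 1) {
        char c = 1;
        return write(w->pipe_fd[1], &c, 1) == 1;
    }
    return 0;
}

/* Receiver: raise the flag, re-check the queue, and only then block in poll().
 * Returns the number of packets found. A leftover wakeup byte can cause one
 * spurious early return from a later call, which is harmless here. */
static int receiver_wait(wakeup_t *w, int timeout_ms) {
    atomic_store(&w->waiting, 1);
    if (atomic_load(&w->pending) == 0) {
        struct pollfd p = { .fd = w->pipe_fd[0], .events = POLLIN };
        if (poll(&p, 1, timeout_ms) > 0) {
            char c;
            ssize_t n = read(w->pipe_fd[0], &c, 1);  /* consume the wakeup byte */
            (void)n;
        }
    }
    atomic_store(&w->waiting, 0);
    return atomic_exchange(&w->pending, 0);
}
```

In the common case where the receiver is still polling the queue, the sender's cost is two atomic operations and no system call. Whether that is cheap enough for the SM fast path is exactly the open question.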

The usleep solution works but is kind of ugly IMO. I think when I
looked at doing that, the overhead increased significantly for certain
communications. Maybe not for toy benchmarks, but for less synchronized
processes I saw the usleep adding overhead where I didn't want it to.

--td
> Don't worry, I was quite expecting the configure-in requirement.
> However, I don't think my patch is good for inclusion, it is only an
> example to describe what I want to achieve.
>
> Thanks a lot for your comments,
> Sylvain
>
> On Mon, 8 Jun 2009, Ralph Castain wrote:
>
>> I'm not entirely convinced this actually achieves your goals, but I
>> can see some potential benefits. I'm also not sure that power
>> consumption is that big of an issue that MPI needs to begin chasing
>> "power saver" modes of operation, but that can be a separate debate
>> some day.
>>
>> I'm assuming you don't mean that you actually call "sleep()" as this
>> would be very bad - I'm assuming you just change the opal_progress
>> "tick" rate instead. True? If not, and you really call "sleep", then
>> I would have to oppose adding this to the code base pending
>> discussion with others who can corroborate that this won't cause
>> problems.
>>
>> Either way, I could live with this so long as it was done as a
>> "configure-in" capability. Just having the params default to a value
>> that causes the system to behave similarly to today isn't enough - we
>> still wind up adding logic into a very critical timing loop for no
>> reason. A simple configure option of --enable-mpi-progress-monitoring
>> would be sufficient to protect the code.
>>
>> HTH
>> Ralph
>>
>>
>> On Jun 8, 2009, at 9:50 AM, Sylvain Jeaugey wrote:
>>
>>> What : when nothing has been received for a very long time - e.g. 5
>>> minutes, stop busy polling in opal_progress and switch to a
>>> usleep-based one.
>>>
>>> Why : when we have long waits, and especially when an application is
>>> deadlocked, detecting it is not easy and a lot of power is wasted
>>> until the end of the time slice (if there is one).
>>>
>>> Where : an example of how it could be implemented is available at
>>> http://bitbucket.org/jeaugeys/low-pressure-opal-progress/
>>>
>>> Principle
>>> =========
>>>
>>> opal_progress() ensures the progression of MPI communication. The
>>> current algorithm is a loop calling progress on all registered
>>> components. If the program is blocked, the loop will busy-poll
>>> indefinitely.
>>>
>>> Going to sleep after a certain amount of time with nothing received
>>> is useful for two reasons :
>>> - An administrator can easily detect whether a job is deadlocked : all
>>> the processes are in sleep(). Currently, all processors are using
>>> 100% CPU and it is very hard to know whether progress is still
>>> being made.
>>> - When there is nothing to receive, power usage is greatly reduced.
>>>
>>> However, it could hurt performance in some cases, typically if we go
>>> to sleep just before the message arrives. This depends heavily on
>>> the parameters you give to the sleep mechanism.
>>>
>>> As a first approximation, assume the following : if the sleep
>>> takes T usec, then starting to sleep only after 10000xT of idle time
>>> should slow down receives by no more than 0.01% (T / 10000xT).
>>>
>>> However, other processes may suffer from you being late, and be
>>> delayed by T usec (which may represent more than 0.01% for them).
>>>
>>> So, the goal of this mechanism is mainly to detect
>>> far-too-long waits, and it should almost never kick in during normal
>>> MPI jobs. It could also trigger a warning message when starting to
>>> sleep, or at least a trace in the notifier.
>>>
>>> Details of Implementation
>>> =========================
>>>
>>> Three parameters fully control the behaviour of this mechanism :
>>> * opal_progress_sleep_count : number of unsuccessful opal_progress()
>>> calls before we start the timer (to prevent latency impact). It
>>> defaults to -1, which completely deactivates the sleep (and is
>>> therefore equivalent to the former code). A value of 1000 can be
>>> thought of as a starting point to enable this mechanism.
>>> * opal_progress_sleep_trigger : time to wait before going to
>>> low-pressure-powersave mode. Default : 600 (in seconds) = 10 minutes.
>>> * opal_progress_sleep_duration : time we sleep at each further
>>> unsuccessful call to opal_progress(). Default : 1000 (in us) = 1 ms.
>>>
>>> The duration is big enough to make the process show 0% CPU in top,
>>> but low enough to preserve a good trigger/duration ratio.
>>>
>>> The trigger is deliberately high to keep a good trigger/duration ratio.
>>> Indeed, to prevent delays from causing chain reactions, trigger
>>> should be higher than duration * numprocs.
>>>
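The three parameters described above can be read as a small state machine. The following is a minimal sketch of that reading, not the code from the patch: lp_state_t and lp_after_progress are made-up names, only the parameter names and units come from the RFC, and the clock is passed in as an argument so the logic can be exercised without waiting.

```c
#include <stdint.h>

/* Hypothetical state; the three parameter fields mirror the RFC's names. */
typedef struct {
    int64_t sleep_count;        /* opal_progress_sleep_count, -1 disables the mechanism */
    int64_t sleep_trigger_s;    /* opal_progress_sleep_trigger, in seconds */
    int64_t sleep_duration_us;  /* opal_progress_sleep_duration, in microseconds */
    int64_t idle_calls;         /* consecutive progress calls that found no events */
    int     timer_running;      /* set once idle_calls has reached sleep_count */
    int64_t idle_since_s;       /* time at which the timer was started */
} lp_state_t;

/* Call after each progress pass. `events` is the number of events that pass
 * completed and `now_s` is the current time in seconds. Returns how many
 * microseconds the caller should sleep (0 means keep busy-polling). */
static int64_t lp_after_progress(lp_state_t *s, int events, int64_t now_s) {
    if (s->sleep_count < 0) {
        return 0;                       /* mechanism disabled: former behaviour */
    }
    if (events > 0) {
        s->idle_calls = 0;              /* any progress resets everything */
        s->timer_running = 0;
        return 0;
    }
    if (s->idle_calls < s->sleep_count) {
        s->idle_calls++;                /* still in the count-only phase, no clock reads */
        return 0;
    }
    if (!s->timer_running) {
        s->timer_running = 1;           /* enough idle calls: start the timer */
        s->idle_since_s = now_s;
        return 0;
    }
    if (now_s - s->idle_since_s >= s->sleep_trigger_s) {
        return s->sleep_duration_us;    /* low-pressure mode */
    }
    return 0;
}
```

A caller would then do something like `if (us > 0) usleep(us);` after each unsuccessful pass. Keeping the clock out of the count-only phase is what the sleep_count parameter is for: the latency-critical path never touches the timer.
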
>>> Possible Improvements & Pitfalls
>>> ================================
>>>
>>> * Trigger could be set automatically at max(trigger, duration *
>>> numprocs * 2).
>>>
>>> * poll_start and poll_count could be fields of the opal_condition_t
>>> struct.
>>>
>>> * The sleep section could be factored out into a #define and reused in
>>> all the progress paths (I'm not sure my patch is good for progress
>>> threads, for example).
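
The automatic trigger suggested in the first item above needs a unit conversion, since the trigger is given in seconds and the duration in microseconds. A minimal sketch, assuming the comparison is meant in a common unit and rounding the floor up to whole seconds (lp_auto_trigger_s is a hypothetical helper, not part of the patch):

```c
#include <stdint.h>

/* Hypothetical helper: returns max(trigger, duration * numprocs * 2) in seconds.
 * trigger_s is in seconds and duration_us in microseconds, as in the RFC defaults. */
static int64_t lp_auto_trigger_s(int64_t trigger_s, int64_t duration_us,
                                 int64_t numprocs) {
    int64_t floor_us = duration_us * numprocs * 2;
    int64_t floor_s = (floor_us + 999999) / 1000000;  /* round up to whole seconds */
    return trigger_s > floor_s ? trigger_s : floor_s;
}
```

With the defaults (600 s trigger, 1000 us duration), the floor only exceeds the configured trigger above 300,000 processes, so for most jobs the configured value would be used unchanged.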
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel