Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [RFC] Low pressure OPAL progress
From: Sylvain Jeaugey (sylvain.jeaugey_at_[hidden])
Date: 2009-06-09 08:52:20


I understand your point of view, and mostly share it.

I think the biggest point in my example is that the sleep occurs only after
10 minutes of inactivity (I was wrong in my previous e-mail), and this value
is fully configurable. I didn't intend to call sleep after 2 seconds.
Plus, as said before, I planned to have the library call show_help() when
this happens (something like: "Open MPI couldn't receive a message for 10
minutes, lowering pressure") so that an application that really needs
more than 10 minutes to receive a message can increase the value.
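
To make the behaviour described above concrete, here is a minimal sketch of
the decision logic. It is not the code from my patch: the struct and the
function name low_pressure_update() are hypothetical, and only the three
tunables mirror the opal_progress_sleep_* parameters from the RFC. Time is
passed in as an argument so the logic can be exercised without waiting.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state for the low-pressure logic. The first three fields
 * mirror the opal_progress_sleep_* parameters from the RFC; the remaining
 * fields are bookkeeping assumed for this sketch. */
typedef struct {
    int64_t sleep_count;    /* idle calls before the timer starts; -1 disables */
    int64_t trigger_sec;    /* idle seconds before entering low-pressure mode */
    int64_t duration_us;    /* sleep per idle call once in low-pressure mode */
    int64_t idle_calls;     /* consecutive calls that made no progress */
    int64_t idle_since_sec; /* when the timer started (valid if timer_running) */
    bool    timer_running;
} low_pressure_t;

/* Called once per progress call. Returns how many microseconds the caller
 * should sleep; 0 means "keep busy polling". */
int64_t low_pressure_update(low_pressure_t *lp, bool made_progress,
                            int64_t now_sec)
{
    if (lp->sleep_count < 0) {
        return 0;                      /* mechanism disabled */
    }
    if (made_progress) {
        lp->idle_calls = 0;            /* any progress resets everything */
        lp->timer_running = false;
        return 0;
    }
    if (lp->idle_calls < lp->sleep_count) {
        lp->idle_calls++;              /* still in the cheap counting phase */
        return 0;
    }
    if (!lp->timer_running) {
        lp->timer_running = true;      /* start timing the inactivity */
        lp->idle_since_sec = now_sec;
        return 0;
    }
    if (now_sec - lp->idle_since_sec >= lp->trigger_sec) {
        return lp->duration_us;        /* low-pressure mode: sleep a bit */
    }
    return 0;
}
```

A caller would pass a non-zero result to usleep(), and could print the
show_help() warning the first time that happens. Counting calls before
reading the clock keeps the common, busy case down to one comparison and one
increment.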

Looking at the tick rate code, I couldn't see how changing it would make
CPU usage drop. If I understand your e-mail correctly, you block in the
kernel using poll(), is that right? So you may well lose 10 us because
of that kernel call, but this is a lot less than the 1 ms I'm currently
losing with usleep. This makes sense, although it would be hard to
implement since every BTL must have this ability.
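
For comparison, blocking in the kernel could look roughly like the sketch
below. It is only an illustration, not Open MPI code: wait_readable() is a
hypothetical helper, and it assumes the transport exposes a file descriptor,
which is exactly what shared memory does not do.

```c
#include <errno.h>
#include <poll.h>

/* Block until fd is readable or timeout_ms elapses.
 * Returns 1 if data is ready, 0 on timeout, -1 on error or hang-up. */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int rc;

    /* Retry if a signal interrupts the call. For simplicity the full
     * timeout is reused on each retry. */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0) {
        return -1;                 /* poll() itself failed */
    }
    if (rc == 0) {
        return 0;                  /* nothing arrived within the timeout */
    }
    /* POLLERR/POLLHUP without POLLIN is reported as an error here. */
    return (pfd.revents & POLLIN) ? 1 : -1;
}
```

Unlike usleep(), the process wakes up as soon as data arrives instead of
after a fixed delay, so the added latency is the cost of the kernel call
rather than the sleep duration.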

Thanks for your comments, I will continue to think about it.

Sylvain

On Tue, 9 Jun 2009, Ralph Castain wrote:

> My concern with any form of sleep is with the impact on the proc - since
> opal_progress might not be running in a separate thread, won't the sleep
> apply to the process as a whole? In that case, the process isn't free to
> continue computing.
>
> I can envision applications that might call down into the MPI library and
> have opal_progress not find anything, but there is nothing wrong. The
> application could continue computations just fine. I would hate to see us put
> the process to sleep just because the MPI library wasn't busy enough.
>
> Hence my suggestion to just change the tick rate. It would definitely cause a
> higher latency for the first message that arrived while in this state, which
> is bothersome, but would meet the stated objective without interfering with
> the process itself.
>
> LANL has also been looking at this problem of stalled jobs, but from a
> different approach. We monitor (using a separate job) progress in terms of
> output files changing in size plus other factors as specified by the user. If
> we don't see any progress in those terms over some time, then we kill the
> job. We chose that path because of the concerns expressed above - e.g., on
> our RR machine, intense computations can be underway on the Cell blades while
> the Opteron MPI processes wait for us to reach a communication point. We
> -want- those processes spinning away so that, when the comm starts, it can
> proceed as quickly as possible.
>
> Just some thoughts...
> Ralph
>
>
> On Jun 9, 2009, at 5:28 AM, Terry Dontje wrote:
>
>> Sylvain Jeaugey wrote:
>>> Hi Ralph,
>>>
>>> I'm entirely convinced that MPI doesn't have to save power in a normal
>>> scenario. The idea is just that if an MPI process is blocked (i.e. has not
>>> made progress for, say, 5 minutes, the default in my implementation), we
>>> stop busy polling and have the process drop from 100% CPU usage to 0%.
>>>
>>> I do not call sleep() but usleep(). The result is much the same, but it
>>> hurts performance less if communication (unexpectedly) restarts.
>>>
>>> However, the goal of my RFC was also to know if there was a cleaner way
>>> to achieve my goal, and from what I read, I guess I should look at the
>>> "tick" rate instead of trying to do my own delaying.
>>>
>> One way around this is to make all blocked communications (even SM) use
>> poll to block for incoming messages. Jeff and I have discussed this and
>> had many false starts on it. The biggest issue is coming up with a way to
>> have blocks on the SM btl converted to the system poll call without
>> requiring a socket write for every packet.
>>
>> The usleep solution works but is kind of ugly IMO. I think when I looked
>> at doing that the overhead increased significantly for certain
>> communications. Maybe not for toy benchmarks but for less synchronized
>> processes I saw the usleep adding overhead where I didn't want it to.
>>
>> --td
>>> Don't worry, I was quite expecting the configure-in requirement. However,
>>> I don't think my patch is good for inclusion, it is only an example to
>>> describe what I want to achieve.
>>>
>>> Thanks a lot for your comments,
>>> Sylvain
>>>
>>> On Mon, 8 Jun 2009, Ralph Castain wrote:
>>>
>>>> I'm not entirely convinced this actually achieves your goals, but I can
>>>> see some potential benefits. I'm also not sure that power consumption is
>>>> that big of an issue that MPI needs to begin chasing "power saver" modes
>>>> of operation, but that can be a separate debate some day.
>>>>
>>>> I'm assuming you don't mean that you actually call "sleep()" as this
>>>> would be very bad - I'm assuming you just change the opal_progress "tick"
>>>> rate instead. True? If not, and you really call "sleep", then I would
>>>> have to oppose adding this to the code base pending discussion with
>>>> others who can corroborate that this won't cause problems.
>>>>
>>>> Either way, I could live with this so long as it was done as a
>>>> "configure-in" capability. Just having the params default to a value that
>>>> causes the system to behave similarly to today isn't enough - we still
>>>> wind up adding logic into a very critical timing loop for no reason. A
>>>> simple configure option of --enable-mpi-progress-monitoring would be
>>>> sufficient to protect the code.
>>>>
>>>> HTH
>>>> Ralph
>>>>
>>>>
>>>> On Jun 8, 2009, at 9:50 AM, Sylvain Jeaugey wrote:
>>>>
>>>>> What : when nothing has been received for a very long time (e.g. 5
>>>>> minutes), stop busy polling in opal_progress and switch to
>>>>> usleep-based polling.
>>>>>
>>>>> Why : when we have long waits, and especially when an application is
>>>>> deadlocked, detecting it is not easy and a lot of power is wasted until
>>>>> the end of the time slice (if there is one).
>>>>>
>>>>> Where : an example of how it could be implemented is available at
>>>>> http://bitbucket.org/jeaugeys/low-pressure-opal-progress/
>>>>>
>>>>> Principle
>>>>> =========
>>>>>
>>>>> opal_progress() ensures the progression of MPI communication. The
>>>>> current algorithm is a loop calling progress on all registered
>>>>> components. If the program is blocked, the loop will busy-poll
>>>>> indefinitely.
>>>>>
>>>>> Going to sleep after a certain amount of time with nothing received is
>>>>> useful for two reasons :
>>>>> - An administrator can easily detect whether a job is deadlocked : all
>>>>> the processes are in sleep(). Currently, all processors are using 100%
>>>>> CPU and it is very hard to know if progression is still happening or not.
>>>>> - When there is nothing to receive, power usage is greatly reduced.
>>>>>
>>>>> However, it could hurt performance in some cases, typically if we go to
>>>>> sleep just before the message arrives. This will depend heavily on the
>>>>> parameters you give to the sleep mechanism.
>>>>>
>>>>> At first, we can start with the following assumption : if the sleep
>>>>> takes T usec, then sleeping only after 10000xT of inactivity should slow
>>>>> down receives by less than 0.01% (T / (10000xT) = 1/10000).
>>>>>
>>>>> However, other processes may suffer from you being late, and be delayed
>>>>> by T usec (which may represent more than 0.01% for them).
>>>>>
>>>>> So, the goal of this mechanism is mainly to detect far-too-long waits,
>>>>> and it should almost never kick in during normal MPI jobs. It could also
>>>>> emit a warning message when starting to sleep, or at least a trace in
>>>>> the notifier.
>>>>>
>>>>> Details of Implementation
>>>>> =========================
>>>>>
>>>>> Three parameters fully control the behaviour of this mechanism :
>>>>> * opal_progress_sleep_count : number of unsuccessful opal_progress()
>>>>> calls before we start the timer (to prevent latency impact). It defaults
>>>>> to -1, which completely deactivates the sleep (and is therefore
>>>>> equivalent to the former code). A value of 1000 can be thought of as a
>>>>> starting point to enable this mechanism.
>>>>> * opal_progress_sleep_trigger : time to wait before going to
>>>>> low-pressure-powersave mode. Default : 600 (in seconds) = 10 minutes.
>>>>> * opal_progress_sleep_duration : time we sleep at each further
>>>>> unsuccessful call to opal_progress(). Default : 1000 (in us) = 1 ms.
>>>>>
>>>>> The duration is big enough to make the process show 0% CPU in top, but
>>>>> low enough to preserve a good trigger/duration ratio.
>>>>>
>>>>> The trigger is deliberately high to keep a good trigger/duration ratio.
>>>>> Indeed, to prevent delays from causing chain reactions, trigger should
>>>>> be higher than duration * numprocs.
>>>>>
>>>>> Possible Improvements & Pitfalls
>>>>> ================================
>>>>>
>>>>> * Trigger could be set automatically at max(trigger, duration * numprocs
>>>>> * 2).
>>>>>
>>>>> * poll_start and poll_count could be fields of the opal_condition_t
>>>>> struct.
>>>>>
>>>>> * The sleep section could be factored out into a #define and reused in
>>>>> all the progress paths (I'm not sure my patch is good for progress
>>>>> threads, for example).
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> devel_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel