Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Busy waiting [was Re: (no subject)]
From: Ingo Josopait (ingo.josopait_at_[hidden])
Date: 2008-04-24 06:56:03

I am using one of the nodes as a desktop computer, so it is most
important to me that the MPI program does not greedily consume CPU
time. But I would imagine that energy consumption is generally a big
issue, since energy is a major cost factor in a computer cluster. When
a CPU is idle, it uses considerably less energy. Last time I checked,
my computer drew 180 W with both CPU cores working and 110 W with both
cores idle.

I just made a small hack to solve the problem. I inserted a simple sleep
call into the function 'opal_condition_wait':

--- orig/openmpi-1.2.6/opal/threads/condition.h
+++ openmpi-1.2.6/opal/threads/condition.h
@@ -78,6 +78,7 @@
     } else {
         while (c->c_signaled == 0) {
+            usleep(1000);

The usleep call lets the program sleep for about 4 ms (it won't sleep
for a shorter time because of the timer granularity). But that is good
enough for me: the CPU usage is (almost) zero while the tasks are
waiting for one another.

For a proper implementation you would want to poll actively, without a
sleep call, for a few milliseconds, and then switch to some mechanism
that sleeps not for a fixed time but until new messages actually arrive.

Barry Rountree wrote:
> On Wed, Apr 23, 2008 at 11:38:41PM +0200, Ingo Josopait wrote:
>> I can think of several advantages that using blocking or signals to
>> reduce the cpu load would have:
>> - Reduced energy consumption
> Not necessarily. Any time the program ends up running longer, the
> cluster is up and running (and wasting electricity) for that amount of
> time. In the case where lots of tiny messages are being sent you could
> easily end up using more energy.
>> - Running additional background programs could be done far more efficiently
> It's usually more efficient -- especially in terms of cache -- to batch
> up programs to run one after the other instead of running them
> simultaneously.
>> - It would be much simpler to examine the load balance.
> This is true, but it's still pretty trivial to measure load imbalance.
> MPI allows you to write a wrapper library that intercepts any MPI_*
> call. You can instrument the code however you like, then call PMPI_*,
> then catch the return value, finish your instrumentation, and return
> control to your program. Here's some pseudocode:
> int MPI_Barrier(MPI_Comm comm)
> {
>     gettimeofday(&start, NULL);
>     rc = PMPI_Barrier(comm);
>     gettimeofday(&stop, NULL);
>     fprintf(logfile, "Barrier on node %d took %lf seconds\n",
>             rank, delta(&stop, &start));
>     return rc;
> }
> I've got some code that does this for all of the MPI calls in OpenMPI
> (ah, the joys of writing C code using python scripts). Let me know if
> you'd find it useful.
>> It may depend on the type of program and the computational environment,
>> but there are certainly many cases in which putting the system in idle
>> mode would be advantageous. This is especially true for programs with
>> low network traffic and/or high load imbalances.
> <grin> I could use a few more benchmarks like that. Seriously, if
> you're mostly concerned about saving energy, a quick hack is to set a
> timer as soon as you enter an MPI call (say for 100ms) and if the timer
> goes off while you're still in the call, use DVS to drop your CPU
> frequency to the lowest value it has. Then, when you exit the MPI call,
> pop it back up to the highest frequency. This can save a significant
> amount of energy, but even here there can be a performance penalty. For
> example, UMT2K schleps around very large messages, and you really need
> to be running as fast as possible during the MPI_Waitall calls or the
> program will slow down by 1% or so (thus using more energy).
> Doing this just for Barriers and Allreduces seems to speed up the
> program a tiny bit, but I haven't done enough runs to make sure this
> isn't an artifact.
> (This is my dissertation topic, so before asking any question be advised
> that I WILL talk your ear off.)
>> The "spin for a while and then block" method that you mentioned earlier
>> seems to be a good compromise. Just do polling for some time that is
>> long compared to the corresponding system call, and then go to sleep if
>> nothing happens. In this way, the latency would be only marginally
>> increased, while less cpu time is wasted in the polling loops, and I
>> would be much happier.
> I'm interested in seeing what this does for energy savings. Are you
> volunteering to test a patch? (I've got four other papers I need to
> get finished up, so it'll be a few weeks before I start coding.)
> Barry Rountree
> Ph.D. Candidate, Computer Science
> University of Georgia
>> Jeff Squyres wrote:
>>> On Apr 23, 2008, at 3:49 PM, Danesh Daroui wrote:
>>>> Do you really mean that Open MPI uses a busy loop in order to handle
>>>> incoming calls? That seems wrong, since
>>>> spinning is a very bad and inefficient technique for this purpose.
>>> It depends on what you're optimizing for. :-) We're optimizing for
>>> minimum message passing latency on hosts that are not oversubscribed;
>>> polling is very good at that. Polling is much better than blocking,
>>> particularly if the blocking involves a system call (which will be
>>> "slow"). Note that in a compute-heavy environment, the nodes are
>>> going to be running at 100% CPU anyway.
>>> Also keep in mind that you're only going to have "waste" spinning in
>>> MPI if you have a loosely/poorly synchronized application. Granted,
>>> some applications are this way by nature, but we have not chosen to
>>> optimize spare CPU cycles for them. As I said in a prior mail, adding
>>> a blocking strategy is on the to-do list, but it's fairly low in
>>> priority right now. Someone may care enough to improve the message
>>> passing engine to include blocking, but it hasn't happened yet. Want
>>> to work on it? :-)
>>> And for reference: almost all MPI's do busy polling to minimize
>>> latency. Some of them will shift to blocking if nothing happens for a
>>> "long" time. This second piece is what OMPI is lacking.
>>>> Why
>>>> don't you use blocking and/or signals instead of
>>>> that?
>>> FWIW: I mentioned this in my other mail -- latency is quite definitely
>>> negatively impacted when you use such mechanisms. Blocking and
>>> signals are "slow" (in comparison to polling).
>>>> I think the priority of this task is very high because polling
>>>> just wastes resources of the system.
>>> In production HPC environments, the entire resource is dedicated to
>>> the MPI app anyway, so there's nothing else that really needs it. So
>>> we allow them to busy-spin.
>>> There is a mode to call yield() in the middle of every OMPI progress
>>> loop, but it's only helpful for loosely/poorly synchronized MPI apps
>>> and ones that use TCP or shared memory. Low latency networks such as
>>> IB or Myrinet won't be as friendly to this setting because they're
>>> busy polling (i.e., they call yield() much less frequently, if at all).
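
[The yield mode Jeff describes could be sketched as below. progress_once() is a hypothetical stand-in for one pass of a progress engine (here it "finds an event" on the fifth poll), and want_yield plays the role of a "yield when idle" run-time knob; neither is real Open MPI code.]

```c
/* Sketch: a polling progress loop that optionally yields the CPU
 * between empty polls, so other runnable processes get scheduled. */
#include <sched.h>

static int polls;
static int progress_once(void) { return ++polls >= 5; } /* stub engine */

static int want_yield = 1;  /* analogous to a "yield when idle" setting */

static void wait_for_event(void)
{
    while (!progress_once()) {
        if (want_yield)
            sched_yield();  /* still polling, but plays nicer on
                               oversubscribed or desktop nodes */
    }
}
```

[As noted above, this helps loosely synchronized apps over TCP or shared memory, but yields little on networks whose libraries busy-poll internally.]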
>>>> On the other hand,
>>>> what Alberto claims is not reasonable to me.
>>>> Alberto,
>>>> - Are you oversubscribing one node, i.e. running your code on a
>>>> single-processor machine while pretending
>>>> to have four CPUs?
>>>> - Did you compile Open MPI yourself, or install it from an RPM?
>>>> A receiving process shouldn't be that expensive.
>>>> Regards,
>>>> Danesh
>>>> Jeff Squyres wrote:
>>>>> Because on-node communication typically uses shared memory, so we
>>>>> currently have to poll. Additionally, when using mixed on/off-node
>>>>> communication, we have to alternate between polling shared memory and
>>>>> polling the network.
>>>>> Additionally, we actively poll because it's the best way to lower
>>>>> latency. MPI implementations are almost always first judged on their
>>>>> latency, not [usually] their CPU utilization. Going to sleep in a
>>>>> blocking system call will definitely negatively impact latency.
>>>>> We have plans for implementing the "spin for a while and then block"
>>>>> technique (as has been used in other MPI's and middleware layers),
>>>>> but
>>>>> it hasn't been a high priority.
>>>>> On Apr 23, 2008, at 12:19 PM, Alberto Giannetti wrote:
>>>>>> Thanks Torje. I wonder what the benefit is of looping on the
>>>>>> incoming message-queue socket rather than using blocking system
>>>>>> I/O calls, like read() or select().
>>>>>> On Apr 23, 2008, at 12:10 PM, Torje Henriksen wrote:
>>>>>>> Hi Alberto,
>>>>>>> The blocked processes are in fact spin-waiting. While they don't
>>>>>>> have
>>>>>>> anything better to do (waiting for that message), they will check
>>>>>>> their incoming message-queues in a loop.
>>>>>>> So the MPI_Recv()-operation is blocking, but it doesn't mean that
>>>>>>> the
>>>>>>> processes are blocked by the OS scheduler.
>>>>>>> I hope that made some sense :)
>>>>>>> Best regards,
>>>>>>> Torje
>>>>>>> On Apr 23, 2008, at 5:34 PM, Alberto Giannetti wrote:
>>>>>>>> I have a simple MPI program that sends data to the processor with
>>>>>>>> rank 0. The communication works well, but when I run the program
>>>>>>>> on more than 2 processors (-np 4), the extra receivers waiting for
>>>>>>>> data run at >90% CPU load. I understand MPI_Recv() is a blocking
>>>>>>>> operation, but why does it consume so much CPU compared to a
>>>>>>>> regular system read()?
>>>>>>>> #include <sys/types.h>
>>>>>>>> #include <unistd.h>
>>>>>>>> #include <stdio.h>
>>>>>>>> #include <stdlib.h>
>>>>>>>> #include <mpi.h>
>>>>>>>>
>>>>>>>> void process_sender(int);
>>>>>>>> void process_receiver(int);
>>>>>>>>
>>>>>>>> int main(int argc, char* argv[])
>>>>>>>> {
>>>>>>>>   int rank;
>>>>>>>>
>>>>>>>>   MPI_Init(&argc, &argv);
>>>>>>>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>>>   printf("Processor %d (%d) initialized\n", rank, getpid());
>>>>>>>>   if( rank == 1 )
>>>>>>>>     process_sender(rank);
>>>>>>>>   else
>>>>>>>>     process_receiver(rank);
>>>>>>>>   MPI_Finalize();
>>>>>>>>   return 0;
>>>>>>>> }
>>>>>>>>
>>>>>>>> void process_sender(int rank)
>>>>>>>> {
>>>>>>>>   int i, size;
>>>>>>>>   float data[100];
>>>>>>>>
>>>>>>>>   printf("Processor %d initializing data...\n", rank);
>>>>>>>>   for( i = 0; i < 100; ++i )
>>>>>>>>     data[i] = i;
>>>>>>>>   MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>>>>   printf("Processor %d sending data...\n", rank);
>>>>>>>>   MPI_Send(data, 100, MPI_FLOAT, 0, 55, MPI_COMM_WORLD);
>>>>>>>>   printf("Processor %d sent data\n", rank);
>>>>>>>> }
>>>>>>>>
>>>>>>>> void process_receiver(int rank)
>>>>>>>> {
>>>>>>>>   int count;
>>>>>>>>   float value[200];
>>>>>>>>   MPI_Status status;
>>>>>>>>
>>>>>>>>   printf("Processor %d waiting for data...\n", rank);
>>>>>>>>   MPI_Recv(value, 200, MPI_FLOAT, MPI_ANY_SOURCE, 55,
>>>>>>>>            MPI_COMM_WORLD, &status);
>>>>>>>>   printf("Processor %d got data from processor %d\n", rank,
>>>>>>>>          status.MPI_SOURCE);
>>>>>>>>   MPI_Get_count(&status, MPI_FLOAT, &count);
>>>>>>>>   printf("Processor %d got %d elements\n", rank, count);
>>>>>>>> }
>>>>>>>> _______________________________________________
>>>>>>>> users mailing list
>>>>>>>> users_at_[hidden]