Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] benchmark - mpi_reduce() called only once but takes long time - proportional to calculation time
From: Qing Pang (qing.pang_at_[hidden])
Date: 2009-12-04 17:20:40


Thank you so much! It is a synchronization issue. In my case, one node
actually runs slower than the other. Adding MPI_Barrier() helps to
straighten things out.
Thank you for your help!

Eugene Loh wrote:
> Your processes are probably running asynchronously. You could perhaps
> try tracing program execution and look at the timeline. E.g.,
> http://www.open-mpi.org/faq/?category=perftools#free-tools . Or,
> where you have MPI_Wtime calls, just capture those timestamps on each
> process and dump the results at the end of your run. Or, report
> timings for all ranks instead of just for rank 0.
>
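
One way to act on the "report timings for all ranks" suggestion is to summarize each timed phase across ranks rather than printing only rank 0's view. The sketch below is plain C, not MPI code: `phase_summary` and `summarize_phase` are hypothetical names, and it assumes the per-rank times have already been collected into one array (for example with MPI_Gather on rank 0, which is not shown).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: summary of one timed phase across all ranks. */
typedef struct {
    double min;        /* shortest time any rank measured */
    double max;        /* longest time any rank measured */
    double mean;       /* average over all ranks */
    int slowest_rank;  /* first rank that measured the longest time */
} phase_summary;

/* times[i] is the elapsed time that rank i measured for the phase.
   Assumption: the caller has already gathered these values on one rank. */
phase_summary summarize_phase(const double *times, int nranks) {
    assert(times != NULL && nranks > 0);
    phase_summary s = { times[0], times[0], 0.0, 0 };
    double total = 0.0;
    for (int i = 0; i < nranks; i++) {
        if (times[i] < s.min) {
            s.min = times[i];
        }
        if (times[i] > s.max) {
            s.max = times[i];
            s.slowest_rank = i;
        }
        total += times[i];
    }
    s.mean = total / nranks;
    return s;
}
```

A large gap between `min` and `max` for the compute phase would point to ranks running out of step, which is the situation described in this thread.
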
> Put another way, rank 0 must broadcast n. So, no one starts
> computation until they get the Bcast result. Rank 0 probably starts
> its computations before anyone else does. So, it gets to the Reduce
> before anyone else does, but it can't exit until other ranks have
> finished their computations. So, the Reduce time on rank 0 includes
> some amount of other ranks' compute times.
>
> Yet another approach is to insert MPI_Barrier calls at each phase of
> the program so that the various phases are synchronized. This adds
> some overhead to the program, but helps simplify interpretation of the
> timing results.
>
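
The explanation above can be turned into a small timeline model. This is a sketch in plain C, not MPI code, and every name in it (`rank_timeline`, `root_reduce_time`, `root_barrier_wait`, `transfer_cost`) is hypothetical. It assumes all ranks share one clock and that the actual data transfer in the Reduce costs a fixed amount once the last rank has arrived. Under those assumptions, the Reduce time seen by rank 0 grows with the compute time of the slowest rank, which is consistent with the reported jump from 15.65 ms to 15526 ms (roughly 992x) when n grows by 1000x.

```c
#include <assert.h>

/* Hypothetical model of one rank. Times are in seconds on a shared clock. */
typedef struct {
    double compute_start; /* when the rank leaves MPI_Bcast and starts its loop */
    double compute_time;  /* how long its share of the loop takes */
} rank_timeline;

/* Time at which a rank finishes computing and enters the next MPI call. */
static double arrival(const rank_timeline *r) {
    return r->compute_start + r->compute_time;
}

/* Arrival time of the slowest rank. */
double last_arrival(const rank_timeline *ranks, int nranks) {
    assert(nranks > 0);
    double latest = arrival(&ranks[0]);
    for (int i = 1; i < nranks; i++) {
        double a = arrival(&ranks[i]);
        if (a > latest) {
            latest = a;
        }
    }
    return latest;
}

/* No barrier: rank 0's Reduce timer includes the wait for the slowest rank
   plus the (assumed constant) cost of moving the data. */
double root_reduce_time(const rank_timeline *ranks, int nranks,
                        double transfer_cost) {
    return (last_arrival(ranks, nranks) - arrival(&ranks[0])) + transfer_cost;
}

/* Barrier placed before the Reduce: the same wait is charged to the barrier,
   so in this model the Reduce timer would see only transfer_cost. */
double root_barrier_wait(const rank_timeline *ranks, int nranks) {
    return last_arrival(ranks, nranks) - arrival(&ranks[0]);
}
```

With made-up numbers, if rank 0 starts at 0.0 s and computes for 8.0 s while rank 1 starts at 0.5 s and computes for 12.0 s, `root_reduce_time` with a 0.25 s transfer cost gives 4.75 s, of which 4.5 s is waiting. A barrier does not remove that wait; it only moves it out of the Reduce measurement.
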
> Qing Pang wrote:
>
>> I'm running the popular Calculate PI program on a 2-node setup
>> running Ubuntu 8.10 and Open MPI 1.3.3 (with default settings).
>> Password-less ssh is set up, but there is no cluster infrastructure
>> such as a network file system, network time protocol, resource
>> manager, or scheduler. The two nodes are connected through TCP/IP only.
>>
>> When I benchmarked the program, the time spent in MPI_Reduce()
>> turned out to be proportional to the number of intervals (n) used
>> in the calculation. For example, when n = 1,000,000, MPI_Reduce costs
>> 15.65 milliseconds, while for n = 1,000,000,000, MPI_Reduce costs
>> 15526 milliseconds.
>>
>> This confused me. In this Calc-PI program, MPI_Reduce is called only
>> once, no matter how many intervals are used: after both nodes have
>> computed their partial results, it merges them. So the time cost of
>> MPI_Reduce (although it might be slow over a TCP/IP connection)
>> should be roughly constant. But that is obviously not what I saw.
>>
>> Has anyone had a similar problem before? I'm not sure how
>> MPI_Reduce() works internally. Does it matter that I don't have a
>> network file system, network time protocol, resource manager,
>> scheduler, etc. installed?
>>
>> Below is the program - I did feed "n" to it more than once to warm it
>> up.
>>
>> #include "mpi.h"
>> #include <stdio.h>
>> #include <math.h>
>>
>> int main(int argc, char *argv[]) {
>>     int numprocs, myid, rc;
>>     double ACCUPI = 3.1415926535897932384626433832795;
>>     double mypi, pi, h, sum, x;
>>     int n, i;
>>     double starttime, endtime;
>>     double time, told, bcasttime, reducetime, comptime, totaltime;
>>
>>     rc = MPI_Init(&argc, &argv);
>>     if (rc != MPI_SUCCESS) {
>>         printf("Error starting MPI program. Terminating.\n");
>>         MPI_Abort(MPI_COMM_WORLD, rc);
>>     }
>>     MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &myid);
>>
>>     while (1) {
>>         if (myid == 0) {
>>             printf("Enter the number of intervals: (0 quits) \n");
>>             scanf("%d", &n);
>>             starttime = MPI_Wtime();
>>         }
>>
>>         time = MPI_Wtime();
>>         MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
>>
>>         told = time;
>>         time = MPI_Wtime();
>>         bcasttime = time - told;
>>
>>         if (n == 0)
>>             break;
>>         else {
>>             h = 1.0/(double)n;
>>             sum = 0.0;
>>             for (i = myid + 1; i <= n; i += numprocs) {
>>                 x = h*((double)i - 0.5);
>>                 sum += (4.0/(1.0 + x*x));
>>             }
>>             mypi = sum*h;
>>
>>             told = time;
>>             time = MPI_Wtime();
>>             comptime = time - told;
>>
>>             MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0,
>>                        MPI_COMM_WORLD);
>>
>>             told = time;
>>             time = MPI_Wtime();
>>             reducetime = time - told;
>>
>>             if (myid == 0) {
>>                 totaltime = MPI_Wtime() - starttime;
>>                 printf("\nElapsed time (total): %f milliseconds\n",
>>                        totaltime*1000);
>>                 printf("Elapsed time (Bcast): %f milliseconds (%5.2f%%)\n",
>>                        bcasttime*1000, bcasttime*100/totaltime);
>>                 printf("Elapsed time (Reduce): %f milliseconds (%5.2f%%)\n",
>>                        reducetime*1000, reducetime*100/totaltime);
>>                 printf("Elapsed time (Comput): %f milliseconds (%5.2f%%)\n",
>>                        comptime*1000, comptime*100/totaltime);
>>                 printf("\nApproximated pi is %.16f, Error is %.4e\n", pi,
>>                        fabs(pi - ACCUPI));
>>             }
>>         }
>>     }
>>
>>     MPI_Finalize();
>> }