
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] scaling problem with openmpi
From: Roman Martonak (r.martonak_at_[hidden])
Date: 2009-05-19 08:20:57


I am using CPMD 3.11.1, not cp2k. Below are the timings for 20 steps
of MD for 32 water molecules (one of standard CPMD benchmarks) with
openmpi, mvapich and Intel MPI, running on 64 cores (8 blades, each
has 2 quad-core 2.2 GHz AMD Barcelona CPUs).

openmpi-1.3.2 time per one MD step is 3.66 s
summary:
       CPU TIME : 0 HOURS 1 MINUTES 23.85 SECONDS
   ELAPSED TIME : 0 HOURS 1 MINUTES 25.90 SECONDS
 *** CPMD| SIZE OF THE PROGRAM IS 70020/ 319128 kBYTES ***

 PROGRAM CPMD ENDED AT: Tue May 19 11:12:06 2009

 ================================================================
 = COMMUNICATION TASK     AVERAGE MESSAGE LENGTH  NUMBER OF CALLS =
 = SEND/RECEIVE                      8585. BYTES           48447. =
 = BROADCAST                        19063. BYTES             396. =
 = GLOBAL SUMMATION                 32010. BYTES             329. =
 = GLOBAL MULTIPLICATION                0. BYTES               1. =
 = ALL TO ALL COMM                 102033. BYTES            4221. =
 = PERFORMANCE                                        TOTAL TIME  =
 = SEND/RECEIVE                    209.014 MB/S         1.990 SEC =
 = BROADCAST                        10.485 MB/S         0.720 SEC =
 = GLOBAL SUMMATION                154.115 MB/S         0.410 SEC =
 = GLOBAL MULTIPLICATION             0.000 MB/S         0.001 SEC =
 = ALL TO ALL COMM                   7.802 MB/S        55.200 SEC =
 = SYNCHRONISATION                                      2.440 SEC =
 ================================================================

mvapich-1.1.0 time per one MD step is 2.55 s
summary:
       CPU TIME : 0 HOURS 0 MINUTES 59.79 SECONDS
   ELAPSED TIME : 0 HOURS 1 MINUTES 0.65 SECONDS
 *** CPMD| SIZE OF THE PROGRAM IS 59072/ 182960 kBYTES ***

 PROGRAM CPMD ENDED AT: Tue May 19 10:34:56 2009

 ================================================================
 = COMMUNICATION TASK     AVERAGE MESSAGE LENGTH  NUMBER OF CALLS =
 = SEND/RECEIVE                      8585. BYTES           48447. =
 = BROADCAST                        19063. BYTES             396. =
 = GLOBAL SUMMATION                 32010. BYTES             329. =
 = GLOBAL MULTIPLICATION                0. BYTES               1. =
 = ALL TO ALL COMM                 102033. BYTES            4221. =
 = PERFORMANCE                                        TOTAL TIME  =
 = SEND/RECEIVE                    170.466 MB/S         2.440 SEC =
 = BROADCAST                         6.863 MB/S         1.100 SEC =
 = GLOBAL SUMMATION                 61.948 MB/S         1.020 SEC =
 = GLOBAL MULTIPLICATION             0.000 MB/S         0.001 SEC =
 = ALL TO ALL COMM                  14.815 MB/S        29.070 SEC =
 = SYNCHRONISATION                                      0.400 SEC =
 ================================================================

Intel MPI 3.2.1.009 time per one MD step is 1.58 s

summary:
       CPU TIME : 0 HOURS 0 MINUTES 36.11 SECONDS
   ELAPSED TIME : 0 HOURS 0 MINUTES 38.16 SECONDS
 *** CPMD| SIZE OF THE PROGRAM IS 65196/ 178736 kBYTES ***

 PROGRAM CPMD ENDED AT: Tue May 19 10:17:17 2009

 ================================================================
 = COMMUNICATION TASK     AVERAGE MESSAGE LENGTH  NUMBER OF CALLS =
 = SEND/RECEIVE                      8585. BYTES           48447. =
 = BROADCAST                        19063. BYTES             396. =
 = GLOBAL SUMMATION                 32010. BYTES             329. =
 = GLOBAL MULTIPLICATION                0. BYTES               1. =
 = ALL TO ALL COMM                 102033. BYTES            4221. =
 = PERFORMANCE                                        TOTAL TIME  =
 = SEND/RECEIVE                    815.562 MB/S         0.510 SEC =
 = BROADCAST                       754.914 MB/S         0.010 SEC =
 = GLOBAL SUMMATION                180.535 MB/S         0.350 SEC =
 = GLOBAL MULTIPLICATION             0.000 MB/S         0.001 SEC =
 = ALL TO ALL COMM                  38.696 MB/S        11.130 SEC =
 = SYNCHRONISATION                                      0.550 SEC =
 ================================================================

Clearly the whole difference comes down to the ALL TO ALL COMM time.
Running on a single blade (8 cores), all three MPI implementations show a
very similar time per step of about 8.6 s. Open MPI was run with the
--mca mpi_paffinity_alone 1 option; for mvapich and Intel MPI no
particular options were used. I was told by HP that there may be
increased latency when all 8 cores in one blade communicate through a
single-port HCA to the InfiniBand fabric, but even if that is the case I
am still wondering how there can be such a large difference between the
implementations. For CPMD I found that the keyword TASKGROUP, which
introduces a different parallelization scheme, improves the Open MPI
time substantially, lowering it from 3.66 s to 1.67 s, almost to the
value obtained with Intel MPI. Is there perhaps an Open MPI parameter
that could be tuned to help the scaling without using TASKGROUP (maybe
some tuning of the collective operations)?
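
To make the question concrete, this is the kind of tuning I have in
mind: forcing a specific alltoall algorithm in Open MPI's "tuned"
collective component. The parameter names below are what ompi_info
reports for the 1.3 series; the algorithm numbering and the exact CPMD
launch line are assumptions that would need to be checked on the actual
installation:

```shell
# List the tunable parameters of the "tuned" collective component
ompi_info --param coll tuned

# Override the built-in decision rules and force one alltoall algorithm
# (in the 1.3 series: 0 = default decision, 1 = basic linear,
#  2 = pairwise, 3 = modified bruck, 4 = linear with sync, 5 = two-proc)
mpirun --mca mpi_paffinity_alone 1 \
       --mca coll_tuned_use_dynamic_rules 1 \
       --mca coll_tuned_alltoall_algorithm 2 \
       -np 64 cpmd.x input.in
```

One would presumably try each algorithm value in turn and compare the
ALL TO ALL COMM time reported in the CPMD summary.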

Thanks, best regards

Roman

On Mon, May 18, 2009 at 6:58 PM, Noam Bernstein
<noam.bernstein_at_[hidden]> wrote:
>
> On May 18, 2009, at 12:50 PM, Pavel Shamis (Pasha) wrote:
>
>> Roman,
>> Can you please share with us the mvapich numbers that you get? Also, which
>> mvapich version are you using?
>> The default mvapich and openmpi IB tuning is very similar, so it is strange to
>> see such a big difference. Do you know what kind of collective operations are
>> used in this specific application?
>
> This code does a bunch of parallel things in various different places
> (mostly dense matrix math, and some FFT stuff that may or may not
> be parallelized).  In the standard output there's a summary of the time
> taken by various MPI routines.  Perhaps Roman can send them?  The
> code also uses ScaLAPACK, but I'm not sure how CP2K labels the
> timing for those routines in the output.
>
>                                                                        Noam
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>