On Tue, May 19, 2009 at 3:29 PM, Peter Kjellstrom <cap_at_[hidden]> wrote:
> On Tuesday 19 May 2009, Roman Martonak wrote:
> ...
>> openmpi-1.3.2 time per one MD step is 3.66 s
>> ELAPSED TIME : 0 HOURS 1 MINUTES 25.90 SECONDS
>> = ALL TO ALL COMM 102033. BYTES 4221. =
>> = ALL TO ALL COMM 7.802 MB/S 55.200 SEC =
> ...
>> mvapich-1.1.0 time per one MD step is 2.55 s
>> ELAPSED TIME : 0 HOURS 1 MINUTES 0.65 SECONDS
>> = ALL TO ALL COMM 102033. BYTES 4221. =
>> = ALL TO ALL COMM 14.815 MB/S 29.070 SEC =
> ...
>> Intel MPI 3.2.1.009 time per one MD step is 1.58 s
>> ELAPSED TIME : 0 HOURS 0 MINUTES 38.16 SECONDS
>> = ALL TO ALL COMM 102033. BYTES 4221. =
>> = ALL TO ALL COMM 38.696 MB/S 11.130 SEC =
> ...
>> Clearly the whole difference is basically in the ALL TO ALL COMM time.
>> Running on 1 blade (8 cores) all three MPI implementations have very
>> similar times per step of about 8.6 s.
>
> My guess is that what you see is the difference in MPI_Alltoall performance
> for the different MPI-implementations (running in your env. on your hw.).
>
> You could write a trivial loop like this and try on the three MPIs:
>
> MPI_Init
> for i in 1 to 4221
>     MPI_Alltoall(size=102033, ...)
> MPI_Finalize
>
> And time it to confirm this.
>
>> For CPMD I found that using the keyword TASKGROUP,
>> which introduces a different way of parallelization, it is possible to
>> improve on the openmpi time substantially and lower the time from 3.66
>> s to 1.67 s, almost to the value found with Intel MPI.
>
> I guess this changed what kind of communication is done, so you no longer have
> to do 4221 alltoall calls of ~100 kB each, which seem to hurt Open MPI so much.
With TASKGROUP=2 the summary looks as follows:
CPU TIME : 0 HOURS 0 MINUTES 42.09 SECONDS
ELAPSED TIME : 0 HOURS 0 MINUTES 44.01 SECONDS
*** CPMD| SIZE OF THE PROGRAM IS 73532/ 322740 kBYTES ***
PROGRAM CPMD ENDED AT: Tue May 19 11:16:18 2009
================================================================
= COMMUNICATION TASK  AVERAGE MESSAGE LENGTH  NUMBER OF CALLS  =
= SEND/RECEIVE                   8585. BYTES           48447.  =
= BROADCAST                     19063. BYTES             396.  =
= GLOBAL SUMMATION             103463. BYTES             372.  =
= GLOBAL MULTIPLICATION             0. BYTES               1.  =
= ALL TO ALL COMM              231821. BYTES            4221.  =
=                                PERFORMANCE       TOTAL TIME  =
= SEND/RECEIVE                  193.459 MB/S        2.150 SEC  =
= BROADCAST                      10.785 MB/S        0.700 SEC  =
= GLOBAL SUMMATION              339.605 MB/S        0.680 SEC  =
= GLOBAL MULTIPLICATION           0.000 MB/S        0.001 SEC  =
= ALL TO ALL COMM                82.716 MB/S       11.830 SEC  =
= SYNCHRONISATION                                   2.360 SEC  =
================================================================
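
As a cross-check on these summaries: the MB/S figure seems to be simply the average message length times the number of calls, divided by the total time, with 1 MB taken as 10^6 bytes. That formula is inferred from the numbers in this thread, not taken from the CPMD source, and the function and variable names below are made up for illustration. A minimal Python sketch that recomputes the ALL TO ALL COMM rate for the four runs:

```python
# Recompute the ALL TO ALL COMM throughput from the summaries in this thread.
# Assumption (inferred from the reported figures, not from the CPMD source):
#   MB/S = average message length [bytes] * number of calls / total time [s] / 1e6


def effective_mb_per_s(avg_bytes, calls, seconds):
    """Total bytes moved divided by elapsed time, in units of 10^6 bytes/s."""
    return avg_bytes * calls / seconds / 1.0e6


# (label, average message length in bytes, number of calls, total time in s)
RUNS = [
    ("openmpi-1.3.2", 102033, 4221, 55.200),
    ("mvapich-1.1.0", 102033, 4221, 29.070),
    ("Intel MPI 3.2.1.009", 102033, 4221, 11.130),
    ("TASKGROUP=2", 231821, 4221, 11.830),
]


def summarize(runs=RUNS):
    """Map each run label to its recomputed all-to-all rate in MB/s."""
    return {label: effective_mb_per_s(b, n, t) for label, b, n, t in runs}


if __name__ == "__main__":
    for label, rate in summarize().items():
        print(f"{label:22s} {rate:8.3f} MB/s")
```

The recomputed rates agree with the reported ones to within about 0.002 MB/s. They also show that with TASKGROUP=2 the number of all-to-all calls is still 4221, while the average message grows from 102033 to 231821 bytes and the total all-to-all time drops from 55.2 s to 11.83 s.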
Roman