
Subject: Re: [OMPI users] scaling problem with openmpi
From: Rolf Vandevaart (Rolf.Vandevaart_at_[hidden])
Date: 2009-05-20 08:58:29


The correct MCA parameters are the following:
-mca coll_tuned_use_dynamic_rules 1
-mca coll_tuned_dynamic_rules_filename ./dyn_rules
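
For example, with a placeholder executable, input file and process count,
the full mpirun line would look something like:

  mpirun -np 48 \
      -mca coll_tuned_use_dynamic_rules 1 \
      -mca coll_tuned_dynamic_rules_filename ./dyn_rules \
      ./your_app your_input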

You can also run the following command:
ompi_info -mca coll_tuned_use_dynamic_rules 1 -param coll tuned
This will give some insight into all the various algorithms that make up
the tuned collectives.
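
If you only care about the alltoall settings, you can narrow that down,
for example (the grep is just a convenience):

  ompi_info -mca coll_tuned_use_dynamic_rules 1 -param coll tuned | grep alltoall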

If I am understanding what is happening, it looks like the original
MPI_Alltoall made use of three algorithms. (You can look in
coll_tuned_decision_fixed.c)

If message size < 200 and communicator size > 12
   bruck
else if message size < 3000
   basic linear
else
   pairwise
end
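
In rough C terms (this is only a paraphrase of the logic in
coll_tuned_decision_fixed.c, not the literal code; the "message size" is
the per-destination block size, i.e. datatype size times send count):

  #include <stddef.h>

  /* Paraphrase of the fixed alltoall decision, not the literal code. */
  static const char *alltoall_choice(size_t block_dsize, int communicator_size)
  {
      if (block_dsize < 200 && communicator_size > 12)
          return "bruck";
      if (block_dsize < 3000)
          return "basic linear";
      return "pairwise";
  }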

With the file Pavel has provided, the rules change to the following
(maybe someone can confirm):

If message size < 8192
   bruck
else
   pairwise
end
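
For reference, this is how I read the rule file Roman quotes below, with
the fields annotated (assuming the usual coll_tuned alltoall algorithm
numbering, where 1 = basic linear, 2 = pairwise and 3 = bruck):

  1              # number of collectives that have rules
  3              # collective ID 3 = alltoall (see coll_tuned.h)
  1              # number of communicator sizes listed
  64             # the communicator size these rules are for
  2              # number of message-size rules
  0    3 0 0     # from 0 bytes:    algorithm 3 (bruck),    topo 0, segmentation 0
  8192 2 0 0     # from 8192 bytes: algorithm 2 (pairwise), topo 0, segmentation 0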

Rolf

On 05/20/09 07:48, Roman Martonak wrote:
> Many thanks for the highly helpful analysis. Indeed, what Peter says
> seems to be precisely the case here. I tried to run the 32 waters test
> on 48 cores now, with the original cutoff of 100 Ry and with a slightly
> increased one of 110 Ry. Normally, with a larger cutoff one step should
> obviously take more time. Increasing the cutoff, however, also
> increases the size of the data buffer, and it appears to just cross the
> packet-size threshold for different behaviour (the test was run with
> openmpi-1.3.2).
>
> --------------------------------------------------------------------------------------------------------------------------------------------------------
> cutoff 100Ry
>
> time per 1 step is 2.869 s
>
> = ALL TO ALL COMM 151583. BYTES 2211. =
> = ALL TO ALL COMM 16.741 MB/S 20.020 SEC =
>
> --------------------------------------------------------------------------------------------------------------------------------------------------------
> cutoff 110 Ry
>
> time per 1 step is 1.879 s
>
> = ALL TO ALL COMM 167057. BYTES 2211. =
> = ALL TO ALL COMM 43.920 MB/S 8.410 SEC =
>
> --------------------------------------------------------------------------------------------------------------------------------------------------------
> So it actually runs much faster, and ALL TO ALL COMM is 2.6 times
> faster. In my case the threshold seems to be somewhere between
> 151583/48 ≈ 3158 and 167057/48 ≈ 3480 bytes.
>
> I saved the text that Pavel suggested
>
> 1 # num of collectives
> 3 # ID = 3 Alltoall collective (ID in coll_tuned.h)
> 1 # number of comm sizes
> 64 # comm size 64
> 2 # number of msg sizes
> 0 3 0 0 # for message size 0: algorithm 3 (bruck), topo 0, 0 segmentation
> 8192 2 0 0 # 8192 bytes and up: algorithm 2 (pairwise), no topo or segmentation
> # end of first collective
>
> to the file dyn_rules and tried to run, appending the options
> "--mca use_dynamic_rules 1 --mca dynamic_rules_filename ./dyn_rules" to mpirun,
> but it does not make any difference. Is this the correct syntax to enable
> the rules?
> And will the above sample file shift the threshold to lower values (to
> what value)?
>
> Best regards
>
> Roman
>
> On Wed, May 20, 2009 at 10:39 AM, Peter Kjellstrom <cap_at_[hidden]> wrote:
>> On Tuesday 19 May 2009, Peter Kjellstrom wrote:
>>> On Tuesday 19 May 2009, Roman Martonak wrote:
>>>> On Tue, May 19, 2009 at 3:29 PM, Peter Kjellstrom <cap_at_[hidden]> wrote:
>>>>> On Tuesday 19 May 2009, Roman Martonak wrote:
>>>>> ...
>>>>>
>>>>>> openmpi-1.3.2 time per one MD step is 3.66 s
>>>>>> ELAPSED TIME : 0 HOURS 1 MINUTES 25.90 SECONDS
>>>>>> = ALL TO ALL COMM 102033. BYTES 4221. =
>>>>>> = ALL TO ALL COMM 7.802 MB/S 55.200 SEC =
>>> ...
>>>
>>>> With TASKGROUP=2 the summary looks as follows
>>> ...
>>>
>>>> = ALL TO ALL COMM 231821. BYTES 4221. =
>>>> = ALL TO ALL COMM 82.716 MB/S 11.830 SEC =
>>> Wow, according to this it takes 1/5th the time to do the same number (4221)
>>> of alltoalls if the size is (roughly) doubled... (ten times better
>>> performance with the larger transfer size)
>>>
>>> Something is not quite right, could you possibly try to run just the
>>> alltoalls like I suggested in my previous e-mail?
>> I was curious, so I ran some tests. First, it seems that the size reported by
>> CPMD is the total size of the data buffer, not the message size. Running
>> alltoalls with 231821/64 and 102033/64 gives this (on a similar setup):
>>
>> bw for 4221 x 1595 B : 36.5 Mbytes/s time was: 23.3 s
>> bw for 4221 x 3623 B : 125.4 Mbytes/s time was: 15.4 s
>> bw for 4221 x 1595 B : 36.4 Mbytes/s time was: 23.3 s
>> bw for 4221 x 3623 B : 125.6 Mbytes/s time was: 15.3 s
>>
>> So it does seem that OpenMPI has some problems with small alltoalls. It is
>> obviously broken when you can get things across faster by sending more...
>>
>> As a reference I ran with a commercial MPI using the same program and node set
>> (I did not have MVAPICH or Intel MPI on this system):
>>
>> bw for 4221 x 1595 B : 71.4 Mbytes/s time was: 11.9 s
>> bw for 4221 x 3623 B : 125.8 Mbytes/s time was: 15.3 s
>> bw for 4221 x 1595 B : 71.1 Mbytes/s time was: 11.9 s
>> bw for 4221 x 3623 B : 125.5 Mbytes/s time was: 15.3 s
>>
>> To see when OpenMPI falls over I ran with an increasing packet size:
>>
>> bw for 10 x 2900 B : 59.8 Mbytes/s time was: 61.2 ms
>> bw for 10 x 2925 B : 59.2 Mbytes/s time was: 62.2 ms
>> bw for 10 x 2950 B : 59.4 Mbytes/s time was: 62.6 ms
>> bw for 10 x 2975 B : 58.5 Mbytes/s time was: 64.1 ms
>> bw for 10 x 3000 B : 113.5 Mbytes/s time was: 33.3 ms
>> bw for 10 x 3100 B : 116.1 Mbytes/s time was: 33.6 ms
>>
>> The problem seems to be for packets with 1000 bytes < size < 3000 bytes, with a
>> hard edge at 3000 bytes. Your CPMD run was communicating at more or less the
>> worst-case packet size.
>>
>> These are the figures for my "reference" MPI:
>>
>> bw for 10 x 2900 B : 110.3 Mbytes/s time was: 33.1 ms
>> bw for 10 x 2925 B : 110.4 Mbytes/s time was: 33.4 ms
>> bw for 10 x 2950 B : 111.5 Mbytes/s time was: 33.3 ms
>> bw for 10 x 2975 B : 112.4 Mbytes/s time was: 33.4 ms
>> bw for 10 x 3000 B : 118.2 Mbytes/s time was: 32.0 ms
>> bw for 10 x 3100 B : 114.1 Mbytes/s time was: 34.2 ms
>>
>> Setup details:
>> hw: dual-socket quad-core Harpertowns with ConnectX IB and a 1:1 two-level tree
>> sw: CentOS-5.3 x86_64 with OpenMPI-1.3b2 (did not have time to try 1.3.2) on
>> OFED from CentOS (1.3.2-ish I think).
>>
>> /Peter
>>

-- 
=========================
rolf.vandevaart_at_[hidden]
781-442-3043
=========================