
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] scaling problem with openmpi
From: Pavel Shamis (Pasha) (pashash_at_[hidden])
Date: 2009-05-20 06:08:24


The default algorithm thresholds in MVAPICH are different from those in
Open MPI. Using the tuned collective component in Open MPI, you can
configure the Open MPI Alltoall thresholds to match the MVAPICH defaults.
The following MCA parameters tell Open MPI to use custom rules defined
in a plain-text configuration file:
"--mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_dynamic_rules_filename <file>"

Here is an example of a dynamic rules file that should make the Open MPI
Alltoall tuning similar to MVAPICH:
1 # number of collectives
3 # collective ID 3 = Alltoall (IDs are listed in coll_tuned.h)
1 # number of comm sizes
64 # comm size
2 # number of msg sizes
0 3 0 0 # from message size 0: bruck (algorithm 3), topo 0, segmentation 0
8192 2 0 0 # from 8192 bytes: pairwise (algorithm 2), no topo or segmentation
# end of first collective
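Putting the pieces together, the rules file and launch command might look like
this (a sketch; the file name "mvapich_like_rules.conf" and the application
name are my placeholders, not anything from the thread):

```shell
# Write the rules file quoted above (hypothetical file name).
cat > mvapich_like_rules.conf <<'EOF'
1 # number of collectives
3 # collective ID 3 = Alltoall (IDs are listed in coll_tuned.h)
1 # number of comm sizes
64 # comm size
2 # number of msg sizes
0 3 0 0 # from message size 0: bruck (algorithm 3), topo 0, segmentation 0
8192 2 0 0 # from 8192 bytes: pairwise (algorithm 2), no topo or segmentation
EOF

# Then launch with the tuned component's dynamic rules enabled, e.g.:
# mpirun --mca coll_tuned_use_dynamic_rules 1 \
#        --mca coll_tuned_dynamic_rules_filename mvapich_like_rules.conf \
#        -np 64 ./your_app
echo "rules file written"
```

The mpirun line is left commented out since it only makes sense on a cluster
with Open MPI installed.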

Thanks,
Pasha

Peter Kjellstrom wrote:
> On Tuesday 19 May 2009, Peter Kjellstrom wrote:
>
>> On Tuesday 19 May 2009, Roman Martonak wrote:
>>
>>> On Tue, May 19, 2009 at 3:29 PM, Peter Kjellstrom <cap_at_[hidden]> wrote:
>>>
>>>> On Tuesday 19 May 2009, Roman Martonak wrote:
>>>> ...
>>>>
>>>>
>>>>> openmpi-1.3.2 time per one MD step is 3.66 s
>>>>> ELAPSED TIME : 0 HOURS 1 MINUTES 25.90 SECONDS
>>>>> = ALL TO ALL COMM 102033. BYTES 4221. =
>>>>> = ALL TO ALL COMM 7.802 MB/S 55.200 SEC =
>>>>>
>> ...
>>
>>
>>> With TASKGROUP=2 the summary looks as follows
>>>
>> ...
>>
>>
>>> = ALL TO ALL COMM 231821. BYTES 4221. =
>>> = ALL TO ALL COMM 82.716 MB/S 11.830 SEC =
>>>
>> Wow, according to this it takes 1/5th the time to do the same number (4221)
>> of alltoalls if the size is (roughly) doubled... (ten times better
>> performance with the larger transfer size)
>>
>> Something is not quite right, could you possibly try to run just the
>> alltoalls like I suggested in my previous e-mail?
>>
>
> I was curious so I ran some tests. First, it seems that the size reported by
> CPMD is the total size of the data buffer, not the per-rank message size.
> Running alltoalls with 231821/64 and 102033/64 bytes gives this (on a similar setup):
>
> bw for 4221 x 1595 B : 36.5 Mbytes/s time was: 23.3 s
> bw for 4221 x 3623 B : 125.4 Mbytes/s time was: 15.4 s
> bw for 4221 x 1595 B : 36.4 Mbytes/s time was: 23.3 s
> bw for 4221 x 3623 B : 125.6 Mbytes/s time was: 15.3 s
>
> So it does seem that OpenMPI has some problems with small alltoalls. It is
> obviously broken when you can get things across faster by sending more...
>
> As a reference I ran with a commercial MPI using the same program and node-set
> (I did not have MVAPICH nor IntelMPI on this system):
>
> bw for 4221 x 1595 B : 71.4 Mbytes/s time was: 11.9 s
> bw for 4221 x 3623 B : 125.8 Mbytes/s time was: 15.3 s
> bw for 4221 x 1595 B : 71.1 Mbytes/s time was: 11.9 s
> bw for 4221 x 3623 B : 125.5 Mbytes/s time was: 15.3 s
>
> To see when OpenMPI falls over I ran with an increasing packet size:
>
> bw for 10 x 2900 B : 59.8 Mbytes/s time was: 61.2 ms
> bw for 10 x 2925 B : 59.2 Mbytes/s time was: 62.2 ms
> bw for 10 x 2950 B : 59.4 Mbytes/s time was: 62.6 ms
> bw for 10 x 2975 B : 58.5 Mbytes/s time was: 64.1 ms
> bw for 10 x 3000 B : 113.5 Mbytes/s time was: 33.3 ms
> bw for 10 x 3100 B : 116.1 Mbytes/s time was: 33.6 ms
>
> The problem seems to affect packets with 1000 bytes < size < 3000 bytes, with a
> hard edge at 3000 bytes. Your CPMD run was communicating at more or less the
> worst-case packet size.
>
> These are the figures for my "reference" MPI:
>
> bw for 10 x 2900 B : 110.3 Mbytes/s time was: 33.1 ms
> bw for 10 x 2925 B : 110.4 Mbytes/s time was: 33.4 ms
> bw for 10 x 2950 B : 111.5 Mbytes/s time was: 33.3 ms
> bw for 10 x 2975 B : 112.4 Mbytes/s time was: 33.4 ms
> bw for 10 x 3000 B : 118.2 Mbytes/s time was: 32.0 ms
> bw for 10 x 3100 B : 114.1 Mbytes/s time was: 34.2 ms
>
> Setup-details:
> hw: dual socket quad core harpertowns with ConnectX IB and 1:1 2-level tree
> sw: CentOS-5.3 x86_64 with OpenMPI-1.3b2 (did not have time to try 1.3.2) on
> OFED from CentOS (1.3.2-ish I think).
>
> /Peter
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
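As a footnote to the quoted numbers: the per-message sizes and bandwidth
figures above can be reproduced with a little arithmetic. This is a sketch;
the factor-of-two (bytes sent plus bytes received) accounting is my
assumption about how the benchmark computes "bw", not something stated in
the thread:

```shell
nprocs=64

# CPMD reports the total alltoall buffer size; the per-rank message size
# is roughly total/nprocs, rounded up (ceiling division). This recovers
# the 1595 B and 3623 B figures used in the quoted tests.
for total in 102033 231821; do
  echo "$total B total -> $(( (total + nprocs - 1) / nprocs )) B per message"
done

# The quoted "bw" figures are consistent with each rank exchanging msg
# bytes with the other 63 ranks, counting both directions:
#   bw = 2 * iters * msg * (nprocs - 1) / time
awk 'BEGIN {
  printf "1595 B: %.1f MB/s\n", 2 * 4221 * 1595 * 63 / 23.3 / 1e6;
  printf "3623 B: %.1f MB/s\n", 2 * 4221 * 3623 * 63 / 15.4 / 1e6;
}'
```

With that convention the computed values come out close to the quoted
36.5 and 125.4 Mbytes/s, which suggests the reported buffer size really is
the aggregate rather than the per-message size.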