
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] scaling problem with openmpi
From: Peter Kjellstrom (cap_at_[hidden])
Date: 2009-05-20 04:39:09


On Tuesday 19 May 2009, Peter Kjellstrom wrote:
> On Tuesday 19 May 2009, Roman Martonak wrote:
> > On Tue, May 19, 2009 at 3:29 PM, Peter Kjellstrom <cap_at_[hidden]> wrote:
> > > On Tuesday 19 May 2009, Roman Martonak wrote:
> > > ...
> > >
> > >> openmpi-1.3.2                           time per one MD step is 3.66 s
> > >>    ELAPSED TIME :    0 HOURS  1 MINUTES 25.90 SECONDS
> > >>  = ALL TO ALL COMM           102033. BYTES               4221.  =
> > >>  = ALL TO ALL COMM             7.802  MB/S          55.200 SEC  =
>
> ...
>
> > With TASKGROUP=2 the summary looks as follows
>
> ...
>
> >  = ALL TO ALL COMM           231821. BYTES               4221.  =
> >  = ALL TO ALL COMM            82.716  MB/S          11.830 SEC  =
>
> Wow, according to this it takes 1/5th the time to do the same number (4221)
> of alltoalls if the size is (roughly) doubled... (ten times better
> performance with the larger transfer size)
>
> Something is not quite right, could you possibly try to run just the
> alltoalls like I suggested in my previous e-mail?

I was curious so I ran some tests. First, it seems that the size reported by
CPMD is the total size of the data buffer, not the message size. Running
alltoalls with 231821/64 and 102033/64 bytes gives this (on a similar setup):

bw for 4221 x 1595 B : 36.5 Mbytes/s time was: 23.3 s
bw for 4221 x 3623 B : 125.4 Mbytes/s time was: 15.4 s
bw for 4221 x 1595 B : 36.4 Mbytes/s time was: 23.3 s
bw for 4221 x 3623 B : 125.6 Mbytes/s time was: 15.3 s
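The message sizes above are just CPMD's reported "ALL TO ALL COMM ... BYTES" totals divided by the 64 ranks, rounded up; a quick sanity check:

```python
import math

ranks = 64

# CPMD-reported total buffer sizes from the two runs above
for buf in (102033, 231821):
    msg = math.ceil(buf / ranks)  # per-rank message size in bytes
    print(buf, "->", msg, "B per rank")
# -> 1595 B and 3623 B, the sizes used in the alltoall runs
```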

So it does seem that OpenMPI has some problems with small alltoalls. Something
is clearly broken when the data gets across faster by sending more of it...

As a reference I ran the same program and node-set with a commercial MPI
(I did not have MVAPICH or IntelMPI on this system):

bw for 4221 x 1595 B : 71.4 Mbytes/s time was: 11.9 s
bw for 4221 x 3623 B : 125.8 Mbytes/s time was: 15.3 s
bw for 4221 x 1595 B : 71.1 Mbytes/s time was: 11.9 s
bw for 4221 x 3623 B : 125.5 Mbytes/s time was: 15.3 s
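For what it's worth, the Mbytes/s figures in both tables are consistent with counting traffic in both directions to each of the 63 peers, i.e. bw = 2 * 63 * iterations * msgsize / time. This formula is my reconstruction from the numbers, not taken from the benchmark source:

```python
def alltoall_bw(iters, msg_bytes, seconds, ranks=64):
    """Per-rank bandwidth: bytes sent plus received to every peer."""
    peers = ranks - 1
    return 2 * peers * iters * msg_bytes / seconds  # bytes/s

# Reproduce the figures above (Mbytes = 1e6 bytes)
print(alltoall_bw(4221, 1595, 23.3) / 1e6)  # OpenMPI, small msgs: ~36.4
print(alltoall_bw(4221, 3623, 15.4) / 1e6)  # OpenMPI, large msgs: ~125.1
print(alltoall_bw(4221, 1595, 11.9) / 1e6)  # reference MPI, small: ~71.3
```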

To see where OpenMPI falls over, I ran with an increasing packet size:

bw for 10 x 2900 B : 59.8 Mbytes/s time was: 61.2 ms
bw for 10 x 2925 B : 59.2 Mbytes/s time was: 62.2 ms
bw for 10 x 2950 B : 59.4 Mbytes/s time was: 62.6 ms
bw for 10 x 2975 B : 58.5 Mbytes/s time was: 64.1 ms
bw for 10 x 3000 B : 113.5 Mbytes/s time was: 33.3 ms
bw for 10 x 3100 B : 116.1 Mbytes/s time was: 33.6 ms

The problem seems to be for packets with 1000 bytes < size < 3000 bytes, with a
hard edge at 3000 bytes. Your CPMD run was communicating at more or less the
worst-case packet size.

These are the figures for my "reference" MPI:

bw for 10 x 2900 B : 110.3 Mbytes/s time was: 33.1 ms
bw for 10 x 2925 B : 110.4 Mbytes/s time was: 33.4 ms
bw for 10 x 2950 B : 111.5 Mbytes/s time was: 33.3 ms
bw for 10 x 2975 B : 112.4 Mbytes/s time was: 33.4 ms
bw for 10 x 3000 B : 118.2 Mbytes/s time was: 32.0 ms
bw for 10 x 3100 B : 114.1 Mbytes/s time was: 34.2 ms

Setup-details:
hw: dual-socket quad-core Harpertown nodes with ConnectX IB and a 1:1 two-level tree
sw: CentOS-5.3 x86_64 with OpenMPI-1.3b2 (did not have time to try 1.3.2) on
OFED from CentOS (1.3.2-ish, I think).

/Peter