Open MPI Development Mailing List Archives

Subject: [OMPI devel] a performance issue of Open MPI Reduce on infiniband cluster.
From: teng ma (xiaok1981_at_[hidden])
Date: 2011-08-31 14:48:29


Dear all:

      I have run into a performance issue with Open MPI's Reduce on an InfiniBand cluster.
I have two clusters, each with 32 nodes. The only difference between them is the
interconnect: one uses Gigabit Ethernet and the other 20 Gb/s InfiniBand. I configured
Open MPI 1.5.3 on both clusters as follows:

on the InfiniBand cluster:
configure --prefix=/home/tma/opt/ompi153 --with-knem=/opt/knem \
    --disable-debug --with-openib --enable-mpi-f77 --enable-mpi-f90 \
    --enable-mpi-cxx --disable-vt

on the Ethernet cluster:
../openmpi-1.5.3/configure --prefix=/home/tma/opt/ompi153 \
    --with-knem=/opt/knem --disable-debug --enable-mpi-f77 --enable-mpi-f90 \
    --enable-mpi-cxx --disable-vt
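
As a sanity check (my suggestion, not part of the original runs), one way to rule out
a silent fallback to TCP over IPoIB is to verify that the openib BTL was built and then
restrict the run to it explicitly; the hostfile name "hosts" below is just a placeholder:

# list the BTL components compiled into this installation
ompi_info | grep btl

# allow only the openib, shared-memory, and self BTLs, so a fallback
# to TCP would abort instead of running silently
mpirun -np 768 -hostfile hosts --mca btl openib,sm,self ./IMB-MPI1 Reduce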

The only difference between the two setups is that one uses the TCP BTL on Ethernet
and the other uses the openib BTL on InfiniBand. I tested Reduce with IMB 3.2.
The performance is a little surprising: whether I use the tuned coll or the hierarch
coll, Open MPI's Reduce on the Gigabit Ethernet cluster is faster than on the 20 Gb/s
InfiniBand cluster, sometimes by a factor of 10, as the results further below show.
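
For reference, the runs could have been launched along these lines (a sketch; the
hostfile and the way hierarch is selected are my assumptions, not taken from the
report above). Raising the hierarch module's priority above tuned's is one way to
switch between the two coll modules:

# tuned coll (the default selection at this process count)
mpirun -np 768 -hostfile hosts ./IMB-MPI1 Reduce -npmin 768

# prefer the hierarch coll module by raising its priority
mpirun -np 768 -hostfile hosts --mca coll_hierarch_priority 90 \
    ./IMB-MPI1 Reduce -npmin 768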

I used the same mapping between cores and processes on both clusters. I also ran
mvapich2-1.7 on the InfiniBand cluster to check the InfiniBand hardware, and its
results look normal to me: they beat, or come close to, the runtimes on the Ethernet
cluster. So the InfiniBand hardware appears to be working correctly, judging by the
mvapich2 results. I also did not see much difference in inter-node ping-pong latency
or bandwidth on InfiniBand between Open MPI and mvapich2. Is it possible for the
openib BTL to degrade performance when a large number of processes in a communicator
are involved in particular communication patterns? I did not see this phenomenon with
other operations, e.g. broadcast or allgather. Are there any methods to tune the
openib BTL at large scale?
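
One experiment that might help narrow this down (my suggestion, not something already
tried above) is to override the tuned module's size-based algorithm selection for
Reduce and compare the IMB numbers for each fixed algorithm; the available algorithm
IDs can be listed with ompi_info:

# show the tuned coll module's parameters, including the list of
# selectable reduce algorithms
ompi_info --param coll tuned

# force one fixed reduce algorithm (ID taken from the list above)
# instead of the built-in decision function
mpirun -np 768 -hostfile hosts \
    --mca coll_tuned_use_dynamic_rules 1 \
    --mca coll_tuned_reduce_algorithm 3 \
    ./IMB-MPI1 Reduce -npmin 768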

Tuned coll on Ethernet cluster

#----------------------------------------------------------------
# Benchmarking Reduce
# #processes = 768
#----------------------------------------------------------------
       #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
            0 1000 0.08 0.39 0.09
            4 1000 222.57 224.46 223.63
            8 1000 190.48 195.22 194.38
           16 1000 253.82 255.50 254.78
           32 1000 224.99 227.10 226.23
           64 1000 211.55 213.69 212.78
          128 1000 206.82 208.64 207.73
          256 1000 236.94 241.66 238.80
          512 1000 299.05 301.38 300.35
         1024 1000 437.94 441.51 440.01
         2048 1000 1032.04 1040.56 1037.91
         4096 1000 1507.09 1516.15 1512.01
         8192 1000 1871.93 1884.96 1879.88
        16384 1000 2836.64 2853.25 2846.76
        32768 1000 4217.10 4236.68 4227.78
        65536 423 23731.68 23808.00 23773.62
       131072 160 54681.98 57174.22 56774.75
       262144 107 76799.20 81679.58 80494.88
       524288 80 81618.96 94513.69 89345.02
      1048576 40 120627.98 147049.20 132740.58
      2097152 20 47541.80 53145.49 51015.37
      4194304 10 402117.49 407616.31 406311.63
      8388608 5 126613.00 143827.63 141219.01
     16777216 2 204279.07 236565.95 228780.02
     33554432 1 276933.91 398322.11 346959.91
     67108864 1 565240.86 709377.05 655559.84

Tuned coll on InfiniBand cluster

#----------------------------------------------------------------
# Benchmarking Reduce
# #processes = 768
#----------------------------------------------------------------
       #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
            0 1000 0.06 0.09 0.06
            4 1000 12.20 15.05 13.45
            8 1000 10.43 11.34 10.81
           16 1000 10.32 11.55 10.80
           32 1000 9.87 10.78 10.31
           64 1000 9.92 11.13 10.64
          128 1000 2.70 84.15 5.65
          256 1000 2.45 16.78 5.27
          512 1000 5.97 22.52 9.00
         1024 1000 3.94 122.85 8.24
         2048 1000 7.18 168.68 16.86
         4096 1000 14.61 1008.60 79.12
         8192 1000 46.71 2152.43 164.35
        16384 1000 221.19 4381.02 422.91
        32768 775 1875.63 6993.05 2227.96
        65536 640 99.90 15416.71 1138.07
       131072 320 1465.10 52405.28 3230.43
       262144 160 392.92 97336.44 4038.41
       524288 80 796.63 108368.69 7415.07
      1048576 40 1650.22 106789.92 14436.31
      2097152 16 5900.06 635461.49 350146.52
      4194304 10 11706.11 1196665.50 724550.82
      8388608 5 24076.41 1889796.59 1012426.41
     16777216 2 45678.50 2334099.41 1311503.63
     33554432 1 2038235.90 3204766.99 2922900.88
     67108864 1 6048538.92 6359004.02 6224545.86

mvapich2-1.7 on InfiniBand cluster
#----------------------------------------------------------------
# Benchmarking Reduce
# #processes = 768
#----------------------------------------------------------------
       #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
            0 1000 0.09 0.11 0.10
            4 1000 6.16 53.48 14.07
            8 1000 6.08 57.17 14.62
           16 1000 6.18 57.85 14.75
           32 1000 6.26 64.25 15.16
           64 1000 6.98 56.46 17.91
          128 1000 7.90 52.12 19.65
          256 1000 10.12 54.41 21.08
          512 1000 13.73 58.44 25.67
         1024 1000 21.82 53.39 31.80
         2048 1000 40.11 71.48 48.80
         4096 1000 76.37 115.03 86.57
         8192 1000 87.21 248.86 111.28
        16384 1000 523.64 524.47 524.06
        32768 1000 612.20 613.11 612.67
        65536 640 693.71 694.97 694.43
       131072 320 1015.76 1019.67 1017.85
       262144 160 2066.41 2080.15 2074.26
       524288 80 4237.09 4292.31 4270.24
      1048576 40 9458.45 9691.35 9600.54
      2097152 20 32027.40 39877.95 35711.22
      4194304 10 60754.99 65029.00 63449.02
      8388608 5 79504.20 96219.40 90051.34
     16777216 2 117237.45 197361.47 167624.00
     33554432 1 133003.00 475266.93 299665.71
     67108864 1 265702.01 831732.03 598579.16

Teng Ma