Open MPI User's Mailing List Archives

From: Galen M. Shipman (gshipman_at_[hidden])
Date: 2006-02-03 00:27:29


Hello Konstantin,

By using coll_basic_crossover 8 you are forcing all of your
benchmarks to use the basic collectives, which offer poor
performance. When I run the skampi Alltoall benchmark with the
tuned collectives, I get the following results, which seem to
scale quite well. When I have a bit more time I will provide
comparisons with MPICH.

  mpirun -np 8 -mca btl tcp -mca coll self,basic,tuned -mca mpi_paffinity_alone 1 ./skampi

#/*@insyncol_MPI_Alltoall-nodes-short-SM.ski*/
        2 47.3 0.4 8 47.3 0.4 8
        3 57.9 1.7 40 57.9 1.7 40
        4 65.2 1.5 8 65.2 1.5 8
        5 74.0 2.1 10 74.0 2.1 10
        6 84.3 1.5 8 84.3 1.5 8
        7 89.9 0.4 8 89.9 0.4 8
        8 107.8 1.9 8 107.8 1.9 8

#/*@insyncol_MPI_Alltoall-nodes-long-SM.ski*/

        2 1049.0 29.8 8 1049.0 29.8 8
        3 1677.7 49.8 31 1677.7 49.8 31
        4 3287.0 96.8 11 3287.0 96.8 11
        5 3247.3 57.8 8 3247.3 57.8 8
        6 4802.5 98.6 8 4802.5 98.6 8
        7 6166.4 70.3 8 6166.4 70.3 8
        8 7380.8 116.1 8 7380.8 116.1 8

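As a quick back-of-the-envelope check on that scaling claim (my own reasoning, not part of the benchmark output): in an all-to-all, each rank exchanges data with p - 1 peers, so for a fixed per-pair message size the per-rank traffic grows linearly with p. The tuned long-message means above track that growth almost exactly:

```python
# Compare the measured tuned long-message Alltoall means (from the table
# above) against the linear growth in per-rank traffic, which is
# proportional to the p - 1 peers each rank talks to.

measured_us = {2: 1049.0, 3: 1677.7, 4: 3287.0, 5: 3247.3,
               6: 4802.5, 7: 6166.4, 8: 7380.8}

base_p, base_t = 2, measured_us[2]
for p, t in sorted(measured_us.items()):
    traffic_ratio = (p - 1) / (base_p - 1)   # per-rank bytes vs. p = 2
    time_ratio = t / base_t
    print(f"p={p}: time x{time_ratio:4.1f}, traffic x{traffic_ratio:4.1f}")

# At p = 8 the time is ~7.0x the p = 2 time, matching the 7x growth in
# per-rank traffic, i.e. near-ideal scaling for this pattern.
```
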
If I use the basic collectives, things do fall apart with long
messages, but this is expected.

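One plausible reason (my interpretation, not a statement about Open MPI internals): a naive linear all-to-all posts sends to all p - 1 peers at once, so the TCP data in flight can far exceed typical default socket buffers, causing congestion and retransmits. Rough arithmetic for the long test, using the 16384x4-byte message size from the quoted report and an assumed 128 KiB socket buffer:

```python
# Rough estimate of concurrent TCP traffic in a naive linear Alltoall.
# The 65536-byte per-pair size comes from the 16384 x 4-byte long test
# described in the quoted report; the 128 KiB send buffer is an assumed
# typical default, not a measured value.

procs = 8
msg_bytes = 16384 * 4                    # 65536 bytes per pair
per_rank = (procs - 1) * msg_bytes       # bytes each rank sends at once
aggregate = procs * per_rank             # bytes in flight cluster-wide

assumed_sockbuf = 128 * 1024             # hypothetical default sndbuf

print(f"per-rank outstanding: {per_rank} bytes (~{per_rank // 1024} KiB)")
print(f"aggregate in flight:  {aggregate} bytes (~{aggregate // (1 << 20)} MiB)")
print(f"per-rank / sockbuf:   {per_rank / assumed_sockbuf:.1f}x")
```

This is also consistent with the 8 MB btl_tcp_sndbuf/btl_tcp_rcvbuf experiments in the quoted report.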
  mpirun -np 8 -mca btl tcp -mca coll self,basic -mca mpi_paffinity_alone 1 ./skampi

#/*@insyncol_MPI_Alltoall-nodes-short-SM.ski*/

        2 45.7 0.2 8 45.7 0.2 8
        3 55.0 0.9 8 55.0 0.9 8
        4 64.2 0.4 8 64.2 0.4 8
        5 73.4 1.2 8 73.4 1.2 8
        6 83.5 0.5 8 83.5 0.5 8
        7 92.8 1.4 8 92.8 1.4 8
        8 108.1 2.2 8 108.1 2.2 8

#/*@insyncol_MPI_Alltoall-nodes-long-SM.ski*/

        2 798.0 1.5 8 798.0 1.5 8
        3 1756.0 38.5 8 1756.0 38.5 8
        4 99601.8 60958.5 5 99601.8 60958.5 5
        5 134846.3 31683.9 11 134846.3 31683.9 11
        6 224243.7 6599.1 11 224243.7 6599.1 11
        7 230021.1 6788.1 10 230021.1 6788.1 10
        8 242596.5 7693.6 6 242596.5 7693.6 6
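
Putting the two runs side by side (reading the second column of each table as the mean time in microseconds, which is my assumption about the skampi output format): at 8 processes the basic Alltoall is roughly 30x slower than the tuned one on long messages.

```python
# Compare the 8-process long-message means from the two tables above.
tuned_us = 7380.8      # tuned collectives
basic_us = 242596.5    # basic collectives

slowdown = basic_us / tuned_us
print(f"basic vs tuned at p=8: {slowdown:.1f}x slower")   # ~32.9x
```
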

On Feb 2, 2006, at 5:10 PM, Konstantin Kudin wrote:

> Hi all,
>
> There seem to have been problems with the attachment. Here is the
> report:
>
> I did some tests of Open MPI version 1.0.2a4r8848. My motivation
> was an extreme degradation of all-to-all MPI performance on 8 cpus
> (it ran like 1 cpu). At the same time, MPICH 1.2.7 on 8 cpus runs
> more like on 4 (not like 1!).
>
> This was done using Skampi from :
> http://liinwww.ira.uka.de/~skampi/skampi4.1.tar.gz
>
> The version 4.1 was used.
>
> The system is a bunch of dual Opterons connected by Gigabit Ethernet.
>
> The MPI operation I am most interested in is all-to-all exchange.
>
> First of all, there seem to be some problems with the logarithmic
> approach. Here is what I mean. In the following, the first column
> is the packet size, the next one is the average time
> (microseconds), and then comes the standard deviation. The test
> was done on 8 cpus (4 dual nodes).
>
>> mpirun -np 8 -mca mpi_paffinity_alone 1 skampi41
> #/*@inp2p_MPI_Send-MPI_Iprobe_Recv.ski*/
> #Description of the MPI_Send-MPI_Iprobe_Recv measurement:
> 0 74.3 1.3 8 74.3 1.3 8
> 16 77.4 2.1 8 77.4 2.1 8 0.0
> 0.0
> 32 398.9 323.4 100 398.9 323.4 100 0.0
> 0.0
> 64 80.7 2.3 9 80.7 2.3 9 0.0
> 0.0
> 80 79.3 2.3 13 79.3 2.3 13 0.0
> 0.0
>
>> mpirun -np 8 -mca mpi_paffinity_alone 1 -mca coll_basic_crossover 8
> skampi41
> #/*@inp2p_MPI_Send-MPI_Iprobe_Recv.ski*/
> #Description of the MPI_Send-MPI_Iprobe_Recv measurement:
> 0 76.7 2.1 8 76.7 2.1 8
> 16 75.8 1.5 8 75.8 1.5 8 0.0
> 0.0
> 32 74.4 0.6 8 74.4 0.6 8 0.0
> 0.0
> 64 76.3 0.4 8 76.3 0.4 8 0.0
> 0.0
> 80 76.7 0.5 8 76.7 0.5 8 0.0
> 0.0
>
> These anomalously large times for certain packet sizes (either 16
> or 32) without increasing coll_basic_crossover to 8 show up across
> a whole set of tests, so this is not a fluke.
>
> Next, the all-to-all tests. The short test used 64x4-byte
> messages; the long one used 16384x4-byte messages.
>
>> mpirun -np 8 -mca mpi_paffinity_alone 1 -mca coll_basic_crossover 8
> skampi41
> #/*@insyncol_MPI_Alltoall-nodes-short-SM.ski*/
> 2 12.7 0.2 8 12.7 0.2 8
> 3 56.1 0.3 8 56.1 0.3 8
> 4 69.9 1.8 8 69.9 1.8 8
> 5 87.0 2.2 8 87.0 2.2 8
> 6 99.7 1.5 8 99.7 1.5 8
> 7 122.5 2.2 8 122.5 2.2 8
> 8 147.5 2.5 8 147.5 2.5 8
>
> #/*@insyncol_MPI_Alltoall-nodes-long-SM.ski*/
> 2 188.5 0.3 8 188.5 0.3 8
> 3 1680.5 16.6 8 1680.5 16.6 8
> 4 2759.0 15.5 8 2759.0 15.5 8
> 5 4110.2 34.0 8 4110.2 34.0 8
> 6 75443.5 44383.9 6 75443.5 44383.9 6
> 7 242133.4 870.5 2 242133.4 870.5 2
> 8 252436.7 4016.8 8 252436.7 4016.8 8
>
>> mpirun -np 8 -mca mpi_paffinity_alone 1 -mca coll_basic_crossover 8
> \
> -mca coll_sm_info_num_procs 8 -mca btl_tcp_sndbuf 8388608 -mca
> btl_tcp_rcvbuf 8388608 skampi41
> #/*@insyncol_MPI_Alltoall-nodes-short-SM.ski*/
> 2 13.1 0.1 8 13.1 0.1 8
> 3 57.4 0.3 8 57.4 0.3 8
> 4 73.7 1.6 8 73.7 1.6 8
> 5 87.1 2.0 8 87.1 2.0 8
> 6 103.7 2.0 8 103.7 2.0 8
> 7 118.3 2.4 8 118.3 2.4 8
> 8 146.7 3.1 8 146.7 3.1 8
>
> #/*@insyncol_MPI_Alltoall-nodes-long-SM.ski*/
> 2 185.8 0.6 8 185.8 0.6 8
> 3 1760.4 17.3 8 1760.4 17.3 8
> 4 2916.8 52.1 8 2916.8 52.1 8
> 5 106993.4 102562.4 2 106993.4 102562.4 2
> 6 260723.1 6679.1 2 260723.1 6679.1 2
> 7 240225.2 6369.8 6 240225.2 6369.8 6
> 8 250848.1 4863.2 6 250848.1 4863.2 6
>
>
>> mpirun -np 8 -mca mpi_paffinity_alone 1 -mca coll_basic_crossover 8
> \
> -mca coll_sm_info_num_procs 8 -mca btl_tcp_sndbuf 8388608 \
> -mca btl_tcp_rcvbuf 8388608 -mca btl_tcp_min_send_size 32768 \
> -mca btl_tcp_max_send_size 65536 skampi41
> #/*@insyncol_MPI_Alltoall-nodes-short-SM.ski*/
> 2 13.5 0.2 8 13.5 0.2 8
> 3 57.3 1.8 8 57.3 1.8 8
> 4 68.8 0.5 8 68.8 0.5 8
> 5 83.2 0.6 8 83.2 0.6 8
> 6 102.9 1.8 8 102.9 1.8 8
> 7 117.4 2.3 8 117.4 2.3 8
> 8 149.3 2.1 8 149.3 2.1 8
>
> #/*@insyncol_MPI_Alltoall-nodes-long-SM.ski*/
> 2 187.5 0.5 8 187.5 0.5 8
> 3 1661.1 33.4 8 1661.1 33.4 8
> 4 2715.9 6.9 8 2715.9 6.9 8
> 5 116805.2 43036.4 8 116805.2 43036.4 8
> 6 163177.7 41363.4 7 163177.7 41363.4 7
> 7 233105.5 20621.4 2 233105.5 20621.4 2
> 8 332049.5 83860.5 2 332049.5 83860.5 2
>
>
> Same stuff for MPICH 1.2.7 (sockets, no shared memory):
> #/*@insyncol_MPI_Alltoall-nodes-short-SM.ski*/
> 2 312.5 106.5 100 312.5 106.5 100
> 3 546.9 136.2 100 546.9 136.2 100
> 4 2929.7 195.3 100 2929.7 195.3 100
> 5 2070.3 203.7 100 2070.3 203.7 100
> 6 2929.7 170.0 100 2929.7 170.0 100
> 7 1328.1 186.0 100 1328.1 186.0 100
> 8 3203.1 244.4 100 3203.1 244.4 100
>
> #/*@insyncol_MPI_Alltoall-nodes-long-SM.ski*/
> 2 390.6 117.8 100 390.6 117.8 100
> 3 3164.1 252.6 100 3164.1 252.6 100
> 4 5859.4 196.3 100 5859.4 196.3 100
> 5 15234.4 6895.1 30 15234.4 6895.1 30
> 6 18136.2 5563.7 14 18136.2 5563.7 14
> 7 14204.5 2898.0 11 14204.5 2898.0 11
> 8 11718.8 1594.7 4 11718.8 1594.7 4
>
> So, as one can see, MPICH latencies are much higher for small
> packets, yet things are far more consistent for larger ones.
> Depending on the settings, Open MPI degrades at either 5 or 6
> cpus.
>
> Konstantin
>
>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users