Open MPI User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2006-02-03 02:11:49


Greetings Konstantin.

Many thanks for this report. Another user submitted almost the same
issue earlier today (poor performance of Open MPI 1.0.x collectives;
see http://www.open-mpi.org/community/lists/users/2006/02/0558.php).

Let me provide an additional clarification on Galen's reply:

The collectives in Open MPI 1.0.x are known to be sub-optimal -- they
return correct results, but they are not optimized at all. This is
what Galen meant by "If I use the basic collectives then things do
fall apart with long messages, but this is expected". The
collectives in the Open MPI 1.1.x series (i.e., our current
development trunk) provide *much* better performance.

Galen ran his tests using the "tuned" collective module in the 1.1.x
series -- these are the "better" collectives that I referred to
above. This "tuned" module does not exist in the 1.0.x series.

You can download a 1.1.x nightly snapshot -- including the new
"tuned" module -- from here:

        http://www.open-mpi.org/nightly/trunk/

If you get the opportunity, could you re-try your application with a
1.1 snapshot?
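
If it helps, the whole sequence would look something like this (the
snapshot file name and install prefix are just placeholders):

        tar zxf openmpi-1.1aXrXXXX.tar.gz     # nightly snapshot tarball
        cd openmpi-1.1aXrXXXX
        ./configure --prefix=$HOME/openmpi-1.1
        make all install

        # re-run SKaMPI against the new installation
        $HOME/openmpi-1.1/bin/mpirun -np 8 -mca mpi_paffinity_alone 1 skampi41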

On Feb 2, 2006, at 6:10 PM, Konstantin Kudin wrote:

> Hi all,
>
> There seem to have been problems with the attachment. Here is the
> report:
>
> I did some tests of Open MPI version 1.0.2a4r8848. My motivation was
> an extreme degradation of all-to-all MPI performance on 8 cpus (it ran
> like 1 cpu). At the same time, MPICH 1.2.7 on 8 cpus runs more like it
> does on 4 (not like 1!).
>
> This was done using SKaMPI version 4.1, from:
> http://liinwww.ira.uka.de/~skampi/skampi4.1.tar.gz
>
> The system is a bunch of dual Opterons connected by Gigabit Ethernet.
>
> The MPI operation I am most interested in is all-to-all exchange.
>
> First of all, there seem to be some problems with the logarithmic
> approach. Here is what I mean. In the following, the first column is
> the packet size, the next one is the average time (microseconds), and
> then comes the standard deviation. The test was done on 8 cpus (4 dual
> nodes).
>
>> mpirun -np 8 -mca mpi_paffinity_alone 1 skampi41
> #/*@inp2p_MPI_Send-MPI_Iprobe_Recv.ski*/
> #Description of the MPI_Send-MPI_Iprobe_Recv measurement:
> 0 74.3 1.3 8 74.3 1.3 8
> 16 77.4 2.1 8 77.4 2.1 8 0.0 0.0
> 32 398.9 323.4 100 398.9 323.4 100 0.0 0.0
> 64 80.7 2.3 9 80.7 2.3 9 0.0 0.0
> 80 79.3 2.3 13 79.3 2.3 13 0.0 0.0
>
>> mpirun -np 8 -mca mpi_paffinity_alone 1 -mca coll_basic_crossover 8 skampi41
> #/*@inp2p_MPI_Send-MPI_Iprobe_Recv.ski*/
> #Description of the MPI_Send-MPI_Iprobe_Recv measurement:
> 0 76.7 2.1 8 76.7 2.1 8
> 16 75.8 1.5 8 75.8 1.5 8 0.0 0.0
> 32 74.4 0.6 8 74.4 0.6 8 0.0 0.0
> 64 76.3 0.4 8 76.3 0.4 8 0.0 0.0
> 80 76.7 0.5 8 76.7 0.5 8 0.0 0.0
>
> These anomalously large times for certain packet sizes (either 16 or
> 32 bytes) without increasing coll_basic_crossover to 8 show up across
> a whole set of tests, so this is not a fluke.
>
> Next, the all-to-all results. The short test used 64x4 byte messages,
> and the long one used 16384x4 byte messages.
>
>> mpirun -np 8 -mca mpi_paffinity_alone 1 -mca coll_basic_crossover 8 skampi41
> #/*@insyncol_MPI_Alltoall-nodes-short-SM.ski*/
> 2 12.7 0.2 8 12.7 0.2 8
> 3 56.1 0.3 8 56.1 0.3 8
> 4 69.9 1.8 8 69.9 1.8 8
> 5 87.0 2.2 8 87.0 2.2 8
> 6 99.7 1.5 8 99.7 1.5 8
> 7 122.5 2.2 8 122.5 2.2 8
> 8 147.5 2.5 8 147.5 2.5 8
>
> #/*@insyncol_MPI_Alltoall-nodes-long-SM.ski*/
> 2 188.5 0.3 8 188.5 0.3 8
> 3 1680.5 16.6 8 1680.5 16.6 8
> 4 2759.0 15.5 8 2759.0 15.5 8
> 5 4110.2 34.0 8 4110.2 34.0 8
> 6 75443.5 44383.9 6 75443.5 44383.9 6
> 7 242133.4 870.5 2 242133.4 870.5 2
> 8 252436.7 4016.8 8 252436.7 4016.8 8
>
>> mpirun -np 8 -mca mpi_paffinity_alone 1 -mca coll_basic_crossover 8 \
>    -mca coll_sm_info_num_procs 8 -mca btl_tcp_sndbuf 8388608 \
>    -mca btl_tcp_rcvbuf 8388608 skampi41
> #/*@insyncol_MPI_Alltoall-nodes-short-SM.ski*/
> 2 13.1 0.1 8 13.1 0.1 8
> 3 57.4 0.3 8 57.4 0.3 8
> 4 73.7 1.6 8 73.7 1.6 8
> 5 87.1 2.0 8 87.1 2.0 8
> 6 103.7 2.0 8 103.7 2.0 8
> 7 118.3 2.4 8 118.3 2.4 8
> 8 146.7 3.1 8 146.7 3.1 8
>
> #/*@insyncol_MPI_Alltoall-nodes-long-SM.ski*/
> 2 185.8 0.6 8 185.8 0.6 8
> 3 1760.4 17.3 8 1760.4 17.3 8
> 4 2916.8 52.1 8 2916.8 52.1 8
> 5 106993.4 102562.4 2 106993.4 102562.4 2
> 6 260723.1 6679.1 2 260723.1 6679.1 2
> 7 240225.2 6369.8 6 240225.2 6369.8 6
> 8 250848.1 4863.2 6 250848.1 4863.2 6
>
>
>> mpirun -np 8 -mca mpi_paffinity_alone 1 -mca coll_basic_crossover 8 \
>    -mca coll_sm_info_num_procs 8 -mca btl_tcp_sndbuf 8388608 \
>    -mca btl_tcp_rcvbuf 8388608 -mca btl_tcp_min_send_size 32768 \
>    -mca btl_tcp_max_send_size 65536 skampi41
> #/*@insyncol_MPI_Alltoall-nodes-short-SM.ski*/
> 2 13.5 0.2 8 13.5 0.2 8
> 3 57.3 1.8 8 57.3 1.8 8
> 4 68.8 0.5 8 68.8 0.5 8
> 5 83.2 0.6 8 83.2 0.6 8
> 6 102.9 1.8 8 102.9 1.8 8
> 7 117.4 2.3 8 117.4 2.3 8
> 8 149.3 2.1 8 149.3 2.1 8
>
> #/*@insyncol_MPI_Alltoall-nodes-long-SM.ski*/
> 2 187.5 0.5 8 187.5 0.5 8
> 3 1661.1 33.4 8 1661.1 33.4 8
> 4 2715.9 6.9 8 2715.9 6.9 8
> 5 116805.2 43036.4 8 116805.2 43036.4 8
> 6 163177.7 41363.4 7 163177.7 41363.4 7
> 7 233105.5 20621.4 2 233105.5 20621.4 2
> 8 332049.5 83860.5 2 332049.5 83860.5 2
>
>
> The same tests with MPICH 1.2.7 (sockets, no shared memory):
> #/*@insyncol_MPI_Alltoall-nodes-short-SM.ski*/
> 2 312.5 106.5 100 312.5 106.5 100
> 3 546.9 136.2 100 546.9 136.2 100
> 4 2929.7 195.3 100 2929.7 195.3 100
> 5 2070.3 203.7 100 2070.3 203.7 100
> 6 2929.7 170.0 100 2929.7 170.0 100
> 7 1328.1 186.0 100 1328.1 186.0 100
> 8 3203.1 244.4 100 3203.1 244.4 100
>
> #/*@insyncol_MPI_Alltoall-nodes-long-SM.ski*/
> 2 390.6 117.8 100 390.6 117.8 100
> 3 3164.1 252.6 100 3164.1 252.6 100
> 4 5859.4 196.3 100 5859.4 196.3 100
> 5 15234.4 6895.1 30 15234.4 6895.1 30
> 6 18136.2 5563.7 14 18136.2 5563.7 14
> 7 14204.5 2898.0 11 14204.5 2898.0 11
> 8 11718.8 1594.7 4 11718.8 1594.7 4
>
> So, as one can see, MPICH latencies are much higher for small packets,
> yet things are far more consistent for larger ones. Depending on the
> settings, Open MPI degrades at either 5 or 6 cpus.
>
> Konstantin

-- 
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/