
Subject: [OMPI users] Very different speed of collective tuned algorithms for alltoallv
From: Daniel Spångberg (daniels_at_[hidden])
Date: 2009-08-29 08:59:08


Dear OpenMPI list,

I noticed a performance problem when increasing the number of CPUs used
to solve my problem. I traced the problem to the MPI_Alltoallv calls. It
turns out the default basic linear algorithm is very sensitive to the
number of CPUs, but the pairwise routine behaves appropriately in my
case. I have performed tests on 16 processes and on 24 processes, using
three 8-core nodes (dual Intel quad-core, 2.5 GHz) connected with GbE.
The test sends about 12 kB from each process to every other process. I
know alltoallv is not the best choice when all the data sizes are the
same, but this way it reproduces the situation in my original code.

I have set "coll_tuned_use_dynamic_rules=1" in
$HOME/.openmpi/mca-params.conf
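
For completeness, that file currently contains just the single line
below. If one only wanted the pairwise algorithm (and not the
comparison below), the algorithm number could presumably also be set
in the same file; the second, commented-out line is an untested sketch
of that.

coll_tuned_use_dynamic_rules = 1
# untested: would select the pairwise alltoallv algorithm globally
# coll_tuned_alltoallv_algorithm = 2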

For default runs I used:
time mpirun -np 16 -machinefile hostfile ./testalltoallv
For the basic linear algorithm I used:
time mpirun -np 16 -machinefile hostfile -mca coll_tuned_alltoallv_algorithm 1 ./testalltoallv
For the pairwise algorithm I used:
time mpirun -np 16 -machinefile hostfile -mca coll_tuned_alltoallv_algorithm 2 ./testalltoallv

For 24 processes I replaced -np 16 with -np 24. The results (runtime in
seconds):

                 -np 16   -np 24
default             2.1     15.6
basic linear        2.1     15.6
pairwise            2.1      2.8

*******************************************
A speed difference of almost a factor of 6!!!
*******************************************

The test code:

#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
   const int data_size=3000;
   int repeat=100;
   int rank,size;
   int i,j;
   int *sendbuf, *sendcount, *senddispl;
   int *recvbuf, *recvcount, *recvdispl;

   MPI_Init(&argc,&argv);
   MPI_Comm_rank(MPI_COMM_WORLD,&rank);
   MPI_Comm_size(MPI_COMM_WORLD,&size);

   sendbuf=malloc(size * data_size * sizeof *sendbuf);
   recvbuf=malloc(size * data_size * sizeof *recvbuf);
   sendcount=malloc(size * sizeof *sendcount);
   senddispl=malloc(size * sizeof *senddispl);
   recvcount=malloc(size * sizeof *recvcount);
   recvdispl=malloc(size * sizeof *recvdispl);

   /* Set up maximum receive lengths
      (*sizeof(int) because MPI_BYTE is used later on) */
   for (i=0; i<size; i++)
     {
       recvcount[i]=data_size*sizeof(int);
       recvdispl[i]=i*data_size*sizeof(int);
     }

   /* Set up number of data items to send */

   for (i=0; i<size; i++)
       sendcount[i]=data_size*sizeof(int);
   for (i=0; i<size; i++)
       senddispl[i]=i*data_size*sizeof(int);

   /* Do a repetitive test. */
   for (j=0; j<repeat; j++)
     MPI_Alltoallv(sendbuf,sendcount,senddispl,MPI_BYTE,
                  recvbuf,recvcount,recvdispl,MPI_BYTE,
                  MPI_COMM_WORLD);
   MPI_Finalize();
   return 0;
}
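
As a side note, the timings above are wall-clock times of the whole
mpirun, so they include startup cost. If one wanted to time only the
collective calls, the repeat loop in main() could be replaced with
something like the sketch below (my assumption, not part of the test I
actually ran); the comparison would then be independent of launch
overhead.

   double t0, t1; /* could also be declared with the other variables */

   /* Line all ranks up before starting the clock, then time only the
      MPI_Alltoallv calls. */
   MPI_Barrier(MPI_COMM_WORLD);
   t0 = MPI_Wtime();
   for (j=0; j<repeat; j++)
     MPI_Alltoallv(sendbuf,sendcount,senddispl,MPI_BYTE,
                   recvbuf,recvcount,recvdispl,MPI_BYTE,
                   MPI_COMM_WORLD);
   t1 = MPI_Wtime();
   if (rank==0)
     printf("%d MPI_Alltoallv calls took %g seconds\n", repeat, t1-t0);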

The hostfile:
arthur
arthur
arthur
arthur
arthur
arthur
arthur
arthur
trillian
trillian
trillian
trillian
trillian
trillian
trillian
trillian
zaphod
zaphod
zaphod
zaphod
zaphod
zaphod
zaphod
zaphod
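
The same hostfile could also be written more compactly with slot
counts (equivalent as far as I know):

arthur slots=8
trillian slots=8
zaphod slots=8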

I am using Open MPI 1.3.2.

For me the problem is essentially solved, since I can now change the
algorithm and get reasonable speed for my problem, but I was somewhat
surprised by the very large difference in speed, so I wanted to report
it here in case other users find themselves in a similar situation.

-- 
Daniel Spångberg
Materialkemi
Uppsala Universitet