Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] problem with overlapping communication with calculation
From: Daniel Spångberg (daniels_at_[hidden])
Date: 2009-03-25 11:05:56


Dear list,

The bad behaviour now only occurs with version 1.2.X of openmpi. I have
tried 1.2.5, 1.2.8 and 1.2.9 with gcc, and 1.2.7 and 1.2.9 with pgi cc;
the problem is present in all of those. With 1.3.1 I can find no problem
at all. So perhaps that means that the problem is solved?

mpirun -np 4 ./test4|head
Sum should be 60
Sum should be 60
Sum should be 60
Sum should be 60
Result on rank 1 strangely is 50
Result on rank 1 strangely is 30
Result on rank 3 strangely is 90
Result on rank 3 strangely is 80
Result on rank 0 strangely is 50
Result on rank 1 strangely is 40

Without IB there is no problem:
mpirun -mca btl self,tcp -np 4 ./test4
Sum should be 60
Sum should be 60
Sum should be 60
Sum should be 60

The full (bug-fixed) code:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
   int rank,size,i,j,k;
   const int arrlen=10;
   const int repeattest=1000000;
   double *array;
   MPI_Request *reqarr;
   MPI_Status *mpistat;
   MPI_Datatype STRIDED;
   int torank,fromrank,nreq;
   int sumshouldbe;
   MPI_Init(&argc,&argv);
   MPI_Comm_rank(MPI_COMM_WORLD,&rank);
   MPI_Comm_size(MPI_COMM_WORLD,&size);

   /* Non-contiguous data */
   MPI_Type_vector(arrlen,1,size,MPI_DOUBLE,&STRIDED);
   MPI_Type_commit(&STRIDED);

   array=malloc(arrlen*size *sizeof *array);
   reqarr=malloc(2*size*sizeof *reqarr);
   mpistat=malloc(2*size*sizeof *mpistat);

   /* Setup communication */
   sumshouldbe=0;
   nreq=0;
   for (i=1; i<size; i++)
     {
       torank=rank+i;
       if (torank>=size)
         torank-=size;
       fromrank=rank-i;
       if (fromrank<0)
         fromrank+=size;
       MPI_Recv_init(array+i,1,STRIDED,fromrank,i,MPI_COMM_WORLD,reqarr+nreq);
       nreq++;
       MPI_Send_init(array,1,STRIDED,torank,i,MPI_COMM_WORLD,reqarr+nreq);
       nreq++;
       sumshouldbe+=i;
     }
   printf("Sum should be %g\n",(double)arrlen*sumshouldbe);
   /* Do the tests. */
   for (j=0; j<repeattest; j++)
     {
       double sum=0.;
       /* Init test arrays. Array on first process is initially all
          zero. On second process all one, etc. Same as rank number. */
       for (i=0; i<arrlen*size; i++)
         array[i]=(double)rank;

       /* Start communication */
       MPI_Startall(nreq,reqarr);

       /* Accumulate column 0 of the array. This column is the send
          buffer: it is only read by the pending sends and is never
          written by any receive. */
       for (i=0; i<arrlen; i++)
         sum+=array[i*size];

       /* Wait for communication to finish */
       MPI_Waitall(nreq,reqarr,mpistat);

       /* Accumulate part of arrays that have been communicated. */
       for (i=0; i<arrlen; i++)
         {
           for (k=0; k<size-1; k++)
             sum+=array[i*size+1+k];
         }

       if (sum!=arrlen*sumshouldbe)
         printf("Result on rank %d strangely is %g\n",rank,sum);
     }

   MPI_Finalize();
   return 0;
}

Details about the computer and OS are in the original mail (quoted below).

Daniel Spångberg

On 2009-03-25 14:52:16, Daniel Spångberg <daniels_at_[hidden]> wrote:

> Dear list,
>
> A colleague pointed out an error in my test code. The final loop should
> not be
> for (i=0; i<arrlen*(size-1); i++)
> but rather
> for (i=0; i<arrlen; i++)
>
> details, details... Anyway, I still get problems from time to time with
> this test code, but I have not yet had time to figure out the
> circumstances when this happens. I will report back to this list once I
> know what's going on.
>
> Sorry to trouble you too early!
>
> Daniel Spångberg
>
>
> On 2009-03-25 09:44:37, Daniel Spångberg <daniels_at_[hidden]> wrote:
>
>> Dear list,
>>
>> We've found a problem with openmpi when running over IB: a calculation
>> that reads some elements of an array overlaps with communication
>> involving other elements (not used in the calculation) of the same
>> array. I have written a small test program (below) that shows this
>> behaviour. When the array is small (arrlen in the code), more problems
>> occur. The problems only occur when using IB (even on the same node!?);
>> with mpirun -mca btl tcp,self the problem vanishes.
>>
>> The behaviour with 1.2.9 and 1.3.1 is slightly different: problems
>> already occur with 3 processes under openmpi 1.2.9, but 4 processes
>> are required for problems with 1.3.1. Proper output on 4 processes
>> should just be:
>> Sum should be 60
>> Sum should be 60
>> Sum should be 60
>> Sum should be 60
>>
>> With IB:
>> mpirun -np 4 ./test3|head
>> Sum should be 60
>> Sum should be 60
>> Sum should be 60
>> Sum should be 60
>> Result on rank 0 strangely is 1.06316e+248
>> Result on rank 2 strangely is 1.54396e+262
>> Result on rank 3 strangely is 3.87325e+233
>> Result on rank 1 strangely is 1.54396e+262
>> Result on rank 1 strangely is 1.54396e+262
>> Result on rank 2 strangely is 1.54396e+262
>>
>>
>> Info about the system:
>>
>> openmpi: 1.2.9, 1.3.1
>>
>> From ompi_info:
>> MCA btl: openib (MCA v2.0, API v2.0, Component v1.3.1)
>>
>> From lspci:
>> 04:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1)
>>
>> configure picks up ibverbs:
>> --- MCA component btl:ofud (m4 configuration macro)
>> checking for MCA component btl:ofud compile mode... dso
>> checking --with-openib value... simple ok (unspecified)
>> checking --with-openib-libdir value... simple ok (unspecified)
>> checking for fcntl.h... (cached) yes
>> checking sys/poll.h usability... yes
>> checking sys/poll.h presence... yes
>> checking for sys/poll.h... yes
>> checking infiniband/verbs.h usability... yes
>> checking infiniband/verbs.h presence... yes
>> checking for infiniband/verbs.h... yes
>> looking for library without search path
>> checking for ibv_open_device in -libverbs... yes
>> checking number of arguments to ibv_create_cq... 5
>> checking whether IBV_EVENT_CLIENT_REREGISTER is declared... yes
>> checking for ibv_get_device_list... yes
>> checking for ibv_resize_cq... yes
>> checking for struct ibv_device.transport_type... yes
>> checking for ibv_create_xrc_rcv_qp... no
>> checking rdma/rdma_cma.h usability... yes
>> checking rdma/rdma_cma.h presence... yes
>> checking for rdma/rdma_cma.h... yes
>> checking for rdma_create_id in -lrdmacm... yes
>> checking for rdma_get_peer_addr... yes
>> checking for infiniband/driver.h... yes
>> checking if ConnectX XRC support is enabled... no
>> checking if OpenFabrics RDMACM support is enabled... yes
>> checking if OpenFabrics IBCM support is enabled... no
>> checking if MCA component btl:ofud can compile... yes
>>
>> --- MCA component btl:openib (m4 configuration macro)
>> checking for MCA component btl:openib compile mode... dso
>> checking --with-openib value... simple ok (unspecified)
>> checking --with-openib-libdir value... simple ok (unspecified)
>> checking for fcntl.h... (cached) yes
>> checking for sys/poll.h... (cached) yes
>> checking infiniband/verbs.h usability... yes
>> checking infiniband/verbs.h presence... yes
>> checking for infiniband/verbs.h... yes
>> looking for library without search path
>> checking for ibv_open_device in -libverbs... yes
>> checking number of arguments to ibv_create_cq... (cached) 5
>> checking whether IBV_EVENT_CLIENT_REREGISTER is declared... (cached) yes
>> checking for ibv_get_device_list... (cached) yes
>> checking for ibv_resize_cq... (cached) yes
>> checking for struct ibv_device.transport_type... (cached) yes
>> checking for ibv_create_xrc_rcv_qp... (cached) no
>> checking for rdma/rdma_cma.h... (cached) yes
>> checking for rdma_create_id in -lrdmacm... (cached) yes
>> checking for rdma_get_peer_addr... yes
>> checking for infiniband/driver.h... (cached) yes
>> checking if ConnectX XRC support is enabled... no
>> checking if OpenFabrics RDMACM support is enabled... yes
>> checking if OpenFabrics IBCM support is enabled... no
>> checking for ibv_fork_init... yes
>> checking for thread support (needed for ibcm/rdmacm)... posix
>> checking which openib btl cpcs will be built... oob rdmacm
>> checking if MCA component btl:openib can compile... yes
>>
>>
>> Compilers: gcc 4.1.2 and pgcc 8.0-4 (64 bit) show the same problems;
>> the optimization level (-fast, -O3 or -O0) does not matter.
>>
>> CPU: opteron 250
>> OS: Scientific linux 5.2
>>
>> If you require any more information, I'll be more than happy to provide
>> it!
>>
>> Is this a proper way to overlap communication with calculation? Could
>> this be some kind of cache-coherency problem, where values are already
>> in the cpu cache but rdma puts things in memory? Although in that case
>> I would not expect the sum to be that far off. What would happen if
>> the compiler decided to do non-temporal prefetches (or stores in the
>> general case)?
>>
>>
>>
>> The code:
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <mpi.h>
>>
>>
>> int main(int argc, char **argv)
>> {
>> int rank,size,i,j,k;
>> const int arrlen=10;
>> const int repeattest=100;
>> double *array;
>> MPI_Request *reqarr;
>> MPI_Status *mpistat;
>> MPI_Datatype STRIDED;
>> int torank,fromrank,nreq;
>> int sumshouldbe;
>> MPI_Init(&argc,&argv);
>> MPI_Comm_rank(MPI_COMM_WORLD,&rank);
>> MPI_Comm_size(MPI_COMM_WORLD,&size);
>>
>> /* Non-contiguous data */
>> MPI_Type_vector(arrlen,1,size,MPI_DOUBLE,&STRIDED);
>> MPI_Type_commit(&STRIDED);
>>
>> array=malloc(arrlen*size *sizeof *array);
>> reqarr=malloc(2*size*sizeof *reqarr);
>> mpistat=malloc(2*size*sizeof *mpistat);
>>
>> /* Setup communication */
>> sumshouldbe=0;
>> nreq=0;
>> for (i=1; i<size; i++)
>> {
>> torank=rank+i;
>> if (torank>=size)
>> torank-=size;
>> fromrank=rank-i;
>> if (fromrank<0)
>> fromrank+=size;
>> MPI_Recv_init(array+i,1,STRIDED,fromrank,i,MPI_COMM_WORLD,reqarr+nreq);
>> nreq++;
>> MPI_Send_init(array,1,STRIDED,torank,i,MPI_COMM_WORLD,reqarr+nreq);
>> nreq++;
>> sumshouldbe+=i;
>> }
>> printf("Sum should be %g\n",(double)arrlen*sumshouldbe);
>> /* Do the tests. */
>> for (j=0; j<repeattest; j++)
>> {
>> double sum=0.;
>> /* Init test arrays. Array on first process is initially all
>> zero. On second process all one, etc. Same as rank number. */
>> for (i=0; i<arrlen*size; i++)
>> array[i]=(double)rank;
>>
>> /* Start communication */
>> MPI_Startall(nreq,reqarr);
>>
>> /* Accumulate part of arrays that are not communicated. This
>> touches the parts of the arrays that are *not*
>> communicated!! */
>> for (i=0; i<arrlen; i++)
>> sum+=array[i*size];
>>
>> /* Wait for communication to finish */
>> MPI_Waitall(nreq,reqarr,mpistat);
>>
>> /* Accumulate part of arrays that have been communicated. */
>> for (i=0; i<arrlen*(size-1); i++)
>> {
>> for (k=0; k<size-1; k++)
>> sum+=array[i*size+1+k];
>> }
>>
>> if (sum!=arrlen*sumshouldbe)
>> printf("Result on rank %d strangely is %g\n",rank,sum);
>> }
>>
>> MPI_Finalize();
>> return 0;
>> }

-- 
Daniel Spångberg
Materialkemi
Uppsala Universitet