Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] problem with overlapping communication with calculation
From: Daniel Spångberg (daniels_at_[hidden])
Date: 2009-03-25 09:52:16


Dear list,

A colleague pointed out an error in my test code. The final loop should
not be
  for (i=0; i<arrlen*(size-1); i++)
but rather
  for (i=0; i<arrlen; i++)
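
In context (the inner loop is unchanged from the original posting), the
end of the test program should thus read:

  for (i=0; i<arrlen; i++)
    {
      for (k=0; k<size-1; k++)
        sum+=array[i*size+1+k];
    }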

Details, details... Anyway, I still get problems from time to time with
this test code, but I have not yet had time to figure out the
circumstances under which they occur. I will report back to this list
once I know what's going on.

Sorry to trouble you too early!

Daniel Spångberg

On 2009-03-25 09:44:37, Daniel Spångberg <daniels_at_[hidden]> wrote:

> Dear list,
>
> We've found a problem with Open MPI when running over IB: when a
> calculation reading elements of an array overlaps communication to
> other elements of the same array (elements that are not used in the
> calculation), the results are corrupted. I have written a small test
> program (below) that shows this behaviour. When the array is small
> (arrlen in the code), problems occur more often. The problems only
> occur when using IB (even on the same node!?); with mpirun -mca btl
> tcp,self the problem vanishes.
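>
> For example, running the test program below over tcp,self on 4
> processes gives only the correct output:
>
>    mpirun -np 4 -mca btl tcp,self ./test3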
>
> The behaviour differs slightly between 1.2.9 and 1.3.1: problems
> already occur with 3 processes under Open MPI 1.2.9, whereas 4
> processes are needed to trigger them with 1.3.1. The correct output on
> 4 processes should just be:
> Sum should be 60
> Sum should be 60
> Sum should be 60
> Sum should be 60
>
> With IB:
> mpirun -np 4 ./test3|head
> Sum should be 60
> Sum should be 60
> Sum should be 60
> Sum should be 60
> Result on rank 0 strangely is 1.06316e+248
> Result on rank 2 strangely is 1.54396e+262
> Result on rank 3 strangely is 3.87325e+233
> Result on rank 1 strangely is 1.54396e+262
> Result on rank 1 strangely is 1.54396e+262
> Result on rank 2 strangely is 1.54396e+262
>
>
> Info about the system:
>
> openmpi: 1.2.9, 1.3.1
>
> From ompi_info:
> MCA btl: openib (MCA v2.0, API v2.0, Component v1.3.1)
>
> From lspci:
> 04:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1)
>
> configure picks up ibverbs:
> --- MCA component btl:ofud (m4 configuration macro)
> checking for MCA component btl:ofud compile mode... dso
> checking --with-openib value... simple ok (unspecified)
> checking --with-openib-libdir value... simple ok (unspecified)
> checking for fcntl.h... (cached) yes
> checking sys/poll.h usability... yes
> checking sys/poll.h presence... yes
> checking for sys/poll.h... yes
> checking infiniband/verbs.h usability... yes
> checking infiniband/verbs.h presence... yes
> checking for infiniband/verbs.h... yes
> looking for library without search path
> checking for ibv_open_device in -libverbs... yes
> checking number of arguments to ibv_create_cq... 5
> checking whether IBV_EVENT_CLIENT_REREGISTER is declared... yes
> checking for ibv_get_device_list... yes
> checking for ibv_resize_cq... yes
> checking for struct ibv_device.transport_type... yes
> checking for ibv_create_xrc_rcv_qp... no
> checking rdma/rdma_cma.h usability... yes
> checking rdma/rdma_cma.h presence... yes
> checking for rdma/rdma_cma.h... yes
> checking for rdma_create_id in -lrdmacm... yes
> checking for rdma_get_peer_addr... yes
> checking for infiniband/driver.h... yes
> checking if ConnectX XRC support is enabled... no
> checking if OpenFabrics RDMACM support is enabled... yes
> checking if OpenFabrics IBCM support is enabled... no
> checking if MCA component btl:ofud can compile... yes
>
> --- MCA component btl:openib (m4 configuration macro)
> checking for MCA component btl:openib compile mode... dso
> checking --with-openib value... simple ok (unspecified)
> checking --with-openib-libdir value... simple ok (unspecified)
> checking for fcntl.h... (cached) yes
> checking for sys/poll.h... (cached) yes
> checking infiniband/verbs.h usability... yes
> checking infiniband/verbs.h presence... yes
> checking for infiniband/verbs.h... yes
> looking for library without search path
> checking for ibv_open_device in -libverbs... yes
> checking number of arguments to ibv_create_cq... (cached) 5
> checking whether IBV_EVENT_CLIENT_REREGISTER is declared... (cached) yes
> checking for ibv_get_device_list... (cached) yes
> checking for ibv_resize_cq... (cached) yes
> checking for struct ibv_device.transport_type... (cached) yes
> checking for ibv_create_xrc_rcv_qp... (cached) no
> checking for rdma/rdma_cma.h... (cached) yes
> checking for rdma_create_id in -lrdmacm... (cached) yes
> checking for rdma_get_peer_addr... yes
> checking for infiniband/driver.h... (cached) yes
> checking if ConnectX XRC support is enabled... no
> checking if OpenFabrics RDMACM support is enabled... yes
> checking if OpenFabrics IBCM support is enabled... no
> checking for ibv_fork_init... yes
> checking for thread support (needed for ibcm/rdmacm)... posix
> checking which openib btl cpcs will be built... oob rdmacm
> checking if MCA component btl:openib can compile... yes
>
>
> Compilers: gcc 4.1.2 and pgcc 8.0-4 show the same problems; the
> optimization level does not matter (-fast, -O3, or -O0) (64 bit).
>
> CPU: Opteron 250
> OS: Scientific Linux 5.2
>
> If you require any more information, I'll be more than happy to provide
> it!
>
> Is this a proper way to overlap communication with calculation? Could
> this be some kind of cache-coherency problem? The values are already
> in the CPU cache, but RDMA puts the new data directly into main
> memory; although in that case I would not expect the sum to be that
> far off. What would happen if the compiler decided to do non-temporal
> prefetches (or stores, in the general case)?
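>
> For comparison, here is a sketch of a variant I have been considering
> (hypothetical, not what the test program below does): receive into a
> separate contiguous scratch buffer, so the overlapped computation never
> reads an array that RDMA may be writing into, and unpack only after
> MPI_Waitall:
>
>    /* Sketch only: replaces the Recv_init calls in the setup loop.
>       recvbuf is a new, hypothetical scratch buffer. */
>    double *recvbuf=malloc(arrlen*(size-1)*sizeof *recvbuf);
>
>    for (i=1; i<size; i++)
>      {
>        /* torank/fromrank computed as in the original code */
>        MPI_Recv_init(recvbuf+(i-1)*arrlen,arrlen,MPI_DOUBLE,
>                      fromrank,i,MPI_COMM_WORLD,reqarr+nreq);
>        nreq++;
>        MPI_Send_init(array,1,STRIDED,torank,i,MPI_COMM_WORLD,reqarr+nreq);
>        nreq++;
>      }
>
>    /* MPI_Startall, local accumulation and MPI_Waitall as before,
>       then unpack the received values into the strided positions: */
>    for (i=1; i<size; i++)
>      for (k=0; k<arrlen; k++)
>        array[k*size+i]=recvbuf[(i-1)*arrlen+k];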
>
>
>
> The code:
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>    int rank,size,i,j,k;
>    const int arrlen=10;
>    const int repeattest=100;
>    double *array;
>    MPI_Request *reqarr;
>    MPI_Status *mpistat;
>    MPI_Datatype STRIDED;
>    int torank,fromrank,nreq;
>    int sumshouldbe;
>    MPI_Init(&argc,&argv);
>    MPI_Comm_rank(MPI_COMM_WORLD,&rank);
>    MPI_Comm_size(MPI_COMM_WORLD,&size);
>
>    /* Non-contiguous data: STRIDED picks out every size-th double,
>       arrlen of them in total. */
>    MPI_Type_vector(arrlen,1,size,MPI_DOUBLE,&STRIDED);
>    MPI_Type_commit(&STRIDED);
>
>    array=malloc(arrlen*size*sizeof *array);
>    reqarr=malloc(2*size*sizeof *reqarr);
>    mpistat=malloc(2*size*sizeof *mpistat);
>
>    /* Setup communication */
>    sumshouldbe=0;
>    nreq=0;
>    for (i=1; i<size; i++)
>      {
>        torank=rank+i;
>        if (torank>=size)
>          torank-=size;
>        fromrank=rank-i;
>        if (fromrank<0)
>          fromrank+=size;
>        MPI_Recv_init(array+i,1,STRIDED,fromrank,i,MPI_COMM_WORLD,reqarr+nreq);
>        nreq++;
>        MPI_Send_init(array,1,STRIDED,torank,i,MPI_COMM_WORLD,reqarr+nreq);
>        nreq++;
>        sumshouldbe+=i;
>      }
>    printf("Sum should be %g\n",(double)arrlen*sumshouldbe);
>    /* Do the tests. */
>    for (j=0; j<repeattest; j++)
>      {
>        double sum=0.;
>        /* Init test arrays. Array on first process is initially all
>           zero. On second process all one, etc. Same as rank number. */
>        for (i=0; i<arrlen*size; i++)
>          array[i]=(double)rank;
>
>        /* Start communication */
>        MPI_Startall(nreq,reqarr);
>
>        /* Accumulate part of arrays that are not communicated. This
>           touches the parts of the arrays that are *not*
>           communicated!! */
>        for (i=0; i<arrlen; i++)
>          sum+=array[i*size];
>
>        /* Wait for communication to finish */
>        MPI_Waitall(nreq,reqarr,mpistat);
>
>        /* Accumulate part of arrays that have been communicated.
>           NB: this bound is wrong and reads past the end of array;
>           per the correction at the top of this page it should be
>           i<arrlen. */
>        for (i=0; i<arrlen*(size-1); i++)
>          {
>            for (k=0; k<size-1; k++)
>              sum+=array[i*size+1+k];
>          }
>
>        if (sum!=arrlen*sumshouldbe)
>          printf("Result on rank %d strangely is %g\n",rank,sum);
>      }
>
>    MPI_Finalize();
>    return 0;
> }
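>
> The program can be compiled and run with the usual Open MPI wrappers,
> e.g. (assuming the source file is named test3.c):
>
>    mpicc test3.c -o test3
>    mpirun -np 4 ./test3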

-- 
Daniel Spångberg
Materialkemi
Uppsala Universitet