Open MPI logo

Open MPI User's Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Open MPI User's mailing list

Subject: [OMPI users] Stream interactions in CUDA
From: Justin Luitjens (jluitjens_at_[hidden])
Date: 2012-12-12 19:30:56


Hello,

I'm working on an application using OpenMPI with CUDA and GPUDirect. I would like to get the MPI transfers to overlap with computation on the CUDA device. To do this I need to ensure that all memory transfers do not go to stream 0. In this application I have one step that performs an MPI_Alltoall operation. Ideally I would like this Alltoall operation to be asynchronous. Thus I have implemented my own Alltoall using Isend and Irecv. Which can be found at the bottom of this email.

The profiler shows that this operation has some very odd PCI-E traffic that I was hoping someone could explain and help me eliminate. In this example NPES=2 and each process has its own M2090 GPU. I am using cuda 5.0 and OpenMPI-1.7rc5. The behavior I am seeing is the following. Once the Isend loop occurs there is a sequence of DtoH followed by HtoD transfers. These transfers are 256K in size and there are 28 of them that occur. Each of these transfers are placed in stream0. After this there are a few more small transfers also placed in stream0. Finally when the 3rd loop occurs there are 2 DtoD transfers (this is the actual data being exchanged).

Can anyone explain what all of the traffic ping-ponging back and forth between the host and device is? Is this traffic necessary?

Thanks,
Justin

uint64_t scatter_gather( uint128 * input_buffer, uint128 *output_buffer, uint128 *recv_buckets, int* send_sizes, int MAX_RECV_SIZE_PER_PE) {

  std::vector<MPI_Request> srequest(NPES), rrequest(NPES);

  //Start receives
  for(int p=0;p<NPES;p++) {
    MPI_Irecv(recv_buckets+MAX_RECV_SIZE_PER_PE*p,MAX_RECV_SIZE_PER_PE,MPI_INT_128,p,0,MPI_COMM_WORLD,&rrequest[p]);
  }

  //Start sends
  int send_count=0;
  for(int p=0;p<NPES;p++) {
    MPI_Isend(input_buffer+send_count,send_sizes[p],MPI_INT_128,p,0,MPI_COMM_WORLD,&srequest[p]);
    send_count+=send_sizes[p];
  }

  //Process outstanding receives
  int recv_count=0;
  for(int p=0;p<NPES;p++) {
    MPI_Status status;
    MPI_Wait(&rrequest[p],&status);
    int count;
    MPI_Get_count(&status,MPI_INT_128,&count);
    assert(count<MAX_RECV_SIZE_PER_PE);
    cudaMemcpy(output_buffer+recv_count,recv_buckets+MAX_RECV_SIZE_PER_PE*p,count*sizeof(uint128),cudaMemcpyDeviceToDevice);
    recv_count+=count;
  }

  //Wait for outstanding sends
  for(int p=0;p<NPES;p++) {
    MPI_Status status;
    MPI_Wait(&srequest[p],&status);
  }
  return recv_count;
}

-----------------------------------------------------------------------------------
This email message is for the sole use of the intended recipient(s) and may contain
confidential information. Any unauthorized review, use, disclosure or distribution
is prohibited. If you are not the intended recipient, please contact the sender by
reply email and destroy all copies of the original message.
-----------------------------------------------------------------------------------