
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Stream interactions in CUDA
From: Dmitry N. Mikushin (maemarcus_at_[hidden])
Date: 2012-12-12 19:59:05


Hi Justin,

Quick grepping reveals several cuMemcpy calls in Open MPI. Some of them
are even synchronous, which means they are issued on the default stream
(stream 0).
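
To illustrate the distinction (an editorial sketch, not taken from Open
MPI's sources — the copy_stream name and its placement in your receive
loop are assumptions): a synchronous cudaMemcpy is serialized on the
default stream 0, while cudaMemcpyAsync on a user-created stream is not,
so the explicit DtoD copies in the code below could be moved off stream 0
like this:

```cuda
/* Sketch, assuming the CUDA runtime API. */
cudaStream_t copy_stream;
cudaStreamCreate(&copy_stream);

/* Inside the receive-processing loop, replacing the synchronous copy: */
cudaMemcpyAsync(output_buffer + recv_count,
                recv_buckets + MAX_RECV_SIZE_PER_PE * p,
                count * sizeof(uint128),
                cudaMemcpyDeviceToDevice,
                copy_stream);

/* After the loop, before the received data is consumed: */
cudaStreamSynchronize(copy_stream);
cudaStreamDestroy(copy_stream);
```

Note that this only affects the copies your own code issues; the internal
staging traffic that Open MPI itself generates is a separate matter.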

I think the best way to explore this sort of behavior is to run the
Open MPI runtime (thanks to its open-source nature!) under a debugger.
Rebuild Open MPI with -g -O0 and add an initial sleep() to your app, long
enough to let you gdb-attach to one of the MPI processes. Once attached,
first set a breakpoint at the beginning of your region of interest, and
then break on cuMemcpy and cuMemcpyAsync.

Best,
- D.

2012/12/13 Justin Luitjens <jluitjens_at_[hidden]>

> Hello,
>
> I'm working on an application using Open MPI with CUDA and GPUDirect. I
> would like the MPI transfers to overlap with computation on the CUDA
> device. To do this I need to ensure that no memory transfers go to
> stream 0. In this application, one step performs an MPI_Alltoall
> operation. Ideally I would like this Alltoall to be asynchronous, so I
> have implemented my own Alltoall using Isend and Irecv; it can be found
> at the bottom of this email.
>
> The profiler shows that this operation has some very odd PCI-E traffic
> that I was hoping someone could explain and help me eliminate. In this
> example NPES=2 and each process has its own M2090 GPU. I am using CUDA 5.0
> and OpenMPI-1.7rc5. The behavior I am seeing is the following: once the
> Isend loop runs, there is a sequence of DtoH transfers followed by HtoD
> transfers. These transfers are 256K in size and there are 28 of them.
> Each of these transfers is placed in stream0. After this there are a few
> more small transfers, also placed in stream0. Finally, when the 3rd loop
> runs, there are 2 DtoD transfers (this is the actual data being exchanged).
>
> Can anyone explain what all of the traffic ping-ponging back and forth
> between the host and device is? Is this traffic necessary?
>
> Thanks,
> Justin
>
>
> uint64_t scatter_gather(uint128 *input_buffer, uint128 *output_buffer,
>                         uint128 *recv_buckets, int *send_sizes,
>                         int MAX_RECV_SIZE_PER_PE) {
>   std::vector<MPI_Request> srequest(NPES), rrequest(NPES);
>
>   // Start receives
>   for (int p = 0; p < NPES; p++) {
>     MPI_Irecv(recv_buckets + MAX_RECV_SIZE_PER_PE * p, MAX_RECV_SIZE_PER_PE,
>               MPI_INT_128, p, 0, MPI_COMM_WORLD, &rrequest[p]);
>   }
>
>   // Start sends
>   int send_count = 0;
>   for (int p = 0; p < NPES; p++) {
>     MPI_Isend(input_buffer + send_count, send_sizes[p], MPI_INT_128, p, 0,
>               MPI_COMM_WORLD, &srequest[p]);
>     send_count += send_sizes[p];
>   }
>
>   // Process outstanding receives
>   int recv_count = 0;
>   for (int p = 0; p < NPES; p++) {
>     MPI_Status status;
>     MPI_Wait(&rrequest[p], &status);
>     int count;
>     MPI_Get_count(&status, MPI_INT_128, &count);
>     assert(count < MAX_RECV_SIZE_PER_PE);
>     cudaMemcpy(output_buffer + recv_count,
>                recv_buckets + MAX_RECV_SIZE_PER_PE * p,
>                count * sizeof(uint128), cudaMemcpyDeviceToDevice);
>     recv_count += count;
>   }
>
>   // Wait for outstanding sends
>   for (int p = 0; p < NPES; p++) {
>     MPI_Status status;
>     MPI_Wait(&srequest[p], &status);
>   }
>   return recv_count;
> }
>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>