Open MPI logo

Open MPI User's Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Open MPI User's mailing list

Subject: Re: [OMPI users] bug in CUDA support for dual-processor systems?
From: Dmitry N. Mikushin (maemarcus_at_[hidden])
Date: 2012-08-02 01:09:29


Hi Zbigniew,

> a) I noticed that on my 6-GPU 2-CPU platform the initialization of CUDA 4.2 takes a looooong time, approx 10 seconds.
> Do you think I should report this as a bug to nVidia?

This is an expected time for creation of driver contexts on so many
devices. I'm sure NVIDIA already got thousands of reports on this :)
The typical answer is: keep alive context on GPU either by running an
X server or by executing "nvidia-smi -l 1" in background. With one of
these init time should drop down to ~1 sec or less.

- D.

2012/7/31 Zbigniew Koza <zzkoza_at_[hidden]>:
> Thanks for a quick reply.
>
> I do not know much about low-level CUDA and IPC,
> but there's no problem using high-level CUDA to determine if
> device A can talk to B via GPUDirect (cudaDeviceCanAccessPeer).
> Then, for such connections, one only needs to call
> cudaDeviceEnablePeerAccess
> and then essentially "sit back and laugh" - given correct current device
> and stream, functions like cudaMemcpyPeer work irrespectively of whether
> GPUDirect
> is on or off for a given pair of devices, the only difference being the
> speed.
> So, I hope it should be possible to implement device-IOH-IOH-device
> communication using low-level CUDA.
> Such functionality should be an important step in the "CPU-GPU
> high-performance war" :-),
> as 8-GPU fast-MPI-link systems bring a new meaning to a "GPU node" in GPU
> clusters...
>
> Here is the output of my test program that was aimed at determining
> a) aggregate, best-case transfer rate between 6 GPUs running in parallel and
> b) whether devices on different IOHs can talk to each other:
>
> 3 [GB] in 78.6952 [ms] = 38.1218 GB/s (aggregate)
> sending 600000000 bytes from device 0:
> 0 -> 0: 11.3454 [ms] 52.8848 GB/s
> 0 -> 1: 90.3628 [ms] 6.6399 GB/s
> 0 -> 2: 113.396 [ms] 5.29117 GB/s
> 0 -> 3: 113.415 [ms] 5.29032 GB/s
> 0 -> 4: 170.307 [ms] 3.52305 GB/s
> 0 -> 5: 169.613 [ms] 3.53747 GB/s
>
> This shows that even if devices are on different IOHs, like 0 and 4, they
> can talk to each other at a fantastic speed of 3.5 GB/s
> and it would be pity if OpenMPI did not used this opportunity.
>
> I have also 2 questions:
>
> a) I noticed that on my 6-GPU 2-CPU platform the initialization of CUDA 4.2
> takes a looooong time, approx 10 seconds.
> Do you think I should report this as a bug to nVidia?
>
> b) Is there any info on running OpenMPI + CUDA? For example, what are the
> dependencies of transfer rates and latencies on transfer size?
> A dedicated www page, blog or whatever? How can I know if the current
> problem was solved?
>
>
>
> Many thanks for making CUDA available in OpenMPI.
>
> Regards
>
> Z Koza
>
> W dniu 31.07.2012 19:39, Rolf vandeVaart pisze:
>
>> The current implementation does assume that the GPUs are on the same IOH
>> and therefore can use the IPC features of the CUDA library for
>> communication.
>> One of the initial motivations for this was that to be able to detect
>> whether GPUs can talk to one another, the CUDA library has to be initialized
>> and the GPUs have to be selected by each rank. It is at that point that we
>> can determine whether the IPC will work between the GPUs. However, this
>> means that the GPUs need to be selected by each rank prior to the call to
>> MPI_Init as that is where we determine whether IPC is possible, and we were
>> trying to avoid that requirement.
>>
>> I will submit a ticket against this and see if we can improve this.
>>
>> Rolf
>>
>>> -----Original Message-----
>>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
>>> On Behalf Of Zbigniew Koza
>>> Sent: Tuesday, July 31, 2012 12:38 PM
>>> To: users_at_[hidden]
>>> Subject: [OMPI users] bug in CUDA support for dual-processor systems?
>>>
>>> Hi,
>>>
>>> I wrote a simple program to see if OpenMPI can really handle cuda
>>> pointers as
>>> promised in the FAQ and how efficiently.
>>> The program (see below) breaks if MPI communication is to be performed
>>> between two devices that are on the same node but under different IOHs in
>>> a
>>> dual-processor Intel machine.
>>> Note that cudaMemCpy works for such devices, although not as efficiently
>>> as
>>> for the devices on the same IOH and GPUDirect enabled.
>>>
>>> Here's the output from my program:
>>>
>>> ===============================
>>>
>>>> mpirun -n 6 ./a.out
>>>
>>> Init
>>> Init
>>> Init
>>> Init
>>> Init
>>> Init
>>> rank: 1, size: 6
>>> rank: 2, size: 6
>>> rank: 3, size: 6
>>> rank: 4, size: 6
>>> rank: 5, size: 6
>>> rank: 0, size: 6
>>> device 3 is set
>>> Process 3 is on typhoon1
>>> Using regular memory
>>> device 0 is set
>>> Process 0 is on typhoon1
>>> Using regular memory
>>> device 4 is set
>>> Process 4 is on typhoon1
>>> Using regular memory
>>> device 1 is set
>>> Process 1 is on typhoon1
>>> Using regular memory
>>> device 5 is set
>>> Process 5 is on typhoon1
>>> Using regular memory
>>> device 2 is set
>>> Process 2 is on typhoon1
>>> Using regular memory
>>> ^C^[[A^C
>>> zkoza_at_typhoon1:~/multigpu$
>>> zkoza_at_typhoon1:~/multigpu$ vim cudamussings.c
>>> zkoza_at_typhoon1:~/multigpu$ mpicc cudamussings.c -lcuda -lcudart
>>> -L/usr/local/cuda/lib64 -I/usr/local/cuda/include
>>> zkoza_at_typhoon1:~/multigpu$ vim cudamussings.c
>>> zkoza_at_typhoon1:~/multigpu$ mpicc cudamussings.c -lcuda -lcudart
>>> -L/usr/local/cuda/lib64 -I/usr/local/cuda/include
>>> zkoza_at_typhoon1:~/multigpu$ mpirun -n 6 ./a.out Process 1 of 6 is on
>>> typhoon1 Process 2 of 6 is on typhoon1 Process 0 of 6 is on typhoon1
>>> Process
>>> 4 of 6 is on typhoon1 Process 5 of 6 is on typhoon1 Process 3 of 6 is on
>>> typhoon1 device 2 is set device 1 is set device 0 is set Using regular
>>> memory
>>> device 5 is set device 3 is set device 4 is set
>>> Host->device bandwidth for processor 1: 1587.993499 MB/sec device
>>> Host->bandwidth for processor 2: 1570.275316 MB/sec device bandwidth for
>>> Host->processor 3: 1569.890751 MB/sec device bandwidth for processor 5:
>>> Host->1483.637702 MB/sec device bandwidth for processor 0: 1480.888029
>>> Host->MB/sec device bandwidth for processor 4: 1476.241371 MB/sec
>>> MPI_Send/MPI_Receive, Host [0] -> Host [1] bandwidth: 3338.57 MB/sec
>>> MPI_Send/MPI_Receive, Device[0] -> Host [1] bandwidth: 420.85 MB/sec
>>> MPI_Send/MPI_Receive, Host [0] -> Device[1] bandwidth: 362.13 MB/sec
>>> MPI_Send/MPI_Receive, Device[0] -> Device[1] bandwidth: 6552.35 MB/sec
>>> MPI_Send/MPI_Receive, Host [0] -> Host [2] bandwidth: 3238.88 MB/sec
>>> MPI_Send/MPI_Receive, Device[0] -> Host [2] bandwidth: 418.18 MB/sec
>>> MPI_Send/MPI_Receive, Host [0] -> Device[2] bandwidth: 362.06 MB/sec
>>> MPI_Send/MPI_Receive, Device[0] -> Device[2] bandwidth: 5022.82 MB/sec
>>> MPI_Send/MPI_Receive, Host [0] -> Host [3] bandwidth: 3295.32 MB/sec
>>> MPI_Send/MPI_Receive, Device[0] -> Host [3] bandwidth: 418.90 MB/sec
>>> MPI_Send/MPI_Receive, Host [0] -> Device[3] bandwidth: 359.16 MB/sec
>>> MPI_Send/MPI_Receive, Device[0] -> Device[3] bandwidth: 5019.89 MB/sec
>>> MPI_Send/MPI_Receive, Host [0] -> Host [4] bandwidth: 4619.55 MB/sec
>>> MPI_Send/MPI_Receive, Device[0] -> Host [4] bandwidth: 419.24 MB/sec
>>> MPI_Send/MPI_Receive, Host [0] -> Device[4] bandwidth: 364.52 MB/sec
>>>
>>> --------------------------------------------------------------------------
>>> The call to cuIpcOpenMemHandle failed. This is an unrecoverable error and
>>> will cause the program to abort.
>>> cuIpcOpenMemHandle return value: 205
>>> address: 0x200200000
>>> Check the cuda.h file for what the return value means. Perhaps a reboot
>>> of
>>> the node will clear the problem.
>>>
>>> --------------------------------------------------------------------------
>>> [typhoon1:06098] Failed to register remote memory, rc=-1 [typhoon1:06098]
>>> [[33788,1],4] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 465
>>>
>>> ========================================================
>>>
>>>
>>>
>>> Comment:
>>> In my machine there are 2 six-core intel processors with HT on, yielding
>>> 24 virtual processors, and 6 Tesla C2070s.
>>> The devices are grouped in two groups, one with 4 and the other with 2
>>> devices.
>>> Devices in the same group can talk to each other via GPUDirect at approx
>>> 6GB/s; devices in different groups can use cudaMemCpy and UVA at
>>> somewhat smaller transfer rates.
>>>
>>>
>>> my OpenMPI is openmpi-1.9a1r26904 compiled from sources
>>>
>>> ./configure -prefix=/home/zkoza/openmpi.1.9.cuda
>>> --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/lib
>>>
>>>> nvcc -V
>>>
>>> nvcc: NVIDIA (R) Cuda compiler driver
>>> Copyright (c) 2005-2012 NVIDIA Corporation Built on
>>> Thu_Apr__5_00:24:31_PDT_2012 Cuda compilation tools, release 4.2,
>>> V0.2.1221
>>>
>>> gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)
>>>
>>> Ubuntu 12.04 64-bit
>>>
>>> Nvidia Driver Version: 295.41 |
>>>
>>> The program was compiled with:
>>>>
>>>> mpicc prog.c -lcuda -lcudart -L/usr/local/cuda/lib64
>>>> -I/usr/local/cuda/include
>>>
>>>
>>>
>>> ================================================
>>> SOURCE CODE:
>>> ================================================
>>>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <cuda.h>
>>> #include <cuda_runtime.h>
>>> #include <sys/time.h>
>>> #include <mpi.h>
>>>
>>> #define NREPEAT 20
>>> #define NBYTES 100000000
>>>
>>>
>>> #define CALL(x)\
>>> {\
>>> cudaError_t err = x;\
>>> if (cudaSuccess != err)\
>>> {\
>>> printf("CUDA ERROR %s at %d\n", cudaGetErrorString(err), __LINE__
>>> ); \
>>> cudaGetLastError();\
>>> }\
>>> }
>>>
>>> int main (int argc, char *argv[])
>>> {
>>> int rank, size, n, len, numbytes;
>>> void *a_h, *a_d;
>>> struct timeval time[2];
>>> double bandwidth;
>>> char name[MPI_MAX_PROCESSOR_NAME];
>>> MPI_Status status;
>>>
>>> MPI_Init (&argc, &argv);
>>> MPI_Comm_rank (MPI_COMM_WORLD, &rank);
>>> MPI_Comm_size (MPI_COMM_WORLD, &size);
>>> MPI_Get_processor_name(name, &len);
>>>
>>> printf("Process %d of %d is on %s\n", rank, size, name);
>>> fflush(stdout);
>>>
>>> CALL( cudaSetDevice(rank) );
>>> printf("device %d is set\n", rank);
>>> fflush(stdout);
>>>
>>> #ifdef PINNED
>>> if (rank == 0)
>>> printf("Using pinned memory \n");
>>> CALL( cudaMallocHost( (void **) &a_h, NBYTES) );
>>> #else
>>> if (rank == 0)
>>> printf("Using regular memory \n");
>>> a_h = malloc(NBYTES);
>>> #endif
>>> CALL( cudaMalloc( (void **) &a_d, NBYTES) );
>>>
>>> MPI_Barrier(MPI_COMM_WORLD);
>>>
>>> gettimeofday(&time[0], NULL);
>>> for (n=0; n<NREPEAT; n++ )
>>> {
>>> CALL( cudaMemcpy(a_d, a_h, NBYTES,
>>> cudaMemcpyHostToDevice) );
>>> }
>>> gettimeofday(&time[1], NULL);
>>>
>>> bandwidth = time[1].tv_sec - time[0].tv_sec;
>>> bandwidth += 1.e-6*(time[1].tv_usec - time[0].tv_usec);
>>> bandwidth = (double)NBYTES*NREPEAT/1.e6/bandwidth;
>>>
>>> printf("Host->device bandwidth for processor %d: %f MB/sec\n",
>>> rank, bandwidth);
>>>
>>> /* Test MPI send/recv bandwidth. */
>>>
>>> MPI_Barrier(MPI_COMM_WORLD);
>>>
>>> int i, proc;
>>> for (proc = 1; proc < size; proc++)
>>> {
>>> for (i = 0; i < 4; i++)
>>> {
>>> const int from_host = (i & 1) == 0;
>>> const int to_host = (i & 2) == 0;
>>> const char* tab[2] = {"Device", "Host "};
>>> void * ptr[2] = {a_d, a_h};;
>>>
>>> MPI_Barrier(MPI_COMM_WORLD);
>>> gettimeofday(&time[0], NULL);
>>> for (n=0; n<NREPEAT; n++)
>>> {
>>> if (rank == 0)
>>> MPI_Send(ptr[from_host],
>>> NBYTES/sizeof(int), MPI_INT, proc, 0, MPI_COMM_WORLD);
>>> else if (rank == proc)
>>> MPI_Recv(ptr[to_host],
>>> NBYTES/sizeof(int), MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
>>> }
>>>
>>> gettimeofday(&time[1], NULL);
>>> // printf("MPI status: %d\n", status);
>>>
>>> bandwidth = time[1].tv_sec - time[0].tv_sec;
>>> bandwidth += 1.e-6*(time[1].tv_usec -
>>> time[0].tv_usec);
>>> bandwidth = NBYTES*NREPEAT/1.e6/bandwidth;
>>> if (rank == 0)
>>> {
>>> printf("MPI_Send/MPI_Receive, %s[%d]
>>> -> %s[%d] bandwidth: %4.2f MB/sec\n",
>>> tab[from_host],
>>> 0, tab[to_host], proc, bandwidth);
>>> fflush(stdout);
>>> }
>>> }
>>> }
>>> #ifdef PINNED
>>> CALL( cudaFreeHost(a_h) );
>>> #else
>>> free(a_h);
>>> #endif
>>> CALL( cudaFree(a_d) ) ;
>>>
>>> MPI_Finalize();
>>> return 0;
>>> }
>>>
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> -----------------------------------------------------------------------------------
>> This email message is for the sole use of the intended recipient(s) and
>> may contain
>> confidential information. Any unauthorized review, use, disclosure or
>> distribution
>> is prohibited. If you are not the intended recipient, please contact the
>> sender by
>> reply email and destroy all copies of the original message.
>>
>> -----------------------------------------------------------------------------------
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users