Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] GPUDirect v1 issues
From: Sebastian Rinke (s.rinke_at_[hidden])
Date: 2012-01-20 12:20:39


With

* MLNX OFED stack tailored for GPUDirect
* RHEL + kernel patch
* MVAPICH2

it is possible to monitor GPUDirect v1 activity by watching for changes to the values in

* /sys/module/ib_core/parameters/gpu_direct_pages
* /sys/module/ib_core/parameters/gpu_direct_shares

When CUDA_NIC_INTEROP=1 is set, these values no longer change.

Is there now a different way to check whether GPUDirect actually works?
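
For the patched-stack setup listed above, checking those two values could be sketched roughly as follows. This is only a minimal sketch in C: it assumes each parameter file holds a single integer as text, and read_counter and counter_delta are hypothetical helper names, not part of any GPUDirect or OFED package.

```c
#include <stdio.h>

/* Read one integer from a sysfs-style text file (assumed format: a single
 * decimal number). Returns 0 and stores the value in *out on success,
 * -1 if the file cannot be opened or parsed. */
int read_counter(const char *path, long *out)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int parsed = (fscanf(f, "%ld", out) == 1);
    fclose(f);
    return parsed ? 0 : -1;
}

/* Re-read the counter and store (current value - before) in *delta.
 * Returns 0 on success, -1 if the file cannot be read. */
int counter_delta(const char *path, long before, long *delta)
{
    long now;
    if (read_counter(path, &now) != 0)
        return -1;
    *delta = now - before;
    return 0;
}
```

Calling read_counter() on both paths before a transfer and counter_delta() afterwards would show whether the values moved. An unreadable file yields -1 rather than a count.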

Sebastian.

On Jan 18, 2012, at 5:06 PM, Kenneth Lloyd wrote:

> It is documented in http://developer.download.nvidia.com/compute/cuda/4_0/docs/GPUDirect_Technology_Overview.pdf
> set CUDA_NIC_INTEROP=1
>
>
> From: devel-bounces_at_[hidden] [mailto:devel-bounces_at_[hidden]] On Behalf Of Sebastian Rinke
> Sent: Wednesday, January 18, 2012 8:15 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] GPUDirect v1 issues
>
> Setting the environment variable fixed the problem for Open MPI with CUDA 4.0. Thanks!
>
> However, I'm wondering why this is not documented in the NVIDIA GPUDirect package.
>
> Sebastian.
>
> On Jan 18, 2012, at 1:28 AM, Rolf vandeVaart wrote:
>
>
> Yes, the step outlined in your second bullet is no longer necessary.
>
> Rolf
>
>
> From: devel-bounces_at_[hidden] [mailto:devel-bounces_at_[hidden]] On Behalf Of Sebastian Rinke
> Sent: Tuesday, January 17, 2012 5:22 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] GPUDirect v1 issues
>
> Thank you very much. I will try setting the environment variable and if required also use the 4.1 RC2 version.
>
> To clarify things a little bit for me, to set up my machine with GPUDirect v1 I did the following:
>
> * Install RHEL 5.4
> * Use the kernel with GPUDirect support
> * Use the MLNX OFED stack with GPUDirect support
> * Install the CUDA developer driver
>
> Does using CUDA >= 4.0 make one of the above steps redundant?
>
> I.e., is RHEL, the kernel with GPUDirect support, or the MLNX OFED stack with GPUDirect support no longer needed?
>
> Sebastian.
>
> Rolf vandeVaart wrote:
> I ran your test case against Open MPI 1.4.2 and CUDA 4.1 RC2 and it worked fine. I do not have a machine right now where I can load CUDA 4.0 drivers.
> Any chance you can try CUDA 4.1 RC2? There were some improvements in the support (you do not need to set an environment variable for one)
> http://developer.nvidia.com/cuda-toolkit-41
>
> There is also a chance that setting the environment variable as outlined in this link may help you.
> http://forums.nvidia.com/index.php?showtopic=200629
>
> However, I cannot explain why MVAPICH would work and Open MPI would not.
>
> Rolf
>
>
> -----Original Message-----
> From: devel-bounces_at_[hidden] [mailto:devel-bounces_at_[hidden]]
> On Behalf Of Sebastian Rinke
> Sent: Tuesday, January 17, 2012 12:08 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] GPUDirect v1 issues
>
> I use CUDA 4.0 with MVAPICH2 1.5.1p1 and Open MPI 1.4.2.
>
> Attached is a small test case based on the GPUDirect v1 test case
> (mpi_pinned.c).
> In that program the sender splits a message into chunks and sends them
> separately to the receiver, which posts the corresponding recvs. It is a
> kind of pipelining.
>
> In mpi_pinned.c:141 the offsets into the recv buffer are set.
> With the correct offsets, i.e. increasing ones, the program blocks with Open MPI.
>
> Using line 142 instead (offset = 0) works.
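
The offset pattern described above (each chunk received at an increasing position in the recv buffer) can be sketched without MPI as below. This is not the attached mpi_pinned.c; chunk_count, chunk_offset and chunk_length are hypothetical names, and the receiver is assumed to post one recv per chunk at recv_buf + chunk_offset(i, chunk_bytes).

```c
#include <stddef.h>

/* Number of chunks needed to cover total_bytes (the last chunk may be short). */
size_t chunk_count(size_t total_bytes, size_t chunk_bytes)
{
    if (chunk_bytes == 0)
        return 0;
    return (total_bytes + chunk_bytes - 1) / chunk_bytes;
}

/* Byte offset of chunk i from the start of the buffer. */
size_t chunk_offset(size_t i, size_t chunk_bytes)
{
    return i * chunk_bytes;
}

/* Length in bytes of chunk i; 0 if chunk i lies past the end of the message. */
size_t chunk_length(size_t i, size_t total_bytes, size_t chunk_bytes)
{
    size_t off = chunk_offset(i, chunk_bytes);
    if (off >= total_bytes)
        return 0;
    size_t rest = total_bytes - off;
    return rest < chunk_bytes ? rest : chunk_bytes;
}
```

Using offset 0 for every chunk, as in the workaround above, would correspond to ignoring chunk_offset() and receiving each chunk at the start of the buffer.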
>
> The attached tarball contains a Makefile in which you will have to adjust
>
> * CUDA_INC_DIR
> * CUDA_LIB_DIR
>
> Sebastian
>
> On Jan 17, 2012, at 4:16 PM, Kenneth A. Lloyd wrote:
>
>
> Also, which version of MVAPICH2 did you use?
>
> I've been poring over Rolf's OpenMPI CUDA RDMA 3 (using CUDA 4.1 r2)
> vis-à-vis MVAPICH-GPU on a small 3-node cluster. These are wickedly interesting.
>
> Ken
> -----Original Message-----
> From: devel-bounces_at_[hidden] [mailto:devel-bounces_at_open-mpi.org]
> On Behalf Of Rolf vandeVaart
> Sent: Tuesday, January 17, 2012 7:54 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] GPUDirect v1 issues
>
> I am not aware of any issues. Can you send me a test program and I
> can try it out?
> Which version of CUDA are you using?
>
> Rolf
>
>
> -----Original Message-----
> From: devel-bounces_at_[hidden] [mailto:devel-bounces_at_open-mpi.org]
> On Behalf Of Sebastian Rinke
> Sent: Tuesday, January 17, 2012 8:50 AM
> To: Open MPI Developers
> Subject: [OMPI devel] GPUDirect v1 issues
>
> Dear all,
>
> I'm using GPUDirect v1 with Open MPI 1.4.3 and see blocking
> MPI_SEND/RECV calls hang forever.
>
> For two subsequent MPI_RECV calls, it hangs if the buffer pointer of
> the second recv points somewhere inside the recv buffer, i.e. not at
> its beginning (the buffer was previously allocated with cudaMallocHost()).
>
> I tried the same with MVAPICH2 and did not see the problem.
>
> Does anybody know about issues with GPUDirect v1 using Open MPI?
>
> Thanks for your help,
> Sebastian
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel