George, Yes. GPUDirect eliminated an additional (host) memory buffering step between the HCA and the GPU that took CPU cycles.

I was never very comfortable with the kernel patch necessary, nor the patched OFED required to make it all work.  Having said that, it did provide a ~14% improvement in throughput on some of my code. Not bad.

Now comes GPUDirect 2.0 (mostly helping GPU-GPU across PCIe) and Unified Virtual Addressing. Holds great promise, but the real understanding comes from whitebox analysis, and instrumenting my app code.

On Wed, 2011-04-13 at 17:21 -0400, George Bosilca wrote:
On Apr 13, 2011, at 14:48 , Rolf vandeVaart wrote:

> This work does not depend on GPU Direct.  It is making use of the fact that one can malloc memory, register it with IB, and register it with CUDA via the new 4.0 API cuMemHostRegister API.  Then one can copy device memory into this memory.

Wasn't that the point behind GPUDirect? To allow direct memory copy between the GPU and the network card without external intervention?

  george.



_______________________________________________
devel mailing list
devel@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

=====================
Kenneth A. Lloyd
CEO - Director of Systems Science
Watt Systems Technologies Inc.
www.wattsys.com
kenneth.lloyd@wattsys.com

This e-mail is covered by the Electronic Communications Privacy Act, 18 U.S.C. 2510-2521 and is intended only for the addressee named above. It may contain privileged or confidential information. If you are not the addressee you must not copy, distribute, disclose or use any of the information in it. If you have received it in error please delete it and immediately notify the sender.