George, yes. GPUDirect eliminated an extra host-memory buffering step between the HCA and the GPU, one that cost CPU cycles.

I was never very comfortable with the kernel patch it needed, nor with the patched OFED required to make it all work. Having said that, it did give a ~14% improvement in throughput on some of my code. Not bad.

Now comes GPUDirect 2.0 (mostly helping GPU-to-GPU transfers across PCIe) and Unified Virtual Addressing. Both hold great promise, but the real understanding will come from white-box analysis and from instrumenting my app code.

> This work does not depend on GPU Direct.  It is making use of the fact that one can malloc memory, register it with IB, and register it with CUDA via the new 4.0 cuMemHostRegister API.  Then one can copy device memory into this memory.

Wasn't that the point behind GPUDirect: to allow direct memory copies between the GPU and the network card without external intervention?


