[I sent this out in June but did not commit it, so I am resending. Timeout of Jan 5, 2012. Note that this does not use GPU Direct RDMA.]
WHAT: Add support for doing asynchronous copies of GPU memory with larger messages.
WHY: Improve performance when sending and receiving larger GPU messages over IB.
WHERE: ob1, openib, and convertor code. All changes are protected by compiler directives,
so there is no effect on non-CUDA builds.
REFERENCE BRANCH: https://bitbucket.org/rolfv/ompi-trunk-cuda-async-2
When sending/receiving GPU memory through IB, all data first passes into host memory.
The copy of GPU memory into and out of the host memory can be done asynchronously
to improve performance. This RFC adds that feature for the fragments of larger messages.
On the sending side, the completion function is essentially split in two. The first function
is called when the copy completes, and it initiates the send. The second function is called
when the send completes.
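
The two-stage completion on the sending side can be sketched roughly as below. This is
only an illustration in plain C with no CUDA or IB calls: every name in it (send_frag,
start_send, copy_complete_cb, send_complete_cb, send_progress) is hypothetical and does
not come from the ob1 or openib code, and the memcpy calls merely stand in for the
asynchronous device-to-host copy and for posting the send.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; none of these names come from ob1 or openib. */
enum send_state { SEND_IDLE, SEND_COPYING, SEND_SENDING, SEND_DONE };

struct send_frag {
    enum send_state state;
    char host_buf[64];   /* host staging buffer */
    char wire[64];       /* pretend network */
    size_t len;
    int callbacks_run;   /* how many completion callbacks have fired */
};

/* Start the (simulated) asynchronous copy of GPU memory into host memory. */
static void start_send(struct send_frag *f, const char *gpu_src, size_t len)
{
    assert(len <= sizeof f->host_buf);
    memcpy(f->host_buf, gpu_src, len);  /* stand-in for the async copy */
    f->len = len;
    f->callbacks_run = 0;
    f->state = SEND_COPYING;            /* copy is "in flight" until progress runs */
}

/* First half: runs when the copy has completed, and initiates the send. */
static void copy_complete_cb(struct send_frag *f)
{
    assert(f->state == SEND_COPYING);
    memcpy(f->wire, f->host_buf, f->len);  /* stand-in for posting the send */
    f->state = SEND_SENDING;
    f->callbacks_run++;
}

/* Second half: runs when the send has completed. */
static void send_complete_cb(struct send_frag *f)
{
    assert(f->state == SEND_SENDING);
    f->state = SEND_DONE;
    f->callbacks_run++;
}

/* One progress step: fire the callback for whichever operation just finished.
 * Returns 1 if a callback ran, 0 if there was nothing to do. */
static int send_progress(struct send_frag *f)
{
    switch (f->state) {
    case SEND_COPYING: copy_complete_cb(f); return 1;
    case SEND_SENDING: send_complete_cb(f); return 1;
    default:           return 0;
    }
}
```

Calling start_send() once and then send_progress() twice walks a fragment through
COPYING, SENDING, and DONE, with one callback fired per step. In a real build the
progress step would presumably poll for completed copy events rather than assume that
the copy has finished.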
Likewise, on the receiving side, a callback is invoked when the fragment arrives, and it
initiates the copy of the data out of the buffer. When the copy completes, a second
function is called, which also calls back into the BTL so that it can free the resources
that were in use.
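
The receiving side can be sketched in the same spirit. Again this is a plain C
illustration under stated assumptions: recv_frag, deliver_frag, frag_arrived_cb,
recv_copy_complete_cb, btl_release, and recv_progress are hypothetical names, the
release hook stands in for the call back into the BTL, and memcpy stands in for the
asynchronous host-to-device copy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct recv_frag;
typedef void (*release_fn)(struct recv_frag *);

/* Hypothetical stand-ins; none of these names come from ob1 or openib. */
struct recv_frag {
    char btl_buf[64];    /* host buffer owned by the transport */
    size_t len;
    char *gpu_dst;       /* pretend device destination */
    int copy_pending;    /* 1 while the copy out of btl_buf is in flight */
    int buffer_in_use;   /* 1 until the transport gets its buffer back */
    release_fn release;  /* hook back into the transport layer */
};

/* Stand-in for the BTL freeing the resources that the fragment was using. */
static void btl_release(struct recv_frag *f)
{
    f->buffer_in_use = 0;
}

/* First callback: the fragment has arrived, so start the copy out of the buffer. */
static void frag_arrived_cb(struct recv_frag *f, char *gpu_dst)
{
    f->gpu_dst = gpu_dst;
    f->release = btl_release;
    f->copy_pending = 1;  /* copy started, not yet complete */
}

/* Second callback: the copy has completed, so hand the buffer back to the BTL. */
static void recv_copy_complete_cb(struct recv_frag *f)
{
    f->copy_pending = 0;
    f->release(f);
}

/* Simulate the network delivering a fragment into the transport's buffer. */
static void deliver_frag(struct recv_frag *f, const char *data, size_t len,
                         char *gpu_dst)
{
    assert(len <= sizeof f->btl_buf);
    memcpy(f->btl_buf, data, len);
    f->len = len;
    f->buffer_in_use = 1;
    frag_arrived_cb(f, gpu_dst);
}

/* One progress step: finish a pending copy and fire its completion callback.
 * Returns 1 if a copy completed, 0 if there was nothing to do. */
static int recv_progress(struct recv_frag *f)
{
    if (!f->copy_pending)
        return 0;
    memcpy(f->gpu_dst, f->btl_buf, f->len);  /* stand-in for the async copy */
    recv_copy_complete_cb(f);
    return 1;
}
```

The point of the sketch is the ordering: the transport's buffer stays marked as in use
after the fragment arrives and is released only from the second callback, once the data
has actually left the buffer.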