WHAT: Second attempt at adding support for sending data directly from CUDA device memory via MPI calls.
DETAILS: Based on all the feedback (thanks to everyone who looked at it), I have whittled down what I hope to accomplish with this RFC. There were suggestions to better modularize the CUDA registration code. Since the registration code is a performance feature, it is dropped from this RFC and will be investigated separately. This significantly reduces the changes being proposed here. With this RFC, all the changes are isolated in the datatype and convertor code. As mentioned before, the changes mostly boil down to replacing memcpy with cuMemcpy when moving data to or from a CUDA device buffer.
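As a rough illustration of the memcpy replacement described above (a sketch only, not the code in the branch): the convertor could carry a copy routine that is chosen once, when it is known whether the user buffer lives in device memory. All names below (sketch_convertor_t, sketch_pack, and so on) are hypothetical, and the device copy is a stand-in built on memcpy so that the example compiles without CUDA. In a CUDA build that routine would wrap cuMemcpy instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical signature shared by the host and device copy routines. */
typedef void *(*sketch_copy_fn_t)(void *dst, const void *src, size_t len);

/* Hypothetical, heavily reduced convertor: only what the sketch needs. */
typedef struct {
    int              is_device_buffer; /* nonzero if the user buffer is in device memory */
    sketch_copy_fn_t copy;             /* routine used for every pack/unpack copy */
} sketch_convertor_t;

/* Counts device-path copies so the dispatch can be observed. */
static size_t sketch_device_copy_calls = 0;

/* Plain host copy. */
static void *sketch_host_copy(void *dst, const void *src, size_t len)
{
    return memcpy(dst, src, len);
}

/* Stand-in for the device path. ASSUMPTION: a real build would call
 * cuMemcpy here; memcpy is used only to keep the sketch CUDA-free. */
static void *sketch_device_copy(void *dst, const void *src, size_t len)
{
    sketch_device_copy_calls++;
    return memcpy(dst, src, len);
}

/* Pick the copy routine once, when the convertor is set up. */
static void sketch_convertor_init(sketch_convertor_t *conv, int is_device_buffer)
{
    assert(conv != NULL);
    conv->is_device_buffer = is_device_buffer;
    conv->copy = is_device_buffer ? sketch_device_copy : sketch_host_copy;
}

/* Pack len bytes from the user buffer through whichever routine was chosen. */
static void sketch_pack(const sketch_convertor_t *conv,
                        void *packed, const void *user_buf, size_t len)
{
    assert(conv != NULL && conv->copy != NULL);
    conv->copy(packed, user_buf, len);
}
```

The point of the indirection is that the pack and unpack loops stay unchanged. Only the routine they call differs between host and device buffers.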
Per suggestions, the choice to disable the large memory RDMA now happens on a per-message basis. This is done by adding a flag to the convertor that tells the BTLs an intermediate buffer is needed when dealing with device memory.
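To make the per-message decision concrete, here is a minimal sketch of how such a flag might be set and consulted. The flag name, the struct, and the two helper functions are all hypothetical and are not the names used in the actual convertor or BTL code. The sketch only shows the shape of the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit: the message buffer lives in device memory. */
#define SKETCH_CONVERTOR_DEVICE_BUF 0x1u

/* Hypothetical, reduced per-message convertor state. */
typedef struct {
    uint32_t flags;
} sketch_msg_convertor_t;

/* The two transfer paths a BTL could pick from in this sketch. */
typedef enum {
    SKETCH_PATH_RDMA,   /* transfer directly from the user buffer */
    SKETCH_PATH_STAGED  /* copy through an intermediate host buffer */
} sketch_path_t;

/* Set up the convertor for one message; the flag is decided per message. */
static void sketch_msg_prepare(sketch_msg_convertor_t *conv, bool buffer_on_device)
{
    assert(conv != NULL);
    conv->flags = buffer_on_device ? SKETCH_CONVERTOR_DEVICE_BUF : 0u;
}

/* What a BTL might do: check the flag and skip RDMA for device buffers. */
static sketch_path_t sketch_choose_path(const sketch_msg_convertor_t *conv)
{
    assert(conv != NULL);
    if (conv->flags & SKETCH_CONVERTOR_DEVICE_BUF) {
        return SKETCH_PATH_STAGED;
    }
    return SKETCH_PATH_RDMA;
}
```

Because the flag lives on the convertor, a host-memory message sent right after a device-memory message still takes the RDMA path.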
As before, this code would be enabled via a configure option. A mostly complete version is viewable on Bitbucket, although I know the configure code is sorely lacking.
This is the new list of changed files.