Does Open MPI take advantage of CUDA's recent peer-to-peer capabilities
when, e.g., you are running one rank per GPU on a single workstation?
In my testing I only get on the order of 1 GB/s, whereas NVIDIA's
simpleP2P test indicates ~6 GB/s.
Also, I ran into a problem just trying to test this. It seems you have
to call cudaSetDevice/cuCtxCreate with the appropriate GPU id, which I
wanted to derive from the rank. However, you don't know the rank until
after MPI_Init(), and you need to initialize CUDA before that. Is there
a standard way to do this? I have a workaround at the moment.
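
For reference, one possible approach (a rough sketch, not necessarily
the standard way) is to read the node-local rank from the environment
before MPI_Init(). This assumes the job is started with Open MPI's
mpirun and that your version exports OMPI_COMM_WORLD_LOCAL_RANK. The
helper names local_rank_from_env and device_for_local_rank are made up
for illustration, and the round-robin mapping is just one simple policy.

```c
#include <stdlib.h>

/* Returns the node-local rank exported by the launcher, or -1 if the
 * variable is missing or malformed.
 * Assumption: the job is launched by Open MPI's mpirun, which sets
 * OMPI_COMM_WORLD_LOCAL_RANK in each process's environment. */
int local_rank_from_env(void)
{
    const char *value = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    if (value == NULL || *value == '\0')
        return -1;

    char *end = NULL;
    long rank = strtol(value, &end, 10);
    if (*end != '\0' || rank < 0)
        return -1;
    return (int)rank;
}

/* Maps a local rank onto a device id round-robin, so that more ranks
 * than GPUs still yields a valid id. Returns -1 on invalid input. */
int device_for_local_rank(int local_rank, int device_count)
{
    if (local_rank < 0 || device_count <= 0)
        return -1;
    return local_rank % device_count;
}

/* Intended call order in main(), before any MPI call:
 *
 *     int count = 0;
 *     cudaGetDeviceCount(&count);
 *     int dev = device_for_local_rank(local_rank_from_env(), count);
 *     if (dev >= 0)
 *         cudaSetDevice(dev);
 *     MPI_Init(&argc, &argv);
 */
```

If the variable is not set (for example when the program is run without
mpirun), local_rank_from_env() returns -1 and the caller can fall back
to whatever default device it prefers.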