I am trying to play with NVIDIA's gpudirect. The test program shipped with
the gpudirect tarball just does a basic MPI ping-pong between two
processes that allocate their buffers with cudaMallocHost instead of
malloc. It seems to work with Intel MPI but Open MPI 1.5 hangs in the
first MPI_Send. Replacing the cuda buffer with a normally-malloc'ed
buffer makes the program work again. I assume that something goes wrong
when OMPI tries to register/pin the cuda buffer in the IB stack (that's
what gpudirect seems to be about), but I don't see why Intel MPI would
succeed there. Has anybody ever looked at this?
FWIW, we're using OMPI 1.5, OFED 1.5.2, Intel MPI 126.96.36.199 and SLES11 w/
and w/o the gpudirect patch.