Thank you! I saw your earlier discussion and have already tried "--mca btl_openib_flags 304".
Unfortunately, it didn't solve the problem. In our case, the MPI buffer is already separate from the
cudaMemcpy buffer, and we copy between them manually. I'm still trying to figure out how to configure
Open MPI's MCA parameters to solve the problem...
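
To make the setup concrete, here is a minimal sketch of that two-buffer pattern. It is not our actual code: the buffer names, the 2 MB message size, and the two-rank layout are made up for illustration, and error checking is left out. It assumes a build against both the MPI and CUDA runtime libraries and a run with two ranks.

```cuda
/* Sketch only: one host buffer for CUDA, a separate one for MPI,
 * and an explicit memcpy between them. All names are hypothetical. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>
#include <string.h>

#define N (512 * 1024)  /* 512K floats = 2 MB, i.e. above 1MB (illustrative) */

int main(int argc, char **argv)
{
    int rank;
    size_t bytes = N * sizeof(float);
    float *dev_buf, *cuda_host_buf, *mpi_buf;
    cudaStream_t stream;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&dev_buf, bytes);
    cudaMallocHost((void **)&cuda_host_buf, bytes); /* pinned, touched only by CUDA */
    mpi_buf = (float *)malloc(bytes);               /* handed only to MPI */
    cudaStreamCreate(&stream);

    if (rank == 0) {
        cudaMemset(dev_buf, 0, bytes);              /* placeholder payload */
        /* GPU -> CUDA host buffer */
        cudaMemcpyAsync(cuda_host_buf, dev_buf, bytes,
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);              /* copy must finish first */
        /* CUDA host buffer -> MPI buffer (the manual copy) */
        memcpy(mpi_buf, cuda_host_buf, bytes);
        MPI_Send(mpi_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(mpi_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        /* MPI buffer -> CUDA host buffer -> GPU */
        memcpy(cuda_host_buf, mpi_buf, bytes);
        cudaMemcpyAsync(dev_buf, cuda_host_buf, bytes,
                        cudaMemcpyHostToDevice, stream);
        cudaStreamSynchronize(stream);
    }

    cudaStreamDestroy(stream);
    free(mpi_buf);
    cudaFreeHost(cuda_host_buf);
    cudaFree(dev_buf);
    MPI_Finalize();
    return 0;
}
```

The point of the sketch is only that MPI never sees the buffer that CUDA allocated as pinned memory; the extra memcpy is the price for keeping the two apart.
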
On Jun 5, 2011, at 2:20 AM, Brice Goglin wrote:
> On 5 June 2011 at 00:15, Fengguang Song wrote:
>> I'm running into a problem when using OpenMPI 1.5.1 on a GPU cluster. My program uses MPI to exchange data
>> between nodes, and uses cudaMemcpyAsync to exchange data between the host and the GPU devices within a node.
>> When the MPI message size is less than 1MB, everything works fine. However, when the message size
>> is > 1MB, the program hangs (i.e., an MPI send never reaches its destination based on my trace).
>> The issue may be related to locked-memory contention between OpenMPI and CUDA.
>> Has anyone run into this problem and solved it? Which MCA parameters should I tune so that messages
>> larger than 1MB go through (to avoid the program hang)? Any help would be appreciated.
> I may have seen the same problem when testing GPUDirect. Do you use the
> same host buffer for copying from/to the GPU and for sending/receiving on
> the network? If so, you need a GPUDirect-enabled kernel and Mellanox
> drivers, but that only helps for messages below 1MB.
> You can work around the problem with one of the following solutions:
> * add --mca btl_openib_flags 304 to force OMPI to always send/recv
> through an intermediate (internal) buffer, but it will decrease
> performance below 1MB too
> * use different host buffers for the GPU and the network and manually
> copy between them
> I never got any reply from NVIDIA/Mellanox/here when I reported this
> problem with GPUDirect and messages larger than 1MB.