I am still seeing a problem where two processes send to the same receiving process and one of the senders blocks for 150 to 550 ms inside the call to MPI_Send.

Each process runs on a different node, and the receiving process posts its MPI_Irecv 17 ms before the send that hangs.

The posted receives use 172K buffers, and the senders send messages of 81K each.

I have set mpi_leave_pinned to 1 and have increased the btl_openib_receive_queues to …:S,65536,512,256,64

How can I trace the individual phases of message passing to find out where the send is stalling?