There is a significant improvement in the performance of non-blocking MPI
calls (over InfiniBand) from Open MPI version 1.4 to version 1.6.
I am comparing two methods to exchange messages between two nodes. The
message size varies from 1 MB to 1 GB.
The first method sends using MPI_Isend() and receives using MPI_Irecv().
The same buffers are used repeatedly to exchange messages between two
nodes. The buffers are allocated using malloc(). In the second method, the
buffers are allocated using MPI_Alloc_mem() and the send and receive are
initialized using MPI_Send_init() and MPI_Recv_init(). The sends and
receives are then posted using MPI_Start().
In version 1.4, the first method has a peak bidirectional bandwidth of 5.3
GB/s and the second method has a peak of 6.2 GB/s. In version 1.6, both
methods have peak bandwidth of 6.2 GB/s. The peak bandwidths are pretty
close to the number reported by ib_read_bw or ib_write_bw commands for
1. The first question is as follows: why does version 1.6 do nonblocking
Isend/Irecv better than version 1.4? I would assume that in the second
method, memory is pinned and registered during MPI_Alloc_mem() and the
transfers use RDMA direct.
In the first method, where the buffers are allocated using malloc(), I
would assume that RDMA pipelining is used. I emphasize that the
mpi_leave_pinned parameter has its default value of -1 and is turned off
in all the runs. I would expect some overhead due to registering and
unregistering memory during each Isend/Irecv, even though pipelining tries
to amortize the costs.
The numbers for version 1.4 are in line with this expectation. However, in
version 1.6 there seems to be no overhead at all due to
registering/unregistering memory. What is going on? Do large messages still
use RDMA pipelining? How has the RDMA pipeline been improved?
2. To send and receive a large message, Open MPI may choose between RDMA
write and RDMA read. If RDMA pipelining is used, it seems advantageous to
use RDMA write because some fragments use send/recv semantics. If the
memory is registered and the send/recv results in a single RDMA operation,
there seems to be nothing to choose between the two. Is that correct? If so,
does Open MPI use RDMA write or RDMA read?