I would recommend reading the following tech report; it should shed
some light on how these things work:
> 1 - It does not seem that mvapich does RDMA for small messages. It will
> do RDMA for any message that is too big to send eagerly, but the
> threshold is not that low and cannot be lowered to apply to 0-byte msgs
> anyway (nothing lower than 128bytes or so will work).
mvapich does do RDMA for small messages: it preallocates a buffer for
each peer and then polls each of these buffers for completion.
Take a look at the paper "High Performance RDMA-Based MPI
Implementation over InfiniBand" by Jiuxing Liu et al.
Also try compiling mvapich without -D RDMA_FAST_PATH; I am pretty sure
this is the flag that tells mvapich to compile with small-message RDMA.
Removing this flag will force mvapich to use send/recv.
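
To make the "preallocated buffer per peer, then poll" idea concrete,
here is a toy sketch in Python. It is not mvapich code: PeerBuffer,
rdma_write, poll and the 64-byte slot size are all hypothetical names
and values chosen for illustration, and the "RDMA write" is just a
local memory copy standing in for a one-sided write by the remote peer.

```python
# Toy model of a small-message RDMA fast path. NOT mvapich code:
# every name and size here is a hypothetical stand-in.
SLOT_SIZE = 64  # assumed slot size in bytes


class PeerBuffer:
    """Preallocated receive slot that exactly one remote peer may write into."""

    def __init__(self):
        self.data = bytearray(SLOT_SIZE)
        self.length = 0
        self.ready = False  # stands in for a flag the sender writes last


def rdma_write(buf, payload):
    """Simulate a one-sided write: the receiver takes no action here."""
    assert len(payload) <= SLOT_SIZE
    buf.data[:len(payload)] = payload
    buf.length = len(payload)
    buf.ready = True  # set after the payload so the poller sees complete data


def poll(buffers):
    """Scan every peer's buffer once and collect the messages that arrived."""
    arrived = []
    for peer, buf in buffers.items():
        if buf.ready:
            arrived.append((peer, bytes(buf.data[:buf.length])))
            buf.ready = False  # release the slot for reuse
    return arrived


buffers = {peer: PeerBuffer() for peer in range(4)}  # one buffer per peer
rdma_write(buffers[2], b"hello")
messages = poll(buffers)
print(messages)
```

Note that poll() has to look at every peer's buffer even though only
one peer sent anything; that is the part that matters for scalability.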
> 2 - I do not see that there is any raw performance benefit in insisting
> on doing rdma for small messages anyway, so it does not seem to be a
> tradeoff between scalability and optimal latency. In fact, if I force
> ompi or mvapich to go rdma for smaller messages (at least as far as it
> seems it will go) the latency for these sizes will actually go up, which
> does not hurt my intuition. In mvapich I saw an incompressible 13 us
> penalty for doing RDMA.
What you are seeing is a general RDMA protocol, which requires that the
initiator obtain the target's memory address and r-key prior to the RDMA
operation. Additionally, the initiator must inform the target of
completion of the RDMA operation. This requires the overhead of control
messages using either send/receive or small-message RDMA.
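
As a rough sketch of where that overhead comes from, the Python
snippet below just lists the messages each path puts on the wire and
counts the control ones. The step names are my own labels, and the
initial request step is an assumption about how the initiator asks for
the address and r-key; real implementations may order or combine these
differently.

```python
# Message-count sketch only; step names are illustrative labels,
# not the wire protocol of any particular MPI implementation.

def eager_send():
    """Send/receive path: the payload travels in a single message."""
    return ["DATA send"]


def rdma_rendezvous():
    """General RDMA path as described above, one entry per message."""
    return [
        "CTRL request",            # assumed: initiator asks where to write
        "CTRL address and r-key",  # target replies with buffer address and r-key
        "DATA rdma write",         # the actual payload transfer
        "CTRL completion",         # initiator tells the target it is done
    ]


def control_messages(steps):
    """Count the messages that carry no payload."""
    return sum(1 for step in steps if step.startswith("CTRL"))


eager_ctrl = control_messages(eager_send())
rendezvous_ctrl = control_messages(rdma_rendezvous())
print(eager_ctrl, rendezvous_ctrl)
```

For a large message the extra control traffic is amortized over the
payload; for a tiny message it dominates, which is consistent with
latency going up when you force this path for small sizes.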
> So far, the best latency I got from ompi is 5.24 us, and the best I
> got from mvapich is 3.15.
> I am perfectly ready to accept that ompi scales better and that may be
> more important (except to the marketing dept :-) ), but I do not
> understand your explanation based on small-message RDMA. Either I
> misunderstood something badly (my best guess), or the 2 us are lost to
> something other than an RDMA-size tradeoff.
Again, this is small-message RDMA with polling versus send/receive
semantics. We will be adding small-message RDMA and should then have
performance equal to that of mvapich for small messages, but it is only
relevant for a small working set of peers / micro-benchmarks.
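
To put the "small working set of peers" point in numbers, here is a
back-of-the-envelope sketch in Python. The function name and the
defaults (32 slots per peer, 2048 bytes per slot) are made-up
illustrative values, not the settings of mvapich or Open MPI; the only
thing taken from the discussion above is that there is one preallocated
buffer per peer and that each of them gets polled.

```python
def fast_path_cost(num_peers, slots_per_peer=32, slot_bytes=2048):
    """Return (buffers polled per progress pass, bytes preallocated).

    slots_per_peer and slot_bytes are hypothetical defaults.
    """
    polled = num_peers  # every peer's buffer is checked on each pass
    memory = num_peers * slots_per_peer * slot_bytes
    return polled, memory


costs = {n: fast_path_cost(n) for n in (2, 64, 1024)}
for n, (polled, memory) in costs.items():
    print(f"{n} peers: poll {polled} buffers, {memory} bytes preallocated")
```

Both quantities grow linearly with the number of peers, so a two-process
latency benchmark sees the best case of this scheme.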