On Apr 27, 2010, at 10:20, Sylvain Jeaugey wrote:
> Hi list,
> I'm currently working on IB bandwidth improvements, and maybe some of you can help me understand a few things. I'm trying to align every IB RDMA operation to 64 bytes, because unaligned operations can hurt performance anywhere from slightly to very badly, depending on your architecture.
> So, I'm trying to understand the RDMA protocol (PUT and GET), and here is what I understood:
> * if we have one btl, RDMA is performed with only one GET operation; otherwise, we use multiple PUT operations. I can understand that the GET operation improves asynchronous progress. So, why not always use GET operations?
Because nobody had the time to implement the pipelined GET protocol.
> * if mpi_leave_pinned is 0, things get stranger. We start a rendezvous (not RDMA) with a size equal to the eager limit, then we switch to RDMA because the remote peer asks for RDMA PUTs (even if btl_openib_flags does not include the PUT operation, btw). Why this corner case? Why not start a normal RDMA (especially since we switch back to RDMA afterwards)?
I guess you just found a bug. In fact, the protocol is a little more complex: eager, RDMA, and send/recv. A small amount of data at the end of the buffer is sent using copy in/copy out. Originally this was done on the data right after the eager fragment, but because of some "well known" issues on IB (something related to fork; Jeff can give you more details here), we moved it to the end.
> * the openib btl has a "buffer alignment" parameter. Fantastic, just what I needed. Unfortunately, I can't see where it is used (and indeed performance is bad if my buffers are not aligned to 64 bytes). Am I missing something?
No comments ...
> * I did a prototype that splits GET operations in openib into two operations: a small one to correct the buffer alignment and a big aligned one. It would certainly be better to perform the first one with a normal send/recv, but for the prototype, doing it inside the openib GET was simpler. Performance on unaligned buffers is much better (but this is just a prototype). Is anyone working on this right now, or should I continue my effort to make it clean and stable?
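For reference, the split described above can be sketched roughly as follows. This is only an illustration and not the prototype's code: `get_split_t` and `split_get` are hypothetical names, nothing here is an Open MPI or verbs API, and the sketch looks at a single address, whereas a real GET has to consider both the local and the remote buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: lengths of the two pieces of one GET. */
typedef struct {
    size_t head_len; /* small, unaligned piece at the start of the buffer */
    size_t body_len; /* remainder, which starts on an aligned address     */
} get_split_t;

/*
 * Split a transfer of `len` bytes starting at `addr` into a short head
 * that reaches the next `align` boundary and an aligned body.
 * `align` must be a power of two (64 for the IB case discussed here).
 */
static get_split_t split_get(uintptr_t addr, size_t len, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    size_t misalign = (size_t)(addr & (align - 1));
    size_t head = misalign ? align - misalign : 0;

    /* A message shorter than the gap is sent entirely as the head. */
    if (head > len) {
        head = len;
    }

    get_split_t s = { head, len - head };
    return s;
}
```

With a 64-byte alignment, a 1000-byte buffer at address 0x1003 would be split into a 61-byte head and a 939-byte aligned body; an already aligned buffer gets an empty head.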
This can easily be done internally in the IB BTL, without any support from the upper layer. What I would like to have is a more generic solution, as I think all BTLs are affected by unaligned buffers for RDMA operations. My idea is to modify the way we deal with the eager fragment, so that we can recompute the eager size based on the alignment we want for the next RMA operation. For IB it might be 64 bytes, but for SM it is 4K...
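To make the idea concrete, here is a minimal sketch of how the eager size could be recomputed so that the data following the eager fragment starts on an aligned address. The function name `aligned_eager_size` and its parameters are hypothetical, not existing Open MPI code, and falling back to the nominal limit when no boundary fits is just one possible policy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: given the start address of the user buffer, the
 * nominal eager limit, and the alignment wanted for the next RMA
 * operation (a power of two, e.g. 64 for IB or 4096 for SM), return an
 * eager size no larger than the limit such that `base + eager size`
 * falls on an `align` boundary.
 */
static size_t aligned_eager_size(uintptr_t base, size_t eager_limit,
                                 size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    /* Round the nominal end of the eager fragment down to a boundary. */
    uintptr_t end = (base + eager_limit) & ~((uintptr_t)align - 1);

    /* No boundary lies strictly inside the fragment: keep the limit. */
    if (end <= base) {
        return eager_limit;
    }
    return (size_t)(end - base);
}
```

For a buffer at 0x1003 and a 12288-byte eager limit, this gives 12285 bytes for both 64-byte and 4K alignment, so the remaining data starts at 0x4000.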
> Thanks in advance for any feedback,