Jeff Squyres wrote:
On Nov 29, 2010, at 3:51 AM, Doron Shoham wrote:

  
If only the PUT flag is set and/or the btl supports only the PUT method, then the sender will allocate a rendezvous header and will not eagerly send any data. The receiver will schedule RDMA PUT(s) of the entire message.
It is done in mca_pml_ob1_recv_request_schedule_once()
(ompi/mca/pml/ob1/pml_ob1_recvreq.c:683).
We can limit the size passed to mca_bml_base_prepare_dst() to the minimum of the btl.max_message_size supported by the HCA and the actual message size.
This will result in fragmentation of the RDMA write messages.
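
To make the fragmentation concrete, here is a minimal sketch in plain C. It is not Open MPI code: next_put_size() and count_put_fragments() are hypothetical helpers, and hca_max_msg_sz stands for whatever limit ends up being used.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not an Open MPI function: size of the next RDMA
 * write, capped at the largest message the HCA accepts. */
static size_t next_put_size(size_t bytes_remaining, size_t hca_max_msg_sz)
{
    assert(hca_max_msg_sz > 0);
    return bytes_remaining < hca_max_msg_sz ? bytes_remaining : hca_max_msg_sz;
}

/* Count how many RDMA writes a message of msg_len bytes is split into
 * when every write is capped by next_put_size(). */
static size_t count_put_fragments(size_t msg_len, size_t hca_max_msg_sz)
{
    size_t fragments = 0;
    while (msg_len > 0) {
        msg_len -= next_put_size(msg_len, hca_max_msg_sz);
        fragments++;
    }
    return fragments;
}
```

With a limit of 4 bytes, a 10-byte message would go out as three writes of 4, 4 and 2 bytes.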
    

I would think that we should set btl.max_message_size during init to be the minimum of the MCA param and the max supported by the HCA, right?  Then there's no need for this min() in the critical path.

Additionally, the message must be smaller than the max message size of *both* HCAs, right?  So it might be necessary to add the max message size into the openib BTL modex data so that you can use it in mca_bml_base_prepare_dst() (or whatever -- been a long time since I've mucked around in there...) to compute the min between the two peers.

So you might still need a min, but for a different reason than what you originally mentioned.
  
It is my mistake - btl.max_message_size is a different parameter. It is more of a software limitation than a hardware limitation from the HCA.
I don't think it has any meaning in the RDMA flow.

Can you please explain a bit more about the openib BTL modex?

  
The bigger problem is when using the GET flow.
In this flow the receiver allocates one big buffer to receive the message with an RDMA read in one chunk.
There is no fragmentation mechanism in this flow, which makes it harder to solve this issue.
    

Doh.  I'm afraid I don't know why this was done this way originally...

  
Reading the max message size supported by the HCA can be done by using verbs.
 
The second approach is to use RDMA direct only if the message size is smaller than the max message size supported by the HCA.
 
Here is where the long message protocol is chosen:
ompi/mca/pml/ob1/pml_ob1_sendreq.h line 382.
 
We could use the second approach until a fragmentation mechanism is added to the RDMA direct GET flow.
    

Are you suggesting that pml_ob1_sendreq.h:382 compare the message length to the btl.max_message_size and choose RDMA direct vs. RDMA pipelined?  If so, that might be sufficient.

But what to do about the peer's max message size?

  

I thought of a different approach:
Instead of limiting the size passed to mca_bml_base_prepare_dst(), we can limit the size in mca_btl_openib_prepare_dst().
I believe this is a better solution because it only affects the internal behavior of the openib btl.
In mca_btl_openib_prepare_dst() we have access to both max_msg_sz (local and endpoint).
This will fix the PUT flow.
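
Roughly, the clamp inside prepare_dst could look like the sketch below. This is only an illustration under assumptions: struct endpoint_limits and clamp_dst_size() are hypothetical names, and the real openib BTL structures keep these values elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two limits visible in prepare_dst;
 * the field names are illustrative, not the openib BTL's layout. */
struct endpoint_limits {
    uint32_t local_max_msg_sz;   /* limit of the local HCA */
    uint32_t remote_max_msg_sz;  /* limit of the peer's HCA */
};

/* Clamp the requested destination size to the smaller of the two limits,
 * so a single RDMA write never exceeds what either HCA accepts. */
static size_t clamp_dst_size(size_t requested, const struct endpoint_limits *lim)
{
    uint32_t limit = lim->local_max_msg_sz < lim->remote_max_msg_sz
                         ? lim->local_max_msg_sz
                         : lim->remote_max_msg_sz;
    assert(limit > 0);
    return requested < limit ? requested : (size_t)limit;
}
```

The caller would then schedule the remaining bytes in further PUTs, so nothing above the BTL has to know about the limit.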

For the GET flow, maybe we should check in mca_pml_ob1_send_request_start_rdma() -
if the message size is larger than the minimum of both endpoints' max_msg_sz, force it to use the PUT flow.
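
As a sketch of that check (the enum and choose_rdma_flow() are hypothetical, and passing both limits in as plain arguments glosses over how the PML would actually obtain them):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for the two long-message flows discussed here. */
enum rdma_flow { FLOW_GET, FLOW_PUT };

/* Keep the GET flow only when the whole message fits in a single RDMA read
 * on both HCAs; otherwise fall back to PUT, which can be fragmented. */
static enum rdma_flow choose_rdma_flow(size_t msg_len,
                                       uint32_t local_max_msg_sz,
                                       uint32_t remote_max_msg_sz)
{
    uint32_t limit = local_max_msg_sz < remote_max_msg_sz ? local_max_msg_sz
                                                          : remote_max_msg_sz;
    return msg_len > (size_t)limit ? FLOW_PUT : FLOW_GET;
}
```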

The problem is that I'm not sure how to do it without an *ugly hack*.
We need to access the openib btl parameters from mca_pml_ob1_send_request_start_rdma().

The second option is to do it from pml_ob1_sendreq.h:382, but then again, it requires access to the openib btl parameters...

Any thoughts?

Thanks,
Doron