Jeff Squyres wrote:
> On Nov 29, 2010, at 3:51 AM, Doron Shoham wrote:
>
>
>> If only the PUT flag is set and/or the btl supports only PUT method then the sender will allocate a rendezvous header and will not eager send any data. The receiver will schedule rdma PUT(s) of the entire message.
>> It is done in mca_pml_ob1_recv_request_schedule_once()
>> (ompi/mca/pml/ob1/pml_ob1_recvreq.c:683).
>> We can limit the size passing to mca_bml_base_prepare_dst() to be minimum between btl.max_message_size supported by the HCA and the actual message size.
>> The will result a fragmentation of the RDMA write messages.
>>
>
> I would think that we should set btl.max_message_size during init to be the minimum of the MCA param and the max supported by the HCA, right? Then there's no need for this min() in the critical path.
>
> Additionally, the message must be smaller than the max message size of *both* HCAs, right? So it might be necessary to add the max message size into the openib BTL modex data so that you can use it in mca_bml_base_prepare_dst() (or whatever -- been a long time since I've mucked around in there...) to compute the min between the two peers.
>
> So you might still need a min, but for a different reason than what you originally mentioned.
>
It is my mistake - the btl.max_message_size is a different parameter. It
is more like software limitation rather then hardware limitation from
the HCA.
I don't think that in RDMA flow it has any meaning.
Can you please explain a bit more about the openib BTL modex?
>
>> The bigger problem is when using the GET flow.
>> In this flow the receiver allocate one big buffer to receive the message with RDMA read in one chunk.
>> There is no fragmentation mechanism in this flow which make it harder to solve this issue
>>
>
> Doh. I'm afraid I don't know why this was done this way originally...
>
>
>> Reading the max message size supported by the HCA can be done by using verbs.
>>
>> The second approach is to use RDMA direct only if the message size is smaller than the max message size supported by the HCA.
>>
>> Here is where the long message protocol is chosen:
>> ompi/mca/pml/ob1/pml_ob1_sendreq.h line 382.
>>
>> We could use the second approach until a fragmentation mechanism will be added to the RDMA direct GET flow.
>>
>
> Are you suggesting that pml_ob1_sendreq.h:382 compare the message length to the btl.max_message_size and choose RDMA direct vs. RDMA pipelined? If so, that might be sufficient.
>
> But what to do about the peer's max message size?
>
>
I thought of a different approach:
Instead of limiting the passing to the mca_bml_base_prepare_dst(), we
can limit the size in mca_btl_openib_prepare_dst().
I believe this is better solution because it only effects the internal
behavior of the openib btl.
In mca_btl_openib_prepare_dst() we have access to both max_msg_sz (local
and endpoint).
This will fix the PUT flow.
For the GET flow, maybe we should check in
mca_pml_ob1_send_request_start_rdma() -
if the message size is larger then the minimum between both endpoints'
max_msg_sz force it use the PUT flow.
The problem is that I'm not sure how to do it without an *ugly hack*.
We need to to access the openib btl parameters from the
mca_pml_ob1_send_request_start_rdma().
The second options it to do it from pml_ob1_sendreq.h:382, but then
again, it requires access to the openib btl parameters...
Any thoughts?
Thanks,
Doron
|