Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Sending large messages over RDMA fails
From: Doron Shoham (dorons_at_[hidden])
Date: 2010-12-05 11:28:05

Jeff Squyres wrote:
> On Nov 29, 2010, at 3:51 AM, Doron Shoham wrote:
>> If only the PUT flag is set and/or the btl supports only PUT method then the sender will allocate a rendezvous header and will not eager send any data. The receiver will schedule rdma PUT(s) of the entire message.
>> It is done in mca_pml_ob1_recv_request_schedule_once()
>> (ompi/mca/pml/ob1/pml_ob1_recvreq.c:683).
>> We can limit the size passed to mca_bml_base_prepare_dst() to be the minimum of the btl.max_message_size supported by the HCA and the actual message size.
>> This will result in fragmentation of the RDMA write messages.
> I would think that we should set btl.max_message_size during init to be the minimum of the MCA param and the max supported by the HCA, right? Then there's no need for this min() in the critical path.
> Additionally, the message must be smaller than the max message size of *both* HCAs, right? So it might be necessary to add the max message size into the openib BTL modex data so that you can use it in mca_bml_base_prepare_dst() (or whatever -- been a long time since I've mucked around in there...) to compute the min between the two peers.
> So you might still need a min, but for a different reason than what you originally mentioned.
My mistake - btl.max_message_size is a different parameter. It
is more of a software limitation than a hardware limitation imposed by
the HCA.
I don't think it has any meaning in the RDMA flow.

Can you please explain a bit more about the openib BTL modex?

>> The bigger problem is when using the GET flow.
>> In this flow the receiver allocates one big buffer and receives the message with a single RDMA read.
>> There is no fragmentation mechanism in this flow, which makes this issue harder to solve.
> Doh. I'm afraid I don't know why this was done this way originally...
>> Reading the max message size supported by the HCA can be done by using verbs.
>> The second approach is to use RDMA direct only if the message size is smaller than the max message size supported by the HCA.
>> Here is where the long message protocol is chosen:
>> ompi/mca/pml/ob1/pml_ob1_sendreq.h line 382.
>> We could use the second approach until a fragmentation mechanism is added to the RDMA direct GET flow.
> Are you suggesting that pml_ob1_sendreq.h:382 compare the message length to the btl.max_message_size and choose RDMA direct vs. RDMA pipelined? If so, that might be sufficient.
> But what to do about the peer's max message size?

I thought of a different approach:
Instead of limiting the size passed to mca_bml_base_prepare_dst(), we
can limit the size in mca_btl_openib_prepare_dst().
I believe this is a better solution because it only affects the internal
behavior of the openib BTL.
In mca_btl_openib_prepare_dst() we have access to both max_msg_sz values
(local and endpoint).
This will fix the PUT flow.
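A minimal sketch of what I mean (the struct and function names here are made up for illustration - only the min-of-both-limits clamp is the point):

```c
#include <stddef.h>

/* Hypothetical stand-ins for the local HCA limit and the peer's limit
 * as they would be visible inside mca_btl_openib_prepare_dst(). */
typedef struct {
    size_t max_msg_sz;       /* local HCA max RDMA message size */
} fake_openib_module_t;

typedef struct {
    size_t rem_max_msg_sz;   /* peer's max, learned at connection setup */
} fake_endpoint_t;

/* Clamp the segment size prepared for an RDMA PUT so it never exceeds
 * what either side's HCA can handle in a single work request. */
static size_t clamp_put_size(const fake_openib_module_t *btl,
                             const fake_endpoint_t *ep,
                             size_t requested)
{
    size_t limit = btl->max_msg_sz < ep->rem_max_msg_sz
                       ? btl->max_msg_sz : ep->rem_max_msg_sz;
    return requested < limit ? requested : limit;
}
```

If I read mca_pml_ob1_recv_request_schedule_once() correctly, the scheduling loop already copes with prepare returning a smaller segment than requested, so the remainder would simply be scheduled as additional PUTs.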

For the GET flow, maybe we should check in
mca_pml_ob1_send_request_start_rdma():
if the message size is larger than the minimum of both endpoints'
max_msg_sz, force it to use the PUT flow.
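Something along these lines - again just a sketch, with made-up names; only the protocol-selection idea is real:

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch of the decision I am proposing: fall back from the one-shot
 * RDMA GET protocol to the (fragmenting) PUT pipeline whenever the
 * message does not fit in a single RDMA read on either side. */
static bool use_rdma_get(size_t msg_size,
                         size_t local_max_msg_sz,
                         size_t remote_max_msg_sz)
{
    size_t limit = local_max_msg_sz < remote_max_msg_sz
                       ? local_max_msg_sz : remote_max_msg_sz;
    return msg_size <= limit;   /* false => force the PUT flow */
}
```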

The problem is that I'm not sure how to do it without an *ugly hack*.
We need to access the openib BTL parameters from the PML.

The second option is to do it from pml_ob1_sendreq.h:382, but then
again, that also requires access to the openib BTL parameters...
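One way to avoid the layering violation might be to export the limit through the generic BTL module interface instead of having the PML reach into openib internals. Purely a sketch - no such field exists today:

```c
#include <stddef.h>

/* Hypothetical: if the base BTL module struct carried a
 * btl_max_rdma_size field (0 meaning "no limit"), the PML could read
 * it without knowing anything openib-specific.  BTLs without a
 * hardware limit would just leave it at 0. */
typedef struct {
    size_t btl_max_rdma_size;   /* 0 => unlimited */
} fake_btl_base_module_t;

static size_t effective_rdma_limit(const fake_btl_base_module_t *btl)
{
    return btl->btl_max_rdma_size ? btl->btl_max_rdma_size
                                  : (size_t)-1;   /* SIZE_MAX */
}
```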

Any thoughts?