
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Sending large messages over RDMA fails
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-12-07 21:21:09


Doron --

I chatted with George about this today (we're both at the Forum together this week). We're in this situation because of some complicated history.

1. At one time, both PUT and GET protocols worked fine in OB1.
2. PUT was the default.
3. Over time, GET got broken (because it was rarely used).
4. Someone eventually fixed the GET protocol, but did not implement pipelining.

Hence, the problem is in the OB1 PML: the PUT protocol has pipelining implemented, but the GET protocol does not. More specifically, PUT goes something like this:

1. MPI_SEND is invoked with a 10GB message
2. ...some setup and intermediate stuff...
3. OB1 send calls a back-end BTL's PUT with the buffer pointer and a length of 10GB
4. The BTL does a PUT of the largest message it can (e.g., 2GB) and returns back up to OB1 saying, "Sorry, I was only able to PUT 2GB"
5. OB1 then falls back to the pipelined protocol
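As a rough illustration of steps 3-5, here is a toy model in C. It is not OB1 or BTL code: `toy_btl_t`, `toy_btl_put()` and `toy_ob1_send()` are hypothetical names invented for this sketch, and the real pipelined protocol does considerably more (registration, pipeline depth, completion callbacks). It only shows the shape of "try the whole thing, get told how much actually went, then keep going in pieces".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a BTL module; NOT the real Open MPI structure. */
typedef struct {
    uint64_t max_rdma_size;  /* largest single RDMA operation, e.g. 2GB */
    uint64_t ops_issued;     /* number of PUT operations issued so far */
    uint64_t bytes_moved;    /* total bytes moved so far */
} toy_btl_t;

/* Toy PUT: moves at most max_rdma_size bytes and reports how many it
 * actually moved ("Sorry, I was only able to PUT 2GB"). */
static uint64_t toy_btl_put(toy_btl_t *btl, uint64_t offset, uint64_t len)
{
    (void)offset;  /* a real PUT would use this to locate the data */
    uint64_t n = len < btl->max_rdma_size ? len : btl->max_rdma_size;
    btl->ops_issued++;
    btl->bytes_moved += n;
    return n;
}

/* Toy send: ask for the whole message in one PUT; if the BTL comes back
 * short, fall back to sending the remainder piece by piece.
 * Returns the number of PUT operations that were needed. */
static uint64_t toy_ob1_send(toy_btl_t *btl, uint64_t msg_len)
{
    assert(btl->max_rdma_size > 0);
    uint64_t first_op = btl->ops_issued;

    uint64_t done = toy_btl_put(btl, 0, msg_len);        /* step 3 */
    while (done < msg_len) {                             /* steps 4-5 */
        done += toy_btl_put(btl, done, msg_len - done);
    }
    return btl->ops_issued - first_op;
}
```

With `max_rdma_size` set to 2GB and a 10GB message, this toy send ends up issuing five PUTs rather than failing on the first one.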

The GET protocol won't fall back to pipelining -- as I understand it (and George please correct me if this is wrong), that code simply doesn't exist at this point.

So I think there are two options for going forward:

A. add the GET pipelining code to OB1 (probably similar to the PUT scheme; let the BTL fail and say "I was only able to GET 2GB...", etc.).

B. disable the GET protocol (maybe only in the openib BTL...?). Only openib and GM cared about GET/PUT in ob1 and gm is long dead.

I think that A is preferable because the ob1 GET protocol can take advantage of hardware-accelerated RDMA GET. PUT, by contrast, involves the sender's OB1 stack, which means additional latency: not only because the sender's OB1 is involved, but also because the sender may not be in OB1 when the receiver's CTS arrives.

Make sense?

On Dec 5, 2010, at 8:28 AM, Doron Shoham wrote:

> Jeff Squyres wrote:
>> On Nov 29, 2010, at 3:51 AM, Doron Shoham wrote:
>>
>>
>>
>>> If only the PUT flag is set and/or the btl supports only the PUT method, then the sender will allocate a rendezvous header and will not eager send any data. The receiver will schedule RDMA PUT(s) of the entire message.
>>> It is done in mca_pml_ob1_recv_request_schedule_once()
>>> (ompi/mca/pml/ob1/pml_ob1_recvreq.c:683).
>>> We can limit the size passed to mca_bml_base_prepare_dst() to the minimum of the btl.max_message_size supported by the HCA and the actual message size.
>>> This will result in fragmentation of the RDMA write messages.
>>>
>>>
>>
>> I would think that we should set btl.max_message_size during init to be the minimum of the MCA param and the max supported by the HCA, right? Then there's no need for this min() in the critical path.
>>
>> Additionally, the message must be smaller than the max message size of *both* HCAs, right? So it might be necessary to add the max message size into the openib BTL modex data so that you can use it in mca_bml_base_prepare_dst() (or whatever -- been a long time since I've mucked around in there...) to compute the min between the two peers.
>>
>> So you might still need a min, but for a different reason than what you originally mentioned.
>>
>>
> It is my mistake - the btl.max_message_size is a different parameter. It is more of a software limitation than a hardware limitation of the HCA.
> I don't think it has any meaning in the RDMA flow.
>
> Can you please explain a bit more about the openib BTL modex?
>
>>
>>> The bigger problem is when using the GET flow.
>>> In this flow the receiver allocates one big buffer to receive the message with an RDMA read in one chunk.
>>> There is no fragmentation mechanism in this flow, which makes it harder to solve this issue.
>>>
>>>
>>
>> Doh. I'm afraid I don't know why this was done this way originally...
>>
>>
>>
>>> Reading the max message size supported by the HCA can be done by using verbs.
>>>
>>> The second approach is to use RDMA direct only if the message size is smaller than the max message size supported by the HCA.
>>>
>>> Here is where the long message protocol is chosen:
>>> ompi/mca/pml/ob1/pml_ob1_sendreq.h line 382.
>>>
>>> We could use the second approach until a fragmentation mechanism is added to the RDMA direct GET flow.
>>>
>>>
>>
>> Are you suggesting that pml_ob1_sendreq.h:382 compare the message length to the btl.max_message_size and choose RDMA direct vs. RDMA pipelined? If so, that might be sufficient.
>>
>> But what to do about the peer's max message size?
>>
>>
>>
>
> I thought of a different approach:
> Instead of limiting the size passed to mca_bml_base_prepare_dst(), we can limit the size in mca_btl_openib_prepare_dst().
> I believe this is a better solution because it only affects the internal behavior of the openib btl.
> In mca_btl_openib_prepare_dst() we have access to both max_msg_sz values (local and endpoint).
> This will fix the PUT flow.
>
> For the GET flow, maybe we should check in mca_pml_ob1_send_request_start_rdma():
> if the message size is larger than the minimum of both endpoints' max_msg_sz, force it to use the PUT flow.
>
> The problem is that I'm not sure how to do it without an *ugly hack*.
> We need to access the openib btl parameters from mca_pml_ob1_send_request_start_rdma().
>
> The second option is to do it from pml_ob1_sendreq.h:382, but then again, it requires access to the openib btl parameters...
>
> Any thoughts?
>
> Thanks,
> Doron
>
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/