
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Sending large messages over RDMA fails
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-11-30 15:30:36

On Nov 29, 2010, at 3:51 AM, Doron Shoham wrote:

> If only the PUT flag is set and/or the BTL supports only the PUT method, then the sender allocates a rendezvous header and does not eagerly send any data. The receiver then schedules RDMA PUT(s) of the entire message.
> This is done in mca_pml_ob1_recv_request_schedule_once()
> (ompi/mca/pml/ob1/pml_ob1_recvreq.c:683).
> We can limit the size passed to mca_bml_base_prepare_dst() to the minimum of the btl.max_message_size supported by the HCA and the actual message size.
> This will result in fragmentation of the RDMA write messages.
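
To make the fragmentation idea concrete, here is a minimal sketch of how the receiver could cut one large PUT into pieces no bigger than the limit. It is not the ob1 code: next_put_length() and count_puts() are hypothetical names, and max_frag stands in for whatever limit ends up in btl.max_message_size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length of the next RDMA write when `offset` bytes of a
 * `total`-byte message have already been scheduled and no single write may
 * exceed `max_frag` bytes. */
static size_t next_put_length(size_t total, size_t offset, size_t max_frag)
{
    assert(max_frag > 0 && offset <= total);
    size_t remaining = total - offset;
    return remaining < max_frag ? remaining : max_frag;
}

/* Hypothetical helper: number of RDMA writes needed for the whole message.
 * A real scheduler would post one PUT per loop iteration instead of counting. */
static size_t count_puts(size_t total, size_t max_frag)
{
    size_t offset = 0, count = 0;
    while (offset < total) {
        offset += next_put_length(total, offset, max_frag);
        count++;
    }
    return count;
}
```

With a 10-byte message and a 4-byte limit this yields writes of 4, 4, and 2 bytes.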

I would think that we should set btl.max_message_size during init to be the minimum of the MCA param and the max supported by the HCA, right? Then there's no need for this min() in the critical path.
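
Roughly what I have in mind, as a sketch only. struct btl_sketch and btl_clamp_max_message_size() are made-up stand-ins, not the real openib BTL types. In real code the HCA limit would presumably come from the max_msg_sz field of struct ibv_port_attr as filled in by ibv_query_port(); here it is passed in as a plain integer so the example builds without libibverbs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the BTL module; only the field discussed here. */
struct btl_sketch {
    size_t max_message_size;   /* starts out as the value of the MCA param */
};

/* Called once during init: lower the configured limit to what the HCA
 * reports, so that no min() is needed later when scheduling fragments. */
static void btl_clamp_max_message_size(struct btl_sketch *btl,
                                       uint32_t hca_max_msg_sz)
{
    assert(btl != NULL);
    if ((size_t)hca_max_msg_sz < btl->max_message_size) {
        btl->max_message_size = (size_t)hca_max_msg_sz;
    }
}
```

After this runs, everything downstream can read btl.max_message_size as is.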

Additionally, the message must be smaller than the max message size of *both* HCAs, right? So it might be necessary to add the max message size into the openib BTL modex data so that you can use it in mca_bml_base_prepare_dst() (or whatever -- been a long time since I've mucked around in there...) to compute the min between the two peers.

So you might still need a min, but for a different reason than what you originally mentioned.

> The bigger problem is when using the GET flow.
> In this flow the receiver allocates one big buffer to receive the message with an RDMA read in one chunk.
> There is no fragmentation mechanism in this flow, which makes it harder to solve this issue.

Doh. I'm afraid I don't know why this was done this way originally...

> Reading the max message size supported by the HCA can be done by using verbs.
> The second approach is to use RDMA direct only if the message size is smaller than the max message size supported by the HCA.
> Here is where the long message protocol is chosen:
> ompi/mca/pml/ob1/pml_ob1_sendreq.h line 382.
> We could use the second approach until a fragmentation mechanism is added to the RDMA direct GET flow.

Are you suggesting that pml_ob1_sendreq.h:382 compare the message length to the btl.max_message_size and choose RDMA direct vs. RDMA pipelined? If so, that might be sufficient.

But what to do about the peer's max message size?
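
For the sake of discussion, the check could look something like the sketch below if the peer's limit were available at that point (for example via the modex, as mentioned above). The enum and choose_long_protocol() are hypothetical names, not what is in pml_ob1_sendreq.h, and treating a message exactly at the limit as acceptable is an assumption on my part.

```c
#include <stddef.h>

/* Hypothetical labels for the two long-message protocols discussed here. */
enum long_protocol {
    LONG_RDMA_DIRECT,     /* one RDMA operation for the whole message */
    LONG_RDMA_PIPELINED   /* message is sent as a series of fragments */
};

/* Use RDMA direct only when the message fits within the limits of both
 * HCAs; otherwise fall back to the pipelined protocol. */
static enum long_protocol choose_long_protocol(size_t msg_len,
                                               size_t local_max,
                                               size_t peer_max)
{
    size_t limit = local_max < peer_max ? local_max : peer_max;
    return msg_len <= limit ? LONG_RDMA_DIRECT : LONG_RDMA_PIPELINED;
}
```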

Jeff Squyres