Hi,

 

The maximum message size of ConnectX HCAs is 1GB (older cards have a maximum of 2GB).

Trying to send larger messages over the RDMA direct protocol fails.

 

A reminder: RDMA direct is used if RDMA writes or reads are allowed by |btl_openib_flags| and the sender's message is already registered, either through the |mpi_leave_pinned| MCA parameter <http://www.open-mpi.org/faq/?category=openfabrics#large-message-leave-pinned> or because the buffer was allocated via MPI_ALLOC_MEM.

 

I've opened two tickets on this issue (for 1.4.4 and 1.5.2):

1.4.4: https://svn.open-mpi.org/trac/ompi/ticket/2623

1.5.2: https://svn.open-mpi.org/trac/ompi/ticket/2627

 

To check the maximum message size supported by the HCA, run:

 

ibv_devinfo -v |grep max_msg_sz

max_msg_sz:             0x40000000
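
For scripting this check, here is a minimal sketch in C that turns such an output line into a byte count. The helper name parse_max_msg_sz is hypothetical and only for illustration; it assumes the "name: value" layout shown above.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of Open MPI or libibverbs): extract the
 * numeric value from a line such as "max_msg_sz:   0x40000000".
 * Returns 0 on success and -1 if no value could be parsed. */
static int parse_max_msg_sz(const char *line, uint64_t *out)
{
    const char *colon;
    char *end;
    unsigned long long value;

    assert(line != NULL && out != NULL);

    colon = strchr(line, ':');
    if (colon == NULL)
        return -1;

    errno = 0;
    /* Base 0 lets strtoull accept the 0x prefix; leading blanks are skipped. */
    value = strtoull(colon + 1, &end, 0);
    if (end == colon + 1 || errno != 0)
        return -1;

    *out = (uint64_t)value;
    return 0;
}
```

For the line above this yields 1073741824 bytes, i.e. the 1GB limit.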

 

Attached is a simple program which uses Isend and Irecv to send a message larger than the max message size.

The output of this program is:

 

[[10761,1],1][../../../../../ompi/mca/btl/openib/btl_openib_component.c:3330:handle_wc]

from boo4 to: boo3 error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 4d142e0 opcode 32767  vendor error 105 qp_idx 3

 

With the RDMA direct protocol we need to distinguish between the GET and PUT flows.

If both flags (PUT and GET) are set in btl_openib_flags (which they are by default), the GET flow is used.

 

If only the PUT flag is set, or the btl supports only the PUT method, the sender allocates a rendezvous header and does not eagerly send any data. The receiver then schedules RDMA PUT(s) of the entire message.

This is done in mca_pml_ob1_recv_request_schedule_once() (ompi/mca/pml/ob1/pml_ob1_recvreq.c:683).

We can limit the size passed to mca_bml_base_prepare_dst() to the minimum of the btl.max_message_size supported by the HCA and the actual message size.

This will result in fragmentation of the RDMA write messages.
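
To illustrate the idea, here is a minimal sketch of the size clamping and the resulting number of RDMA writes. It is not the ob1 scheduling code; next_frag_size and count_put_frags are hypothetical helpers, and the loop assumes each fragment transfers the clamped size in full.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: size of the next RDMA write, i.e. the minimum of
 * the bytes still to transfer and the HCA's max message size. */
static uint64_t next_frag_size(uint64_t bytes_remaining, uint64_t max_msg_sz)
{
    assert(max_msg_sz > 0);
    return bytes_remaining < max_msg_sz ? bytes_remaining : max_msg_sz;
}

/* Hypothetical helper: number of RDMA writes needed for a message of
 * msg_size bytes when every write is clamped to max_msg_sz. */
static uint64_t count_put_frags(uint64_t msg_size, uint64_t max_msg_sz)
{
    uint64_t offset = 0;
    uint64_t nfrags = 0;

    while (offset < msg_size) {
        offset += next_frag_size(msg_size - offset, max_msg_sz);
        nfrags++;
    }
    return nfrags;
}
```

With a 1GB limit, a 3GB message would be split into three writes, and a message below the limit still goes out as a single write.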

 

The bigger problem is with the GET flow.

In this flow the receiver allocates one big buffer and receives the whole message with a single RDMA read.

There is no fragmentation mechanism in this flow, which makes this issue harder to solve.

 

Reading the max message size supported by the HCA can be done through verbs: ibv_query_port() reports it in the max_msg_sz field of struct ibv_port_attr.

 

The second approach is to use RDMA direct only if the message size is smaller than the max message size supported by the HCA.

 

The long message protocol is chosen in ompi/mca/pml/ob1/pml_ob1_sendreq.h, line 382.
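
As a rough sketch of the second approach (not the actual code in pml_ob1_sendreq.h), the check could look like the following. The function use_rdma_direct and its parameters are hypothetical, and I assume that a message exactly at the limit is still acceptable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical predicate: take the RDMA direct path only when RDMA is
 * allowed by the btl flags, the send buffer is already registered, and
 * the message fits in a single HCA message. */
static bool use_rdma_direct(uint64_t msg_size, uint64_t hca_max_msg_sz,
                            bool rdma_allowed, bool buffer_registered)
{
    assert(hca_max_msg_sz > 0);
    /* Assumption: a message exactly at the limit is accepted. */
    return rdma_allowed && buffer_registered && msg_size <= hca_max_msg_sz;
}
```

When the predicate is false, the send would fall back to the other long message protocols, which already fragment the data.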

 

We could use the second approach until a fragmentation mechanism is added to the RDMA direct GET flow.

 

Any comments or suggestions?

 

Thanks,

Doron