Soliciting input from the community:
WHAT: Modify PML cm component to remove unnecessary initializations, optimizing blocking operations
WHY: Removing fast-path overhead by allowing a "direct mode" improves single-packet latency
HOW: In PML cm, even if the request starts and ends within the scope of the blocking send/recv function,
a full request, a structure of up to 488 bytes (not including the MTL request appendix size), may be initialized.
The request includes the ompi_request_t structure used by the underlying MTL component, the converter
that corresponds to the datatype, and other parameters, some of which are stored and only used if the
request is asynchronous. This causes a significant number of writes, especially considering that the send
buffer could be as small as several bytes.
The proposed patch introduces a "direct mode" (currently set iff the underlying MTL is "mxm", which is the
only option I had available for testing). When enabled, it cuts most of the initialization for blocking send and
receive operations down to the bare minimum required to function. Aside from initializing only a part
of the request structure (fields like "dst" and "tag" are passed again to the MTL_CALL macro rather than read
from the request struct anyway), the function uses a single pre-allocated request buffer, which is possible since
the call is blocking. Our tests show that this increases packet rate by approximately 20% with 8-byte buffers.
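
To illustrate the idea, here is a minimal sketch in C of a full-initialization path next to a "direct mode" path. It is not the patch itself: toy_request_t, toy_mtl_send, toy_send_full and toy_send_direct are hypothetical names, the struct layout and sizes are made up, and the sketch assumes a single thread issuing blocking calls.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a full PML request; the real layout differs. */
typedef struct {
    int dst;
    int tag;
    const void *buf;
    size_t len;
    int complete;
    char async_state[448];   /* bookkeeping only nonblocking calls need */
} toy_request_t;

/* Hypothetical transport hook standing in for the MTL send call.
 * dst and tag arrive as arguments, so they need not be read from req. */
static int toy_mtl_send(int dst, int tag, const void *buf, size_t len,
                        toy_request_t *req)
{
    (void)dst; (void)tag; (void)buf; (void)len;
    req->complete = 1;        /* pretend the message left immediately */
    return 0;
}

/* Regular path: every field is written, including async-only state. */
static int toy_send_full(int dst, int tag, const void *buf, size_t len)
{
    toy_request_t req;
    memset(&req, 0, sizeof(req));
    req.dst = dst;
    req.tag = tag;
    req.buf = buf;
    req.len = len;
    return toy_mtl_send(dst, tag, buf, len, &req);
}

/* Direct mode: one pre-allocated request reused by every blocking call.
 * This is only safe because each call completes before the next starts
 * (assumption: no concurrent callers). */
static toy_request_t toy_direct_req;

static int toy_send_direct(int dst, int tag, const void *buf, size_t len)
{
    toy_request_t *req = &toy_direct_req;
    req->complete = 0;        /* the only field this path has to reset */
    int rc = toy_mtl_send(dst, tag, buf, len, req);
    assert(req->complete);    /* blocking call: done before we return */
    return rc;
}
```

For an 8-byte payload, toy_send_full writes the whole structure while toy_send_direct writes a single int, which is the kind of saving described above.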
Note that the "redundant" if-conditions for irrelevant functions (e.g. recv_init) are removed by the compiler,
since the macro expands to a constant condition such as "if (0 == 0)".
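
A rough sketch of how such a macro can collapse at compile time is shown below. TOY_INIT, toy_direct_mode and the two init helpers are hypothetical and are not the macros used in the patch. The assumption is that callers pass a literal 0 or 1 for "blocking", so an optimizing compiler can typically fold the test.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical request type, much smaller than the real one. */
typedef struct {
    int tag;
    int initialized_fields;
    char async_state[64];
} toy_request_t;

static int toy_direct_mode = 1;   /* hypothetical runtime switch */

static void toy_full_init(toy_request_t *r)
{
    assert(r != NULL);
    memset(r, 0, sizeof(*r));
    r->initialized_fields = 2;    /* marker: full path was taken */
}

static void toy_minimal_init(toy_request_t *r)
{
    assert(r != NULL);
    r->tag = 0;
    r->initialized_fields = 1;    /* marker: minimal path was taken */
}

/* "blocking" is a literal at every call site. With blocking == 0 the
 * condition starts with "0 == (0)", which is always true, so the
 * direct-mode test and the else branch can be dropped by the compiler. */
#define TOY_INIT(req, blocking)                          \
    do {                                                 \
        if (0 == (blocking) || !toy_direct_mode) {       \
            toy_full_init(req);                          \
        } else {                                         \
            toy_minimal_init(req);                       \
        }                                                \
    } while (0)

/* Persistent request setup: never eligible for direct mode. */
static int toy_recv_init(toy_request_t *r)
{
    TOY_INIT(r, 0);
    return r->initialized_fields;
}

/* Blocking send: eligible for direct mode when the switch is on. */
static int toy_blocking_send(toy_request_t *r)
{
    TOY_INIT(r, 1);
    return r->initialized_fields;
}
```

Here toy_recv_init always takes the full path without paying for a runtime check, while toy_blocking_send still consults toy_direct_mode.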
WHERE: Most of the files in ompi/mca/pml/cm.
Joshua S. Ladd, PhD
HPC Algorithms Engineer
Cell: +1 (865) 258 - 8898