Open MPI Development Mailing List Archives

Subject: [OMPI devel] RFC: extend the BTL interface to include atomic operations
From: Nathan Hjelm (hjelmn_at_[hidden])
Date: 2014-05-19 17:03:29

What: I want to extend the BTL interface to add support for atomic
operations. The initial cut presented here includes support for the
following features:

 - Support for atomic memory operations if supported by the
   hardware. The attached patch includes support for add, and, or, and
   xor. Adding additional operations should be trivial.

 - Support for both fetching (btl_fop) and non-fetching (btl_op)
   operations.

 - Support for blocking and non-blocking operations. Admittedly, I have
   only implemented blocking versions for ugni, so I am not sure the
   interface is ideal. Feedback on the non-blocking support would be
   much appreciated.

 - Support for compare-and-swap operations.

For simplicity this interface only supports 64-bit operations. I have
not thought about how to extend the interface to support arbitrary
sizes. Would it be useful to look at adding support for 32-bit or
128-bit operations? How about other datatypes (float)?
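
To make the distinction between the fetching, non-fetching, and
compare-and-swap forms concrete, here is a minimal local sketch using C11
atomics on a 64-bit word. This is not the proposed BTL interface: every
sketch_* name and the operation enum are hypothetical, and a real BTL would
perform the operation on remote memory through the network hardware.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical operation codes for the four operations named in the RFC. */
typedef enum {
    SKETCH_OP_ADD,
    SKETCH_OP_AND,
    SKETCH_OP_OR,
    SKETCH_OP_XOR
} sketch_op_t;

/* Fetching form: apply the operation and return the value the target
 * held immediately before the update. */
static int64_t sketch_fop(_Atomic int64_t *target, sketch_op_t op, int64_t operand)
{
    assert(target != NULL);
    switch (op) {
    case SKETCH_OP_ADD: return atomic_fetch_add(target, operand);
    case SKETCH_OP_AND: return atomic_fetch_and(target, operand);
    case SKETCH_OP_OR:  return atomic_fetch_or(target, operand);
    case SKETCH_OP_XOR: return atomic_fetch_xor(target, operand);
    }
    return 0; /* not reached for valid operation codes */
}

/* Non-fetching form: same update, but the previous value is discarded. */
static void sketch_op(_Atomic int64_t *target, sketch_op_t op, int64_t operand)
{
    (void) sketch_fop(target, op, operand);
}

/* Compare-and-swap: store `value` only if the target equals `compare`.
 * Always returns the value observed in the target before the attempt. */
static int64_t sketch_cswap(_Atomic int64_t *target, int64_t compare, int64_t value)
{
    int64_t observed = compare;
    assert(target != NULL);
    /* On failure the call writes the actual contents into `observed`;
     * on success `observed` still equals `compare`, the old value. */
    atomic_compare_exchange_strong(target, &observed, value);
    return observed;
}
```

A caller can tell whether sketch_cswap succeeded by comparing its return
value against the `compare` argument.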

Additionally, I added a new prepare function, btl_prepare_rdma. It exists
for two reasons: 1) to prepare a fragment with no indication of what
endpoint will use it, and 2) to remove the need for a convertor where a
convertor is not needed. The function is meant to do whatever is needed to
prepare the region for arbitrary put, get, and atomic operations.

Why: To provide optimal support for one-sided operations I need access
to the following at a minimum: atomic add, atomic fetch-and-add, and
atomic compare-and-swap. This interface was designed with these
operations in mind. To give a little hint as to the performance
improvements we can realize with atomics/rdma support in osc:

Results obtained from osu_put_latency on two nodes of a Cray XE6 using
the uGNI btl.

# OSU MPI_Put latency Test
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_lock/unlock
# Size Latency (us)
0 7.07
1 7.33
2 7.44
4 7.26
8 7.29
16 7.37
32 7.08
64 7.10
128 7.22
256 7.48
512 7.70

Same benchmark, same nodes, atomic/rdma implementation:

# OSU MPI_Put latency Test
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_lock/unlock
# Size Latency (us)
0 1.28
1 2.40
2 2.41
4 2.41
8 2.40
16 2.45
32 2.47
64 2.52
128 2.53
256 2.58
512 2.65
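
As an illustration of why atomic add, fetch-and-add, and compare-and-swap
are the minimum set for passive-target synchronization such as the
MPI_Win_lock/unlock used in the benchmark above, here is a sketch of a
shared/exclusive lock built on a single 64-bit word. It uses local C11
atomics and is only one possible scheme, not the osc implementation; the
bit layout and all sketch_* names are assumptions made for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: bit 62 marks an exclusive holder, the low bits count
 * shared holders. */
#define SKETCH_LOCK_EXCLUSIVE (INT64_C(1) << 62)

/* Exclusive lock attempt: compare-and-swap from "unlocked" (0). */
static bool sketch_try_lock_exclusive(_Atomic int64_t *lock)
{
    int64_t expected = 0;
    return atomic_compare_exchange_strong(lock, &expected, SKETCH_LOCK_EXCLUSIVE);
}

/* Shared lock attempt: fetch-and-add registers the reader and reports
 * whether an exclusive holder was present at that moment. */
static bool sketch_try_lock_shared(_Atomic int64_t *lock)
{
    int64_t previous = atomic_fetch_add(lock, 1);
    if (previous & SKETCH_LOCK_EXCLUSIVE) {
        atomic_fetch_add(lock, -1); /* back out; result not needed */
        return false;
    }
    return true;
}

/* Shared unlock: a plain atomic add of -1 would suffice; the fetched
 * value is used here only for a sanity check. */
static void sketch_unlock_shared(_Atomic int64_t *lock)
{
    int64_t previous = atomic_fetch_add(lock, -1);
    assert((previous & ~SKETCH_LOCK_EXCLUSIVE) > 0);
    (void) previous;
}

/* Exclusive unlock: subtract the flag rather than storing 0, so that a
 * transient increment from a failed shared attempt is not lost. */
static void sketch_unlock_exclusive(_Atomic int64_t *lock)
{
    int64_t previous = atomic_fetch_add(lock, -SKETCH_LOCK_EXCLUSIVE);
    assert(previous & SKETCH_LOCK_EXCLUSIVE);
    (void) previous;
}
```

In this scheme each lock or unlock maps onto a single atomic operation on
the target's lock word, with one extra add to back out a failed shared
attempt.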

When: This RFC is intended to start a discussion on what the atomic
interface for the BTLs should look like. I have no reservations about
completely re-thinking the interface as long as it doesn't 1) add the
convertor into my osc critical path, or 2) require allocation of btl
fragments for atomic operations. Let's plan on discussing extending the
BTLs at the June developer meeting. All I care about is getting the
design finalized in time for the 1.9 branch.

I attached a patch with the proposed BTL extension. I am leaving out
the ugni implementation of the interface for now.

-Nathan Hjelm