+1, but I helped come up with the idea. :-)
On Jul 18, 2013, at 5:32 PM, "Barrett, Brian W" <bwbarre_at_[hidden]> wrote:
> What: Change the ompi_proc_t endpoint data lookup to be more flexible
> Why: As collectives and one-sided components are using transports
> directly, an old problem of endpoint tracking is resurfacing. We need a
> fix that doesn't suck.
> When: Assuming there are no major objections, I'll start writing the code
> next week...
> More Info:
> Today, endpoint information is stored in one of two places on the
> ompi_proc_t: proc_pml and proc_bml. The proc_pml pointer is an opaque
> structure having meaning only to the PML and the proc_bml pointer is an
> opaque structure having meaning only to the BML. CM, OB1, and BFO don't
> use proc_pml, although the MTLs store their endpoint data on the proc_pml.
> R2 uses the proc_bml to hold an opaque data structure which holds all the
> btl endpoint data.
> The specific problem is the Portals 4 collective and one-sided components.
> They both need endpoint information for communication (obviously).
> Before there was a Portals 4 BTL, they peeked at the proc_pml pointer,
> knew what it looked like, and were ok. Now the data they need is possibly
> in the proc_pml or in the (opaque) proc_bml, which poses a problem.
> Jeff and I talked about this and had a number of restrictions that seemed
> to make sense for a solution:
> * Don't make ompi_proc_t bigger than absolutely necessary
> * Avoid adding extra indirection into the endpoint resolution path
> * Allow enough flexibility that IB or friends could use the same
> * Don't break the BML / BTL interface (too much work)
> What we came up with was a two pronged approach, depending on run-time
> First, rather than having the proc_pml and proc_bml on the ompi_proc_t, we
> would have a proc_endpoint array of fixed size. The size of the array
> would be determined at compile time based on compile-time registering of
> endpoint slots. At compile time, a #define with a component's slot would
> be set, removing any extra indexing overhead over today's mechanism. So
> R2 would have a call in its configure.m4 like:
> And would then find its endpoint data with a call like:
> r2_endpoint = proc->proc_endpoint[OMPI_ENDPOINT_TAG_BML_R2];
> which (assuming modest compiler optimization) is instruction equivalent to:
> r2_endpoint = proc->proc_bml;
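A minimal C sketch of the static-slot idea described above, under stated assumptions: static_proc_t and static_r2_endpoint_t are simplified stand-ins for ompi_proc_t and R2's opaque data, and STATIC_ENDPOINT_TAG_MAX plus the tag value 0 are hypothetical placeholders for whatever configure would generate. It only illustrates why a constant tag adds no indexing cost.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical values: in the proposal these would be generated at
 * configure time, one #define per registered component. */
#define OMPI_ENDPOINT_TAG_BML_R2 0
#define STATIC_ENDPOINT_TAG_MAX  1   /* number of static slots (assumed name) */

/* Simplified stand-in for ompi_proc_t; the real structure has more fields. */
typedef struct static_proc {
    void *proc_endpoint[STATIC_ENDPOINT_TAG_MAX];
} static_proc_t;

/* Stand-in for R2's opaque per-peer endpoint data. */
typedef struct static_r2_endpoint {
    int num_btls;
} static_r2_endpoint_t;

/* Store R2's endpoint data in its compile-time slot. */
static void static_r2_set(static_proc_t *proc, static_r2_endpoint_t *ep)
{
    assert(NULL != proc);
    proc->proc_endpoint[OMPI_ENDPOINT_TAG_BML_R2] = ep;
}

/* The index is a compile-time constant, so the slot sits at a fixed
 * offset from proc, just as a named field such as proc_bml would. */
static static_r2_endpoint_t *static_r2_get(const static_proc_t *proc)
{
    return (static_r2_endpoint_t *) proc->proc_endpoint[OMPI_ENDPOINT_TAG_BML_R2];
}
```

A caller would do static_r2_set(&proc, &ep) once during add_procs-style setup and static_r2_get(&proc) on the communication path.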
> To allow for dynamic indexing (something we haven't had to date), the last
> entry in the array would be a pointer to an object like an
> opal_pointer_array, but without the locking, and some allocation calls
> during init. Since the indexes never need to be used by a remote process,
> there's no synchronization required in registering. The dynamic indexing
> could be turned off at configure time for space-conscious builds. For
> example, on our big systems, I disable dlopen support, so static
> allocation of endpoint slots is good enough.
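The dynamic part could look roughly like the following C sketch. Everything here is an assumption made for illustration: the dyn_* names, the growth policy, and the process-local tag counter are hypothetical, and this is not the opal_pointer_array API. It shows a lock-free, lazily allocated table hanging off the last proc_endpoint entry.

```c
#include <assert.h>
#include <stdlib.h>

#define DYN_STATIC_SLOTS 1                 /* e.g. only BML_R2 (assumed) */
#define DYN_TABLE_SLOT   DYN_STATIC_SLOTS  /* last entry holds the table */

/* Simplified stand-in for ompi_proc_t. */
typedef struct dyn_proc {
    void *proc_endpoint[DYN_STATIC_SLOTS + 1];
} dyn_proc_t;

/* Growable pointer table; no locking, as in the proposal. */
typedef struct dyn_table {
    int size;
    void **slots;
} dyn_table_t;

/* Tags are never seen by a remote process, so a plain process-local
 * counter is enough to hand them out. */
static int dyn_next_tag = 0;

static int dyn_register_tag(void)
{
    return dyn_next_tag++;
}

/* Store data under a dynamic tag; returns 0 on success, -1 on error.
 * The table is allocated on first use and grown on demand. */
static int dyn_set(dyn_proc_t *proc, int tag, void *data)
{
    dyn_table_t *t = proc->proc_endpoint[DYN_TABLE_SLOT];
    if (tag < 0) return -1;
    if (NULL == t) {
        t = malloc(sizeof(*t));
        if (NULL == t) return -1;
        t->size = 0;
        t->slots = NULL;
        proc->proc_endpoint[DYN_TABLE_SLOT] = t;
    }
    if (tag >= t->size) {
        int new_size = tag + 1;
        void **grown = realloc(t->slots, (size_t) new_size * sizeof(void *));
        if (NULL == grown) return -1;
        for (int i = t->size; i < new_size; ++i) grown[i] = NULL;
        t->slots = grown;
        t->size = new_size;
    }
    t->slots[tag] = data;
    return 0;
}

/* Look up a dynamic tag: one extra pointer chase plus a bounds check
 * compared with a static slot. */
static void *dyn_get(const dyn_proc_t *proc, int tag)
{
    const dyn_table_t *t = proc->proc_endpoint[DYN_TABLE_SLOT];
    if (NULL == t || tag < 0 || tag >= t->size) return NULL;
    return t->slots[tag];
}
```

The extra load and bounds check in dyn_get is the "higher cost for dynamic tags" that the static slots avoid.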
> In the average build, the only tag registered would be BML_R2. If we lazy
> allocate the pointer array element, that's two entries in the
> proc_endpoint array, so the same size as today. I was going to have the
> CM stop using the endpoint and push that handling on the MTL. Assuming
> all MTLs but Portals shared the same tag (easy to do), there'd be an
> 8*nprocs increase in space used per process if an MTL was built, but if
> you disabled R2, that disappears.
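The space arithmetic above can be checked with a small C sketch. The helper name endpoint_bytes is hypothetical, and the "8" in the text assumes 8-byte pointers, so the code uses sizeof(void *) instead of hard-coding it.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes spent on endpoint slots across all peers: each ompi_proc_t
 * carries nslots pointers, and a process holds one per peer. */
static size_t endpoint_bytes(size_t nslots, size_t nprocs)
{
    return nslots * sizeof(void *) * nprocs;
}
```

Today's layout (proc_pml plus proc_bml) and the average proposed build (BML_R2 plus the dynamic table pointer) both give endpoint_bytes(2, nprocs). Adding one shared MTL tag gives endpoint_bytes(3, nprocs), a difference of sizeof(void *) * nprocs, which is the 8*nprocs figure on a 64-bit platform.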
> How does this solve my problem? Rather than having Portals 4 use the MTL
> tag, it would have its own tag, shared between the MTL, BTL, OSC, and
> COLL components. Since the chances of Portals 4 being built on a platform
> with support for another MTL is almost zero, in most cases, the size of
> the ompi_proc_t only increases by 8 bytes over today's setup. Since most
> Portals 4 builds will be on more static platforms, I can disable dynamic
> indexing and be back at today's size, but with an easy way to deal with
> endpoint data sharing between components of different frameworks.
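A hedged C sketch of how one tag might be shared across frameworks. The tag name OMPI_ENDPOINT_TAG_PORTALS4, the shared_* types, the nid/pid fields, and the "first publisher wins" rule are all assumptions for illustration; the RFC names only the BML_R2 tag and does not specify who fills the slot.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tag values; only BML_R2 is named in the RFC. */
#define OMPI_ENDPOINT_TAG_BML_R2   0
#define OMPI_ENDPOINT_TAG_PORTALS4 1
#define SHARED_ENDPOINT_TAG_MAX    2

/* Simplified stand-in for ompi_proc_t. */
typedef struct shared_proc {
    void *proc_endpoint[SHARED_ENDPOINT_TAG_MAX];
} shared_proc_t;

/* Hypothetical Portals 4 per-peer addressing data. */
typedef struct shared_ptl_endpoint {
    unsigned int nid;
    unsigned int pid;
} shared_ptl_endpoint_t;

/* Called by whichever Portals 4 component sets up the peer first
 * (MTL or BTL in this sketch); later callers leave the slot alone. */
static void shared_ptl_publish(shared_proc_t *proc, shared_ptl_endpoint_t *ep)
{
    assert(NULL != ep);
    if (NULL == proc->proc_endpoint[OMPI_ENDPOINT_TAG_PORTALS4]) {
        proc->proc_endpoint[OMPI_ENDPOINT_TAG_PORTALS4] = ep;
    }
}

/* Called by any Portals 4 component (OSC, COLL, ...) that needs the
 * peer's address, without peeking at PML- or BML-private data. */
static const shared_ptl_endpoint_t *shared_ptl_lookup(const shared_proc_t *proc)
{
    return proc->proc_endpoint[OMPI_ENDPOINT_TAG_PORTALS4];
}
```

Because every Portals 4 component reads the same slot, the OSC and COLL components no longer need to know whether the MTL or the BTL was the one selected.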
> So, to review our original goals:
> * ompi_proc_t will remain the same size on most platforms, increase by
> 8*nprocs bytes if an MTL is built, but can shrink by 8*nprocs bytes on
> static systems (by disabling dynamic indexing and building only one of
> either the MTLs or BMLs).
> * If you're using a pre-allocated tag, there's no extra indirection or
> math, assuming basic compiler optimization. There is a higher cost for
> dynamic tags, but that's probably ok for us.
> * I think that IB could start registering a tag if it needed for sharing
> QP information between frameworks, at the cost of an extra tag. Probably
> makes the most sense for the MXM case (assuming someone writes an MXM osc
> * The PML interface would change slightly (remove about 5 lines of code
> / pml). The MTL would have to change a bit to look at their own tag
> instead of the proc_pml (fairly easy). The R2 BML would need to change to
> use proc_endpoint[OMPI_ENDPOINT_TAG_BML_R2] instead of proc_bml, but that
> shouldn't be hard. The consumers of the BML (OB1, BFO, RDMA OSC, etc.)
> would not have to change.
> I know RFCs are usually sent after the code is written, but I wanted some
> thoughts before I started coding, since it's kind of a high impact change
> to a performance-critical piece of OMPI.
> Brian W. Barrett
> Scalable System Software Group
> Sandia National Laboratories