Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: Change ompi_proc_t endpoint data lookup
From: George Bosilca (bosilca_at_[hidden])
Date: 2013-07-19 12:58:38


Brian,

I have a few questions/comments about the proposed approach:

1. The BML endpoint structure (a.k.a. the BML proc) is well known and defined in bml.h. So it is not technically opaque…

2. When allocating an ompi_proc_t structure, you will always have to allocate an array large enough to hold the maximum number of slots detected during configure. There is significant potential for oversized arrays in one of the most space-critical structures.

3. I don't know at which point this really matters, but with this change two Open MPI libraries might become binary incompatible (if the #define is exchanged between nodes).
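
One way to picture points 2 and 3 is the sketch below, which compares two hypothetical builds of the same structure. The type names, the trailing `proc_flags` field, and the slot counts are assumptions made up for illustration; they are not the real ompi_proc_t layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical build A: configure registered a single endpoint tag. */
typedef struct proc_build_a_t {
    void *proc_endpoint[1];
    int   proc_flags;          /* made-up field that follows the array */
} proc_build_a_t;

/* Hypothetical build B: configure registered three endpoint tags. */
typedef struct proc_build_b_t {
    void *proc_endpoint[3];
    int   proc_flags;
} proc_build_b_t;

/* Bytes by which every field after the array shifts between the builds.
 * Each proc pays this cost even if the extra slots stay NULL at run time. */
static size_t sketch_extra_bytes_per_proc(void)
{
    return offsetof(proc_build_b_t, proc_flags)
         - offsetof(proc_build_a_t, proc_flags);
}
```

In this sketch, code compiled against build A would read `proc_flags` at the wrong offset in a structure created by build B, and a tag value chosen in one build need not name the same slot in the other.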

George.

On Jul 18, 2013, at 23:32 , "Barrett, Brian W" <bwbarre_at_[hidden]> wrote:

> What: Change the ompi_proc_t endpoint data lookup to be more flexible
>
> Why: As collectives and one-sided components are using transports
> directly, an old problem of endpoint tracking is resurfacing. We need a
> fix that doesn't suck.
>
> When: Assuming there are no major objections, I'll start writing the code
> next week...
>
> More Info:
>
> Today, endpoint information is stored in one of two places on the
> ompi_proc_t: proc_pml and proc_bml. The proc_pml pointer is an opaque
> structure having meaning only to the PML and the proc_bml pointer is an
> opaque structure having meaning only to the BML. CM, OB1, and BFO don't
> use proc_pml, although the MTLs store their endpoint data on the proc_pml.
> R2 uses the proc_bml to hold an opaque data structure which holds all the
> btl endpoint data.
>
> The specific problem is the Portals 4 collective and one-sided components.
> They both need endpoint information for communication (obviously).
> Before there was a Portals 4 BTL, they peeked at the proc_pml pointer,
> knew what it looked like, and were ok. Now the data they need is possibly
> in the proc_pml or in the (opaque) proc_bml, which poses a problem.
>
> Jeff and I talked about this and had a number of restrictions that seemed
> to make sense for a solution:
>
> * Don't make ompi_proc_t bigger than absolutely necessary
> * Avoid adding extra indirection into the endpoint resolution path
> * Allow enough flexibility that IB or friends could use the same
> mechanism
> * Don't break the BML / BTL interface (too much work)
>
> What we came up with was a two-pronged approach, depending on run-time
> needs.
>
> First, rather than having the proc_pml and proc_bml on the ompi_proc_t, we
> would have a proc_endpoint[] array of fixed size. The size of the array
> would be determined at compile time based on compile-time registering of
> endpoint slots. At compile time, a #define with a component's slot would
> be set, removing any extra indexing overhead over today's mechanism. So
> R2 would have a call in its configure.m4 like:
>
> OMPI_REQUIRE_ENDPOINT_TAG(BML_R2)
>
> And would then find its endpoint data with a call like:
>
> r2_endpoint = proc->proc_endpoint[OMPI_ENDPOINT_TAG_BML_R2];
>
> which (assuming modest compiler optimization) is instruction equivalent to:
>
> r2_endpoint = proc->proc_bml;
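
A minimal sketch of the static-slot lookup described above. Only `proc_endpoint` and `OMPI_ENDPOINT_TAG_BML_R2` come from the proposal; the structure name, the `OMPI_ENDPOINT_TAG_MAX` constant, the helper functions, and the hand-written tag values are assumptions. In the proposal the tag values would be generated at configure time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical values; the proposal would have configure emit these
 * in response to OMPI_REQUIRE_ENDPOINT_TAG(BML_R2). */
#define OMPI_ENDPOINT_TAG_BML_R2 0
#define OMPI_ENDPOINT_TAG_MAX    1   /* number of statically registered slots */

/* Simplified stand-in for ompi_proc_t; the real structure has more fields. */
typedef struct sketch_proc_t {
    void *proc_endpoint[OMPI_ENDPOINT_TAG_MAX];
} sketch_proc_t;

/* Store the R2 endpoint in its compile-time slot. */
static inline void sketch_set_r2_endpoint(sketch_proc_t *proc, void *ep)
{
    proc->proc_endpoint[OMPI_ENDPOINT_TAG_BML_R2] = ep;
}

/* The index is a compile-time constant, so this is a single load at a
 * fixed offset from proc, the same shape as reading a named field. */
static inline void *sketch_get_r2_endpoint(const sketch_proc_t *proc)
{
    return proc->proc_endpoint[OMPI_ENDPOINT_TAG_BML_R2];
}
```
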
>
> To allow for dynamic indexing (something we haven't had to date), the last
> entry in the array would be a pointer to an object like an
> opal_pointer_array, but without the locking, and some allocation calls
> during init. Since the indexes never need to be used by a remote process,
> there's no synchronization required in registering. The dynamic indexing
> could be turned off at configure time for space-conscious builds. For
> example, on our big systems, I disable dlopen support, so static
> allocation of endpoint slots is good enough.
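
The dynamic-indexing idea might look roughly like the sketch below: the last slot of the fixed array holds a lazily allocated, lock-free growable table. All names here are hypothetical, and the table is a hand-rolled stand-in rather than opal_pointer_array or any other existing Open MPI type.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_STATIC_SLOTS 1                    /* e.g. only BML_R2 */
#define SKETCH_DYNAMIC_SLOT SKETCH_STATIC_SLOTS  /* last entry of the array */

/* Growable table of endpoint pointers; no locking by design. */
typedef struct sketch_dyn_table_t {
    size_t size;
    void **entries;
} sketch_dyn_table_t;

typedef struct sketch_proc_t {
    void *proc_endpoint[SKETCH_STATIC_SLOTS + 1];
} sketch_proc_t;

static int sketch_next_dynamic_tag = 0;

/* Tags are process-local and handed out during init, so registration
 * needs no synchronization with remote processes. */
static int sketch_register_dynamic_tag(void)
{
    return sketch_next_dynamic_tag++;
}

/* Store ep under a dynamic tag; allocates the table on first use. */
static int sketch_set_dynamic(sketch_proc_t *proc, int tag, void *ep)
{
    sketch_dyn_table_t *t = proc->proc_endpoint[SKETCH_DYNAMIC_SLOT];
    if (tag < 0) return -1;
    if (NULL == t) {
        t = calloc(1, sizeof(*t));
        if (NULL == t) return -1;
        proc->proc_endpoint[SKETCH_DYNAMIC_SLOT] = t;
    }
    if ((size_t)tag >= t->size) {
        size_t new_size = (size_t)tag + 1;
        void **grown = realloc(t->entries, new_size * sizeof(void *));
        if (NULL == grown) return -1;
        for (size_t i = t->size; i < new_size; ++i) grown[i] = NULL;
        t->entries = grown;
        t->size = new_size;
    }
    t->entries[tag] = ep;
    return 0;
}

/* Extra pointer chase and bounds check: the higher cost of dynamic tags. */
static void *sketch_get_dynamic(const sketch_proc_t *proc, int tag)
{
    const sketch_dyn_table_t *t = proc->proc_endpoint[SKETCH_DYNAMIC_SLOT];
    if (NULL == t || tag < 0 || (size_t)tag >= t->size) return NULL;
    return t->entries[tag];
}
```

A proc that never stores a dynamic endpoint keeps a NULL in the last slot, so the only fixed cost in this sketch is that one pointer.
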
>
> In the average build, the only tag registered would be BML_R2. If we lazy
> allocate the pointer array element, that's two entries in the
> proc_endpoint array, so the same size as today. I was going to have the
> CM stop using the endpoint and push that handling on the MTL. Assuming
> all MTLs but Portals shared the same tag (easy to do), there'd be an
> 8*nprocs increase in space used per process if an MTL was built, but if
> you disabled R2, that disappears.
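
The 8*nprocs figure follows from each process holding one proc structure per peer, with each added slot costing one pointer. A small sketch of that arithmetic, assuming 8-byte pointers as the text does; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Extra memory per process when every proc structure gains extra_slots
 * pointer-sized entries. With 8-byte pointers and one extra slot this is
 * the 8*nprocs figure from the text. */
static size_t sketch_extra_bytes(size_t nprocs, size_t extra_slots)
{
    return nprocs * extra_slots * sizeof(void *);
}
```

For example, on a platform with 8-byte pointers, one extra slot across 1,000,000 peers costs 8,000,000 bytes in each process.
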
>
> How does this solve my problem? Rather than having Portals 4 use the MTL
> tag, it would have its own tag, shared between the MTL, BTL, OSC, and
> COLL components. Since the chances of Portals 4 being built on a platform
> with support for another MTL is almost zero, in most cases, the size of
> the ompi_proc_t only increases by 8 bytes over today's setup. Since most
> Portals 4 builds will be on more static platforms, I can disable dynamic
> indexing and be back at today's size, but with an easy way to deal with
> endpoint data sharing between components of different frameworks.
>
> So, to review our original goals:
>
> * ompi_proc_t will remain the same size on most platforms, increase by
> 8*nprocs bytes if an MTL is built, but can shrink by 8*nprocs bytes on
> static systems (by disabling dynamic indexing and building only one of
> either the MTLs or BMLs).
> * If you're using a pre-allocated tag, there's no extra indirection or
> math, assuming basic compiler optimization. There is a higher cost for
> dynamic tags, but that's probably ok for us.
> * I think that IB could start registering a tag if needed for sharing
> QP information between frameworks, at the cost of an extra tag. Probably
> makes the most sense for the MXM case (assuming someone writes an MXM osc
> component).
> * The PML interface would change slightly (remove about 5 lines of code
> / pml). The MTLs would have to change a bit to look at their own tag
> instead of the proc_pml (fairly easy). The R2 BML would need to change to
> use proc_endpoint[OMPI_ENDPOINT_TAG_BML_R2] instead of proc_bml, but that
> shouldn't be hard. The consumers of the BML (OB1, BFO, RDMA OSC, etc.)
> would not have to change.
>
> I know RFCs are usually sent after the code is written, but I wanted some
> thoughts before I started coding, since it's kind of a high impact change
> to a performance-critical piece of OMPI.
>
> Thoughts?
>
> Brian
>
> --
> Brian W. Barrett
> Scalable System Software Group
> Sandia National Laboratories
>