
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] RFC: Change ompi_proc_t endpoint data lookup
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-07-18 21:39:20


+1, though I do have a question.

We are looking at exascale requirements, and one of the big issues is memory footprint. We currently retrieve the endpoint info for every process in the job, plus all the procs in any communicator with which we do a connect/accept - even though we probably will only communicate with a small number of them. This wastes a lot of memory at scale.

As long as we are re-working the endpoint stuff, would it make sense to go ahead and change how we handle the above? I'm looking to switch to a lazy definition approach where we compute endpoints for procs on first-message instead of during mpi_init, retrieving the endpoint info for that proc only at that time. So instead of each proc storing the endpoint info for every other proc, each proc would hold only the info it actually requires for that application.

Ideally, I'd like to see that extended to the ompi_proc_t array itself - maybe changing it to a sparse array/list of some type, so we only create that storage for procs we actually communicate with.
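
To make the lazy idea concrete, here is a minimal sketch in C. It is not Open MPI code: `lazy_endpoints_t`, `lazy_endpoint_get`, and `fetch_peer_addr` are hypothetical names, and `fetch_peer_addr` only stands in for whatever would actually retrieve one peer's endpoint info. The point is that an endpoint is built the first time a peer is used, not during mpi_init.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical endpoint record; real endpoint data is transport-specific. */
typedef struct {
    int rank;   /* peer this endpoint refers to */
    int addr;   /* stand-in for the retrieved address info */
} endpoint_t;

/* Per-process table: a slot per peer, filled only on first use. */
typedef struct {
    size_t nprocs;
    size_t nresolved;     /* number of endpoints actually built so far */
    endpoint_t **slots;   /* NULL until the first message to that peer */
} lazy_endpoints_t;

/* Assumption: placeholder for fetching one peer's endpoint info. */
static int fetch_peer_addr(int rank) { return 1000 + rank; }

static int lazy_endpoints_init(lazy_endpoints_t *t, size_t nprocs)
{
    t->nprocs = nprocs;
    t->nresolved = 0;
    t->slots = malloc(nprocs * sizeof(*t->slots));
    if (NULL == t->slots) return -1;
    for (size_t i = 0; i < nprocs; i++) t->slots[i] = NULL;
    return 0;
}

/* Return the endpoint for rank, building it on first request only. */
static endpoint_t *lazy_endpoint_get(lazy_endpoints_t *t, int rank)
{
    assert(rank >= 0 && (size_t)rank < t->nprocs);
    if (NULL == t->slots[rank]) {
        endpoint_t *ep = malloc(sizeof(*ep));
        if (NULL == ep) return NULL;
        ep->rank = rank;
        ep->addr = fetch_peer_addr(rank);
        t->slots[rank] = ep;
        t->nresolved++;
    }
    return t->slots[rank];
}

static void lazy_endpoints_fini(lazy_endpoints_t *t)
{
    for (size_t i = 0; i < t->nprocs; i++) free(t->slots[i]);
    free(t->slots);
    t->slots = NULL;
    t->nprocs = t->nresolved = 0;
}
```

Note that this sketch still keeps a dense array of pointers (one per peer). Getting to the sparse array/list mentioned above would mean replacing `slots` with a hash table or similar container keyed by rank.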

If you'd prefer to discuss this as a separate issue, that's fine - just something we need to work on at some point in the next year or two.

On Jul 18, 2013, at 6:26 PM, "Jeff Squyres (jsquyres)" <jsquyres_at_[hidden]> wrote:

> +1, but I helped come up with the idea. :-)
>
>
> On Jul 18, 2013, at 5:32 PM, "Barrett, Brian W" <bwbarre_at_[hidden]> wrote:
>
>> What: Change the ompi_proc_t endpoint data lookup to be more flexible
>>
>> Why: As collectives and one-sided components are using transports
>> directly, an old problem of endpoint tracking is resurfacing. We need a
>> fix that doesn't suck.
>>
>> When: Assuming there are no major objections, I'll start writing the code
>> next week...
>>
>> More Info:
>>
>> Today, endpoint information is stored in one of two places on the
>> ompi_proc_t: proc_pml and proc_bml. The proc_pml pointer is an opaque
>> structure having meaning only to the PML and the proc_bml pointer is an
>> opaque structure having meaning only to the BML. CM, OB1, and BFO don't
>> use proc_pml, although the MTLs store their endpoint data on the proc_pml.
>> R2 uses the proc_bml to hold an opaque data structure which holds all the
>> btl endpoint data.
>>
>> The specific problem is the Portals 4 collective and one-sided components.
>> They both need endpoint information for communication (obviously).
>> Before there was a Portals 4 BTL, they peeked at the proc_pml pointer,
>> knew what it looked like, and were ok. Now the data they need is possibly
>> in the proc_pml or in the (opaque) proc_bml, which poses a problem.
>>
>> Jeff and I talked about this and had a number of restrictions that seemed
>> to make sense for a solution:
>>
>> * Don't make ompi_proc_t bigger than absolutely necessary
>> * Avoid adding extra indirection into the endpoint resolution path
>> * Allow enough flexibility that IB or friends could use the same
>> mechanism
>> * Don't break the BML / BTL interface (too much work)
>>
>> What we came up with was a two pronged approach, depending on run-time
>> needs.
>>
>> First, rather than having the proc_pml and proc_bml on the ompi_proc_t, we
>> would have a proc_endpoint[] array of fixed size. The size of the array
>> would be determined at compile time based on compile-time registering of
>> endpoint slots. At compile time, a #define with a component's slot would
>> be set, removing any extra indexing overhead over today's mechanism. So
>> R2 would have a call in its configure.m4 like:
>>
>> OMPI_REQUIRE_ENDPOINT_TAG(BML_R2)
>>
>> And would then find its endpoint data with a call like:
>>
>> r2_endpoint = proc->proc_endpoint[OMPI_ENDPOINT_TAG_BML_R2];
>>
>> which (assuming modest compiler optimization) is instruction equivalent to:
>>
>> r2_endpoint = proc->proc_bml;
>>
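
[Editorial sketch, not part of Brian's original message.] The fixed-slot lookup described above can be illustrated with a small self-contained C example. Everything here is an assumption made for illustration: `sketch_proc_t`, the `SKETCH_ENDPOINT_TAG_*` constants, and the accessor are hypothetical stand-ins for what the build system would generate from the registered tags, and the second tag is included only to show two components coexisting.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed to be generated at build time from the registered tags. */
enum {
    SKETCH_ENDPOINT_TAG_BML_R2 = 0,
    SKETCH_ENDPOINT_TAG_PORTALS4 = 1,   /* hypothetical second tag */
    SKETCH_ENDPOINT_TAG_MAX
};

/* Simplified stand-in for ompi_proc_t: only the endpoint slots. */
typedef struct {
    void *proc_endpoint[SKETCH_ENDPOINT_TAG_MAX];
} sketch_proc_t;

/* With a constant tag, the index folds into a fixed offset from proc,
 * so the load looks like a plain struct member access. */
static inline void *sketch_endpoint_get(const sketch_proc_t *proc, int tag)
{
    return proc->proc_endpoint[tag];
}

static inline void sketch_endpoint_set(sketch_proc_t *proc, int tag, void *ep)
{
    proc->proc_endpoint[tag] = ep;
}
```

Each component reads and writes only its own slot, so components from different frameworks that agree on a tag can share endpoint data without knowing each other's structures.
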
>> To allow for dynamic indexing (something we haven't had to date), the last
>> entry in the array would be a pointer to an object like an
>> opal_pointer_array, but without the locking, and some allocation calls
>> during init. Since the indexes never need to be used by a remote process,
>> there's no synchronization required in registering. The dynamic indexing
>> could be turned off at configure time for space-conscious builds. For
>> example, on our big systems, I disable dlopen support, so static
>> allocation of endpoint slots is good enough.
>>
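
[Editorial sketch, not part of Brian's original message.] One possible shape for the dynamic slots, assuming a lazily allocated, growable pointer array with no locking and a process-local tag counter. The names `dyn_slots_t`, `dyn_tag_register`, `dyn_slots_set`, and `dyn_slots_get` are hypothetical; this is not the opal_pointer_array API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Growable array of endpoint pointers, indexed by dynamic tag. */
typedef struct {
    size_t size;
    void **items;
} dyn_slots_t;

/* Tags are process-local, so a plain counter is enough for this sketch. */
static int next_dynamic_tag = 0;

static int dyn_tag_register(void) { return next_dynamic_tag++; }

/* Store value under tag, allocating or growing the array on demand. */
static int dyn_slots_set(dyn_slots_t **slotsp, int tag, void *value)
{
    dyn_slots_t *slots = *slotsp;
    if (tag < 0) return -1;
    if (NULL == slots) {
        slots = malloc(sizeof(*slots));
        if (NULL == slots) return -1;
        slots->size = 0;
        slots->items = NULL;
        *slotsp = slots;
    }
    if ((size_t)tag >= slots->size) {
        size_t new_size = (size_t)tag + 1;
        void **items = realloc(slots->items, new_size * sizeof(*items));
        if (NULL == items) return -1;
        for (size_t i = slots->size; i < new_size; i++) items[i] = NULL;
        slots->items = items;
        slots->size = new_size;
    }
    slots->items[tag] = value;
    return 0;
}

/* Lookup costs a pointer chase and a bounds check beyond a static slot. */
static void *dyn_slots_get(const dyn_slots_t *slots, int tag)
{
    if (NULL == slots || tag < 0 || (size_t)tag >= slots->size) return NULL;
    return slots->items[tag];
}

static void dyn_slots_free(dyn_slots_t *slots)
{
    if (NULL != slots) {
        free(slots->items);
        free(slots);
    }
}
```

In this sketch the `dyn_slots_t *` would live in the last entry of the fixed array, so a proc that never uses a dynamic tag pays only for one NULL pointer.
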
>> In the average build, the only tag registered would be BML_R2. If we lazy
>> allocate the pointer array element, that's two entries in the
>> proc_endpoint array, so the same size as today. I was going to have the
>> CM stop using the endpoint and push that handling on the MTL. Assuming
>> all MTLs but Portals shared the same tag (easy to do), there'd be an
>> 8*nprocs increase in space used per process if an MTL was built, but if
>> you disabled R2, that disappears.
>>
>> How does this solve my problem? Rather than having Portals 4 use the MTL
>> tag, it would have its own tag, shared between the MTL, BTL, OSC, and
>> COLL components. Since the chances of Portals 4 being built on a platform
>> with support for another MTL is almost zero, in most cases, the size of
>> the ompi_proc_t only increases by 8 bytes over today's setup. Since most
>> Portals 4 builds will be on more static platforms, I can disable dynamic
>> indexing and be back at today's size, but with an easy way to deal with
>> endpoint data sharing between components of different frameworks.
>>
>> So, to review our original goals:
>>
>> * ompi_proc_t will remain the same size on most platforms, increase by
>> 8*nprocs bytes if an MTL is built, but can shrink by 8*nprocs bytes on
>> static systems (by disabling dynamic indexing and building only one of
>> either the MTLs or BMLs).
>> * If you're using a pre-allocated tag, there's no extra indirection or
>> math, assuming basic compiler optimization. There is a higher cost for
>> dynamic tags, but that's probably ok for us.
>> * I think that IB could start registering a tag if it needed one for sharing
>> QP information between frameworks, at the cost of an extra tag. Probably
>> makes the most sense for the MXM case (assuming someone writes an MXM osc
>> component).
>> * The PML interface would change slightly (remove about 5 lines of code
>> / pml). The MTL would have to change a bit to look at their own tag
>> instead of the proc_pml (fairly easy). The R2 BML would need to change to
>> use proc_endpoint[OMPI_ENDPOINT_TAG_BML_R2] instead of proc_bml, but that
>> shouldn't be hard. The consumers of the BML (OB1, BFO, RDMA OSC, etc.)
>> would not have to change.
>>
>> I know RFCs are usually sent after the code is written, but I wanted some
>> thoughts before I started coding, since it's kind of a high impact change
>> to a performance-critical piece of OMPI.
>>
>> Thoughts?
>>
>> Brian
>>
>> --
>> Brian W. Barrett
>> Scalable System Software Group
>> Sandia National Laboratories
>>
>>
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/