Open MPI Development Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-07-25 16:56:02


On Jul 25, 2007, at 10:39 AM, Brian Barrett wrote:

> I have an even bigger objection than Rich. It's near impossible to
> measure the latency impact of something like this, but it does have
> an additive effect. It doesn't make sense to have all that code in
> the critical path for systems where it's not needed. We should leave
> the compile time decision available, unless there's a very good
> reason (which I did not see in this e-mail) to remove it.

It just adds a lot of #if's throughout the code. Other than that,
there's no reason to remove it.

> Brian
>
> On Jul 25, 2007, at 8:00 AM, Richard Graham wrote:
>
>> This is good work, so I am happy to see it come over. My initial
>> understanding was that there would be compile time protection for
>> this. In the absence of this, I think we need to see performance
>> data on a variety of communication substrates. It seems like a
>> latency measurement is, perhaps, the most sensitive measurement, and
>> should be sufficient to see the impact on the critical path.
>>
>> Rich
>>
>>
>> On 7/25/07 9:04 AM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>>
>>> WHAT:    Merge the sparse groups work to the trunk; get the
>>>          community's opinion on one remaining issue.
>>> WHY:     For large MPI jobs, it can be memory-prohibitive to fully
>>>          represent dense groups; you can save a lot of space by
>>>          having "sparse" representations of groups that are (for
>>>          example) derived from MPI_COMM_WORLD.
>>> WHERE:   Main changes are (might have missed a few in this analysis,
>>>          but this is 99% of it):
>>>          - Big changes in ompi/group
>>>          - Moderate changes in ompi/comm
>>>          - Trivial changes in ompi/mpi/c, ompi/mca/pml/[dr|ob1],
>>>            ompi/mca/comm/sm
>>> WHEN:    The code is ready now in /tmp/sparse-groups (it is passing
>>>          all Intel and IBM tests; see below).
>>> TIMEOUT: We'll merge all the work to the trunk and enable the
>>>          possibility of using sparse groups (dense will still be the
>>>          default, of course) if no one objects by COB Tuesday, 31
>>>          Jul 2007.
>>>
>>> =====================================================================
>>>
>>> The sparse groups work from U. Houston is ready to be brought into
>>> the trunk. It is built on the premise that for very large MPI jobs,
>>> you don't want to fully represent MPI groups in memory if you don't
>>> have to. Specifically, you can save memory for communicators/groups
>>> that are derived from MPI_COMM_WORLD by representing them in a
>>> sparse storage format.
>>>
>>> The sparse groups work introduces 3 new ompi_group_t storage formats:
>>>
>>> * dense (i.e., what it is today -- an array of ompi_proc_t pointers)
>>> * sparse, where the current group's contents are based on the group
>>>   from which the child was derived:
>>>   1. range: a series of (offset, length) tuples
>>>   2. stride: a single (first, stride, last) tuple
>>>   3. bitmap: a bitmap
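
To make this concrete, here is a rough sketch in C of how the three
sparse layouts described above might be stored. The struct and field
names are purely illustrative; they are not the actual ompi_group_t
definitions from the sparse-groups branch:

    /* Hypothetical layouts; not the branch's real data structures. */
    struct sparse_range_t {            /* 1. range                          */
        int  num_ranges;               /* number of (offset, length) tuples */
        int *offsets;                  /* offset of each range in parent    */
        int *lengths;                  /* length of each range              */
    };

    struct sparse_stride_t {           /* 2. stride                         */
        int first;                     /* first rank taken from the parent  */
        int stride;                    /* distance between selected ranks   */
        int last;                      /* last rank taken from the parent   */
    };

    struct sparse_bitmap_t {           /* 3. bitmap                         */
        int            num_bits;       /* one bit per rank in the parent    */
        unsigned char *bits;           /* bit i set => parent rank i member */
    };

    /* A sparse group would also keep a reference to the parent group it
       was derived from; a dense group keeps the usual array of
       ompi_proc_t pointers. */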
>>>
>>> Currently, all the sparse groups code must be enabled by configuring
>>> with --enable-sparse-groups. If sparse groups are enabled, each MPI
>>> group that is created will automatically use the storage format that
>>> takes the least amount of space.
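
The "least amount of space" choice could look something like the
following sketch; the enum and helper here are hypothetical, not the
selection code actually used in the branch:

    #include <stddef.h>

    /* Hypothetical format tags. */
    enum { GROUP_DENSE, GROUP_RANGE, GROUP_STRIDE, GROUP_BITMAP };

    /* Given the byte count each representation would need for a new
       group, return the cheapest one (sketch only). */
    static int pick_storage_format(size_t dense, size_t range,
                                   size_t stride, size_t bitmap)
    {
        size_t best = dense;
        int    fmt  = GROUP_DENSE;
        if (range  < best) { best = range;  fmt = GROUP_RANGE;  }
        if (stride < best) { best = stride; fmt = GROUP_STRIDE; }
        if (bitmap < best) { best = bitmap; fmt = GROUP_BITMAP; }
        return fmt;
    }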
>>>
>>> The Big Issue with the sparse groups is that getting a pointer to an
>>> ompi_proc_t may no longer be an O(1) operation -- you can't just
>>> access it via comm->group->procs[i]. Instead, you have to call a
>>> macro. If sparse groups are enabled, this will call a function to do
>>> the resolution and return the proc pointer. If sparse groups are not
>>> enabled, the macro currently resolves to group->procs[i].
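
In other words, today's compile-time switch presumably behaves roughly
like the macro below; the macro, configure guard, and function names
here are hypothetical, not the identifiers used in the branch:

    /* Hypothetical names for the existing compile-time behavior. */
    #if OMPI_ENABLE_SPARSE_GROUPS
    #define GROUP_GET_PROC(group, i)  group_proc_lookup((group), (i))
    #else
    #define GROUP_GET_PROC(group, i)  ((group)->procs[(i)])
    #endif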
>>>
>>> When sparse groups are enabled, looking up a proc pointer is an
>>> iterative process; you have to traverse up through one or more
>>> parent groups until you reach a "dense" group to get the pointer.
>>> So the time to look up the proc pointer (essentially) depends on the
>>> group and how many times it has been derived from a parent group
>>> (there are corner cases where the lookup time is shorter). Lookup
>>> times in MPI_COMM_WORLD are O(1) because it is dense, but it now
>>> requires an inline function call rather than directly accessing the
>>> data structure (see below).
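
A rough sketch of that traversal, reusing the hypothetical names from
the earlier sketches (again, not the branch's actual code):

    /* Walk up through sparse parents until a dense group is reached,
       translating the rank at each level (sketch only). */
    ompi_proc_t *sparse_group_lookup(ompi_group_t *group, int rank)
    {
        while (!group_is_dense(group)) {
            /* Map our rank to the parent's rank via the range, stride,
               or bitmap data -- hypothetical helper. */
            rank  = translate_rank_to_parent(group, rank);
            group = group->parent_group;   /* hypothetical field */
        }
        return group->procs[rank];         /* dense: direct O(1) access */
    }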
>>>
>>> Note that the code in /tmp/sparse-groups is currently out-of-date
>>> with respect to the orte and opal trees due to SVN merge mistakes
>>> and problems. Testing has occurred by copying full orte/opal
>>> branches from a trunk checkout into the sparse group tree, so we're
>>> confident that it's compatible with the trunk. Full integration will
>>> occur before committing to the trunk, of course.
>>>
>>> The proposal we have for the community is as follows:
>>>
>>> 1. Remove the --enable-sparse-groups configure option
>>> 2. Default to use only dense groups (i.e., same as today)
>>> 3. If the new MCA parameter "mpi_use_sparse_groups" is enabled,
>>>    enable the use of sparse groups
>>> 4. Eliminate the current macro used for group proc lookups and
>>>    instead use an inline function of the form:
>>>
>>>    static inline ompi_proc_t *lookup_group(ompi_group_t *group,
>>>                                            int index)
>>>    {
>>>        if (group_is_dense(group)) {
>>>            return group->procs[index];
>>>        } else {
>>>            return sparse_group_lookup(group, index);
>>>        }
>>>    }
>>>
>>> *** NOTE: This design adds a single "if" in some performance-critical
>>>     paths. If the group is sparse, it will add a function call and
>>>     the overhead to do the lookup. If the group is dense (which will
>>>     be the default), the proc will be returned directly from the
>>>     inline function.
>>>
>>> The rationale is that adding a single "if" (perhaps with
>>> OPAL_[UN]LIKELY?) in a few code paths will not be a big deal; a
>>> sketch of what that could look like follows item 5 below.
>>>
>>> 5. Bring all these changes into the OMPI trunk and therefore into
>>>    v1.3.
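
As referenced under item 4 above, a sketch of how the OPAL_[UN]LIKELY
hint might be applied to that inline function (OPAL_LIKELY wraps the
compiler's branch-prediction builtin where available; this is a sketch,
not committed code):

    /* Sketch only: bias the branch toward the dense (default) case. */
    static inline ompi_proc_t *lookup_group(ompi_group_t *group, int index)
    {
        if (OPAL_LIKELY(group_is_dense(group))) {
            return group->procs[index];              /* O(1) default path */
        } else {
            return sparse_group_lookup(group, index);
        }
    }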
>>>
>>> Comments?
>>>
>>> --
>>> Jeff Squyres
>>> Cisco Systems
>>>

-- 
Jeff Squyres
Cisco Systems