This is good work, so I am happy to see it come over.  My initial understanding was that
there would be compile-time protection for this.  In the absence of that, I think we need
to see performance data on a variety of communication substrates.  A latency
measurement is probably the most sensitive measurement, and should be sufficient to
show the impact on the critical path.

Rich


On 7/25/07 9:04 AM, "Jeff Squyres" <jsquyres@cisco.com> wrote:

WHAT:    Merge the sparse groups work to the trunk; get the community's
          opinion on one remaining issue.
WHY:     For large MPI jobs, it can be memory-prohibitive to fully
          represent dense groups; you can save a lot of space by having
          "sparse" representations of groups that are (for example)
          derived from MPI_COMM_WORLD.
WHERE:   Main changes are (might have missed a few in this analysis,
          but this is 99% of it):
          - Big changes in ompi/group
          - Moderate changes in ompi/comm
          - Trivial changes in ompi/mpi/c, ompi/mca/pml/[dr|ob1],
            ompi/mca/comm/sm
WHEN:    The code is ready now in /tmp/sparse-groups (it is passing
          all Intel and IBM tests; see below).
TIMEOUT: We'll merge all the work to the trunk and enable the
          possibility of using sparse groups (dense will still be the
          default, of course) if no one objects by COB Tuesday, 31 Aug
          2007.

===========================================================================

The sparse groups work from U. Houston is ready to be brought into the
trunk.  It is built on the premise that for very large MPI jobs, you
don't want to fully represent MPI groups in memory if you don't have
to.  Specifically, you can save memory for communicators/groups that
are derived from MPI_COMM_WORLD by representing them in a sparse
storage format.

The sparse groups work adds 3 new sparse ompi_group_t storage formats
alongside the existing dense format:

* dense (i.e., what it is today -- an array of ompi_proc_t pointers)
* sparse, where the group's contents are expressed relative to the
   parent group from which it was derived:
   1. range: a series of (offset,length) tuples
   2. stride: a single (first,stride,last) tuple
   3. bitmap: a bitmap
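
As a rough illustration of the three sparse formats listed above, the
following sketch maps a rank in a derived group to the corresponding
rank in its parent group.  All type and function names here are
hypothetical and are not taken from the sparse-groups tree; each
function returns -1 when the rank is not a member.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical (offset,length) tuple for the "range" format. */
typedef struct { int offset; int length; } range_pair_t;

/* range: walk the tuples until the rank falls inside one of them. */
static int range_to_parent_rank(const range_pair_t *ranges, int nranges,
                                int rank)
{
    assert(rank >= 0);
    for (int i = 0; i < nranges; i++) {
        if (rank < ranges[i].length) {
            return ranges[i].offset + rank;
        }
        rank -= ranges[i].length;
    }
    return -1;
}

/* stride: a single (first,stride,last) tuple, so the lookup is O(1). */
static int stride_to_parent_rank(int first, int stride, int last, int rank)
{
    assert(rank >= 0 && stride > 0);
    int parent = first + rank * stride;
    return parent <= last ? parent : -1;
}

/* bitmap: bit p is set if parent rank p is a member; the answer is the
 * position of the rank-th set bit (counting from zero). */
static int bitmap_to_parent_rank(const unsigned char *bitmap,
                                 int parent_size, int rank)
{
    assert(rank >= 0);
    for (int p = 0; p < parent_size; p++) {
        if (bitmap[p / CHAR_BIT] & (1u << (p % CHAR_BIT))) {
            if (rank == 0) {
                return p;
            }
            rank--;
        }
    }
    return -1;
}
```

Note that in this sketch only the stride mapping is constant time; the
range and bitmap mappings scan their representation.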

Currently, all the sparse groups code must be enabled by configuring
with --enable-sparse-groups.  If sparse groups are enabled, each MPI
group that is created will automatically use the storage format that
takes the least amount of space.
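
One way the "least amount of space" choice could be made is sketched
below.  This is only an assumption about the selection logic, not the
actual code: the function name, the enum, and the byte estimates (one
pointer per proc for dense, two ints per range tuple, three ints for a
stride, one bit per parent rank for a bitmap) are all hypothetical.

```c
#include <limits.h>
#include <stddef.h>

typedef enum { FMT_DENSE, FMT_RANGE, FMT_STRIDE, FMT_BITMAP } group_format_t;

/* ranks[] holds the parent ranks of the n members, in group order. */
static group_format_t pick_format(const int *ranks, int n, int parent_size)
{
    if (n <= 0) {
        return FMT_DENSE;
    }

    int runs = 1;          /* contiguous runs => (offset,length) tuples */
    int ascending = 1;     /* a bitmap cannot record a reordered group  */
    int const_stride = 1;  /* one (first,stride,last) tuple is enough   */
    for (int i = 1; i < n; i++) {
        int step = ranks[i] - ranks[i - 1];
        if (step != 1) runs++;
        if (step <= 0) ascending = 0;
        if (step <= 0 || step != ranks[1] - ranks[0]) const_stride = 0;
    }

    /* Start from dense and switch only when a format is strictly smaller. */
    group_format_t best = FMT_DENSE;
    size_t best_bytes = (size_t)n * sizeof(void *);

    size_t range_bytes = (size_t)runs * 2 * sizeof(int);
    if (range_bytes < best_bytes) {
        best = FMT_RANGE;
        best_bytes = range_bytes;
    }

    size_t stride_bytes = 3 * sizeof(int);
    if (const_stride && stride_bytes < best_bytes) {
        best = FMT_STRIDE;
        best_bytes = stride_bytes;
    }

    size_t bitmap_bytes = ((size_t)parent_size + CHAR_BIT - 1) / CHAR_BIT;
    if (ascending && bitmap_bytes < best_bytes) {
        best = FMT_BITMAP;
        best_bytes = bitmap_bytes;
    }
    return best;
}
```

In this sketch ties go to dense, so a group only becomes sparse when
that actually saves memory.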

The Big Issue with the sparse groups is that getting a pointer to an
ompi_proc_t may no longer be an O(1) operation -- you can't just
access it via comm->group->procs[i].  Instead, you have to call a
macro.  If sparse groups are enabled, this will call a function to do
the resolution and return the proc pointer.  If sparse groups are not
enabled, the macro currently resolves to group->procs[i].

When sparse groups are enabled, looking up a proc pointer is an
iterative process; you have to traverse up through one or more parent
groups until you reach a "dense" group to get the pointer.  So the
time to look up the proc pointer (essentially) depends on the group and
how many times it has been derived from a parent group (there are
corner cases where the lookup time is shorter).  Lookup times in
MPI_COMM_WORLD are O(1) because it is dense, but it now requires an
inline function call rather than directly accessing the data
structure (see below).
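
The parent traversal described above can be sketched roughly as
follows.  The structures are simplified stand-ins (proc_t for
ompi_proc_t, group_t for ompi_group_t), the field names are
hypothetical, and only the stride format is modeled to keep the
example short.

```c
#include <assert.h>
#include <stddef.h>

typedef struct proc { int vpid; } proc_t;    /* stand-in for ompi_proc_t */

typedef struct group {
    int is_dense;
    int size;
    proc_t **procs;              /* valid only for dense groups  */
    const struct group *parent;  /* valid only for sparse groups */
    int first, stride;           /* parent rank = first + rank * stride */
} group_t;

/* Translate the rank one level at a time until a dense group is reached.
 * The loop runs once per sparse ancestor, so the cost grows with the
 * derivation depth; a dense group skips the loop entirely. */
static proc_t *group_lookup_proc(const group_t *g, int rank)
{
    while (!g->is_dense) {
        assert(rank >= 0 && rank < g->size);
        rank = g->first + rank * g->stride;
        g = g->parent;
    }
    assert(rank >= 0 && rank < g->size);
    return g->procs[rank];
}
```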

Note that the code in /tmp/sparse-groups is currently out-of-date with
respect to the orte and opal trees due to SVN merge mistakes and
problems.  Testing has occurred by copying full orte/opal branches
from a trunk checkout into the sparse group tree, so we're confident
that it's compatible with the trunk.  Full integration will occur
before committing to the trunk, of course.

The proposal we have for the community is as follows:

1. Remove the --enable-sparse-groups configure option
2. Default to use only dense groups (i.e., same as today)
3. If the new MCA parameter "mpi_use_sparse_groups" is enabled, enable
    the use of sparse groups
4. Eliminate the current macro used for group proc lookups and instead
    use an inline function of the form:

    static inline ompi_proc_t *lookup_group(ompi_group_t *group,
                                            int index) {
        if (group_is_dense(group)) {
            return group->procs[index];
        } else {
            return sparse_group_lookup(group, index);
        }
    }

    *** NOTE: This design adds a single "if" in some
        performance-critical paths.  If the group is sparse, it will
        add a function call and the overhead to do the lookup.
        If the group is dense (which will be the default), the proc
        will be returned directly from the inline function.

    The rationale is that adding a single "if" (perhaps with
    OPAL_[UN]LIKELY?) in a few code paths will not be a big deal.

5. Bring all these changes into the OMPI trunk and therefore into
    v1.3.
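
For item 4, a self-contained sketch of the inline lookup with a branch
hint might look like the following.  The types are simplified
stand-ins, the sparse path models only a single offset into the
parent, and SKETCH_LIKELY is a local macro standing in for
OPAL_LIKELY; none of this is the actual proposed code.

```c
#include <stddef.h>

/* Local stand-in for OPAL_LIKELY: hint that the condition is usually true. */
#if defined(__GNUC__)
#define SKETCH_LIKELY(x) __builtin_expect(!!(x), 1)
#else
#define SKETCH_LIKELY(x) (x)
#endif

typedef struct proc { int vpid; } proc_t;    /* stand-in for ompi_proc_t */

typedef struct group {
    int is_dense;
    proc_t **procs;              /* valid only for dense groups  */
    const struct group *parent;  /* valid only for sparse groups */
    int offset;                  /* sparse: parent index = offset + index */
} group_t;

static inline proc_t *lookup_group(const group_t *group, int index);

/* Slow path: resolve the index in the parent group. */
static proc_t *sparse_group_lookup(const group_t *group, int index)
{
    return lookup_group(group->parent, group->offset + index);
}

/* Fast path: dense groups pay for one predictable branch and an
 * array load; sparse groups fall through to the function call. */
static inline proc_t *lookup_group(const group_t *group, int index)
{
    if (SKETCH_LIKELY(group->is_dense)) {
        return group->procs[index];
    }
    return sparse_group_lookup(group, index);
}
```

With the hint in place, the compiler can lay out the dense case as the
fall-through path, which is the case that matters when sparse groups
are left disabled.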

Comments?

--
Jeff Squyres
Cisco Systems
