Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] RFC: changes to modex
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-04-03 11:16:21

On Apr 3, 2008, at 8:52 AM, Gleb Natapov wrote:
>> It'll increase it compared to the optimization that we're about to
>> make. But it will certainly be a large decrease compared to what
>> we're doing today.
> Maybe I don't understand something in what you propose, then.
> Currently, when I run two procs on the same node and each proc uses a
> different HCA, each one of them sends a message that describes the HCA
> in use by the proc. The message is of the form <mtu, subnet, lid,
> apm_lid, cpc>. Each proc sends one of those, so there are two
> messages total on the wire.
> You propose that one of them should send a description of both
> available ports (that is, one of them sends two messages of the form
> above) and then each proc sends an additional message with the index
> of the HCA that it is going to use. And that is more data on the wire
> after the proposed optimization than we have now.
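For concreteness, the per-proc tuple Gleb describes could be packed roughly like this. The field widths here are my assumptions for illustration, not the actual openib BTL layout, and the single `cpc` index simplifies what is really a list:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Hypothetical widths for the <mtu, subnet, lid, apm_lid, cpc> tuple;
 * the real openib BTL layout may differ. */
typedef struct {
    uint32_t mtu;       /* active MTU of the port */
    uint64_t subnet;    /* subnet prefix */
    uint16_t lid;       /* local identifier */
    uint16_t apm_lid;   /* alternate-path LID for APM */
    uint8_t  cpc;       /* connect pseudo-component, simplified to an index */
} port_modex_t;

/* Pack the tuple field by field (no struct padding on the wire). */
static size_t pack_port_modex(uint8_t *buf, const port_modex_t *p)
{
    size_t off = 0;
    memcpy(buf + off, &p->mtu, sizeof(p->mtu));         off += sizeof(p->mtu);
    memcpy(buf + off, &p->subnet, sizeof(p->subnet));   off += sizeof(p->subnet);
    memcpy(buf + off, &p->lid, sizeof(p->lid));         off += sizeof(p->lid);
    memcpy(buf + off, &p->apm_lid, sizeof(p->apm_lid)); off += sizeof(p->apm_lid);
    memcpy(buf + off, &p->cpc, sizeof(p->cpc));         off += sizeof(p->cpc);
    return off;  /* 17 bytes with the widths assumed above */
}
```

With two procs on a node each sending one such tuple, that is two small payloads per node, which is the baseline the proposal is being compared against.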

I guess what I'm trying to address is optimizing the common case.
What I perceive the common case to be is:

- high PPN values (4, 8, 16, ...)
- PPN is larger than the number of verbs-capable ports
- homogeneous openfabrics network

Yes, you will definitely find other cases. But I'd guess that this
is, by far, the most common case (especially at scale). I don't want
to penalize the common case for the sake of some one-off installations.

I'm basing this optimization on the assumption that PPN will be
larger than the number of available ports, so there is guaranteed to
be duplication in the modex message. Removing that duplication is the
main goal of this optimization.

>> (see the spreadsheet that I sent last week).
> I've looked at it but I could not decipher it :( I don't understand
> where all these numbers come from.

Why didn't you ask? :-)

The size of the openib modex is explained in btl_openib_component.c in
the branch. It's a packed message now; we don't just blindly copy an
entire struct. Here's the comment:

     /* The message is packed into multiple parts:
      * 1. a uint8_t indicating the number of modules (ports) in the
      *    message
      * 2. for each module:
      *    a. the common module data
      *    b. a uint8_t indicating how many CPCs follow
      *    c. for each CPC:
      *       i. a uint8_t indicating the index of the CPC in the all[]
      *          array in btl_openib_connect_base.c
      *       ii. a uint8_t indicating the priority of this CPC
      *       iii. a uint8_t indicating the length of the blob to follow
      *       iv. a blob that is only meaningful to that CPC
      */

The common module data is what I sent in the other message.
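As a rough sketch of that packing scheme (the names, `COMMON_LEN`, and the blob contents are illustrative placeholders, not the actual code in btl_openib_component.c):

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define COMMON_LEN 17  /* assumed size of the common per-module data */

/* Illustrative per-CPC entry matching steps 2c.i-iv of the comment. */
typedef struct {
    uint8_t cpc_index;    /* index into the all[] CPC array */
    uint8_t priority;     /* priority of this CPC */
    uint8_t blob_len;     /* length of the CPC-private blob */
    const uint8_t *blob;  /* opaque to everyone but that CPC */
} cpc_entry_t;

static size_t pack_module(uint8_t *buf, const uint8_t *common,
                          const cpc_entry_t *cpcs, uint8_t ncpcs)
{
    size_t off = 0;
    memcpy(buf + off, common, COMMON_LEN); off += COMMON_LEN;  /* 2a */
    buf[off++] = ncpcs;                                        /* 2b */
    for (uint8_t i = 0; i < ncpcs; ++i) {                      /* 2c */
        buf[off++] = cpcs[i].cpc_index;
        buf[off++] = cpcs[i].priority;
        buf[off++] = cpcs[i].blob_len;
        memcpy(buf + off, cpcs[i].blob, cpcs[i].blob_len);
        off += cpcs[i].blob_len;
    }
    return off;
}

static size_t pack_modex(uint8_t *buf, uint8_t nmodules,
                         const uint8_t *common, const cpc_entry_t *cpcs,
                         uint8_t ncpcs)
{
    size_t off = 0;
    buf[off++] = nmodules;                  /* 1: module (port) count */
    for (uint8_t m = 0; m < nmodules; ++m)  /* same sample data per module */
        off += pack_module(buf + off, common, cpcs, ncpcs);
    return off;
}
```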

>> I guess I don't see the problem...?
> I like things to be simple. KISS principle I guess.

I can see your point that this is getting fairly complicated. :-\
See below.

> And I do care about
> heterogeneous include/exclude too.

How much? I still think we can support it just fine; I just want to
make [what I perceive to be] the common case better.

> I looked at what kind of data we send during the openib modex and I
> created a file with 10000 openib modex messages. The mtu, subnet id,
> and cpc list were the same in each message, but lid/apm_lid were
> different; this is a pretty close approximation of the data that is
> sent from the HN to each process. The uncompressed file size is 489K;
> the compressed file size is 43K.
> More than 10 times smaller.

Was this the full modex message, or just the openib part?

Those are promising sizes (43k), though; how long does it take to
compress/uncompress this data in memory? That also must be factored
into the overall time.

How about a revised and combined proposal:

- openib: Use a simplified "send all ACTIVE ports" per-host message
(i.e., before include/exclude and carto is applied)
- openib: Send a small bitmap for each proc indicating which ports
each btl module will use
- modex: Compress the result (probably only if it's larger than some
threshold size?) when sending, decompress upon receive

This keeps it simple -- no special cases for heterogeneous include/
exclude, etc. And if compression is cheap (can you do some
experiments to find out?), perhaps we can link against libz (I see the
libz in at least RHEL4 is BSD licensed, so there's no issue there) and
de/compress in memory.

Jeff Squyres
Cisco Systems