On Wed, Apr 02, 2008 at 08:41:14PM -0400, Jeff Squyres wrote:
> >> that it's the same for all procs on all hosts. I guess there's a few
> >> cases:
> >> 1. homogeneous include/exclude, no carto: send all in node info; no
> >> proc info
> >> 2. homogeneous include/exclude, carto is used: send all ports in node
> >> info; send index in proc info for which node info port index it
> >> will use
> > This may actually increase modex size. Think about two procs using two
> > different hcas. We'll send all the data we send today + indexes.
> It'll increase it compared to the optimization that we're about to
> make. But it will certainly be a large decrease compared to what
> we're doing today
Maybe I don't understand something in what you propose, then. Currently,
when I run two procs on the same node and each proc uses a different HCA,
each one of them sends a message that describes the HCA in use by that
proc. The message is of the form <mtu, subnet, lid, apm_lid, cpc>.
Each proc sends one of those, so there are two messages total on the wire.
You propose that one of them should send a description of both
available ports (that is, one of them sends two messages of the form
above) and then each proc sends an additional message with the index of the
HCA that it is going to use. That is more data on the wire after the
proposed optimization than we have now.
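To make the comparison concrete, here is a rough sketch of the byte counts for both schemes. The field widths, the single-byte cpc field, and the one-byte port index are assumptions for illustration only; they are not the actual openib modex layout.

```python
import struct

# Hypothetical wire layout for one port description <mtu, subnet, lid, apm_lid, cpc>.
# The widths below are assumed, not taken from the openib BTL.
PORT_FMT = "!IQHHB"   # mtu u32, subnet id u64, lid u16, apm_lid u16, cpc u8
INDEX_FMT = "!B"      # assumed per-proc index into the node's port list

PORT_SIZE = struct.calcsize(PORT_FMT)
INDEX_SIZE = struct.calcsize(INDEX_FMT)


def current_bytes(nprocs):
    """Current scheme: each proc publishes only the port it uses."""
    return nprocs * PORT_SIZE


def proposed_bytes(nports, nprocs):
    """Proposed scheme: all ports once per node, plus one index per proc."""
    return nports * PORT_SIZE + nprocs * INDEX_SIZE


if __name__ == "__main__":
    # Two procs, two HCAs: the case described above.
    print("2 procs, 2 ports:", current_bytes(2), "vs", proposed_bytes(2, 2))
    # Many procs sharing the same two ports.
    print("8 procs, 2 ports:", current_bytes(8), "vs", proposed_bytes(2, 8))
```

Under these assumptions the proposed scheme is slightly larger when every proc uses its own HCA, and smaller once several procs share the same ports.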
> (see the spreadsheet that I sent last week).
I've looked at it but I could not decipher it :( I don't understand
where all these numbers come from.
> Indeed, we can even put in the optimization that if there's only one
> process on a host, it can only publish the ports that it will use (and
> therefore there's no need for the proc data).
More special cases :(
> >> 3. heterogeneous include/exclude, no carto: need user to tell us that
> >> this situation exists (e.g., use another MCA param), but then is same
> >> as #2
> >> 4. heterogeneous include/exclude, carto is used, same as #3
> >> Right?
> > Looks like it. FWIW I don't like the idea to code all those special
> > cases. The way it works now I can be pretty sure that any crazy setup
> > I'll come up with will work.
> And so it will with the new scheme. The only place it won't work is
> if the user specifies a heterogeneous include/exclude (i.e., we'll
> require that the user tells us when they do that), which nobody does.
> I guess I don't see the problem...?
I like things to be simple. KISS principle I guess. And I do care about
heterogeneous include/exclude too.
BTW I looked at how we do modex now on the trunk. For the OOB case more
than half the data we send for each proc is garbage.
> > By the way how much data are moved during modex stage? What if modex
> > will use compression?
> The spreadsheet I listed was just the openib part of the modex, and it
> was fairly hefty. I have no idea how well (or not) it would compress.
I looked at what kind of data we send during the openib modex and created a
file with 10000 openib modex messages. The mtu, subnet id and cpc list were
the same in each message but lid/apm_lid were different; this is a
pretty close approximation of the data that is sent from the HN to each
process. The uncompressed file size is 489K; the compressed file size is 43K.
More than 10 times smaller.
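Something along these lines reproduces the idea of the experiment. It is only a sketch: the message layout, the constant values, and the use of zlib are assumptions, so the sizes it prints will not match the 489K/43K figures above.

```python
import random
import struct
import zlib

# Hypothetical layout for one message <mtu, subnet, lid, apm_lid, cpc>;
# the real openib modex message is larger and laid out differently.
MSG_FMT = "!IQHHB"


def build_modex_blob(count=10000, seed=0):
    """Concatenate `count` messages that differ only in lid/apm_lid."""
    rng = random.Random(seed)
    mtu, subnet, cpc = 2048, 0xFE80000000000000, 1  # assumed constants
    parts = []
    for _ in range(count):
        lid = rng.randrange(1, 0xC000)
        # apm_lid is taken as lid + 1 purely for illustration.
        parts.append(struct.pack(MSG_FMT, mtu, subnet, lid, lid + 1, cpc))
    return b"".join(parts)


raw = build_modex_blob()
packed = zlib.compress(raw)

if __name__ == "__main__":
    print("uncompressed:", len(raw), "bytes")
    print("compressed:  ", len(packed), "bytes")
```

The repeated mtu/subnet/cpc fields are what the compressor removes; the lid/apm_lid values are the part that stays expensive.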