
Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: changes to modex
From: Ralph H Castain (rhc_at_[hidden])
Date: 2008-04-03 09:05:28


Hmmmm...since I have no control over, nor involvement in, what gets sent,
perhaps I can be a disinterested third party. ;-)

Could you perhaps explain this comment:

> BTW I looked at how we do modex now on the trunk. For the OOB case, more
> than half the data we send for each proc is garbage.

What "garbage" are you referring to? I am working to remove the stuff
inserted by proc.c - mostly hostname, hopefully arch, etc. If you are
running a "debug" version, there will also be type descriptors for each
entry, but those are eliminated for optimized builds.
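
As a purely hypothetical sketch of that debug/optimized difference (it
mirrors the idea of a per-entry type descriptor, not the actual OMPI
packing code), build with and without -DNDEBUG to compare the two sizes:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

enum { TYPE_UINT16 = 1 };   /* hypothetical per-entry type tag */

static size_t pack_uint16(unsigned char *buf, uint16_t v)
{
    size_t n = 0;
#ifndef NDEBUG
    buf[n++] = TYPE_UINT16;  /* debug builds: type descriptor per entry */
#endif
    memcpy(buf + n, &v, sizeof v);  /* both builds: the raw value */
    return n + sizeof v;
}

int main(void)
{
    unsigned char buf[8];
    printf("one uint16 entry packs to %zu bytes\n", pack_uint16(buf, 42));
    return 0;
}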

So are you referring to other things?

Thanks
Ralph

On 4/3/08 6:52 AM, "Gleb Natapov" <glebn_at_[hidden]> wrote:

> On Wed, Apr 02, 2008 at 08:41:14PM -0400, Jeff Squyres wrote:
>>>> that it's the same for all procs on all hosts. I guess there's a few
>>>> cases:
>>>>
>>>> 1. homogeneous include/exclude, no carto: send all in node info; no
>>>> proc info
>>>> 2. homogeneous include/exclude, carto is used: send all ports in node
>>>> info; each proc sends in proc info the index of the node info port
>>>> it will use
>>> This may actually increase modex size. Think about two procs using two
>>> different HCAs. We'll send all the data we send today + indexes.
>>
>> It'll increase it compared to the optimization that we're about to
>> make. But it will certainly be a large decrease compared to what
>> we're doing today
>
> Maybe I don't understand something in what you propose, then. Currently,
> when I run two procs on the same node and each proc uses a different
> HCA, each one of them sends a message that describes the HCA in use by
> that proc. The message is of the form <mtu, subnet, lid, apm_lid, cpc>.
> Each proc sends one of those, so there are two messages total on the
> wire. You propose that one of them should send descriptions of both
> available ports (that is, one of them sends two messages of the form
> above) and then each proc sends an additional message with the index of
> the HCA that it is going to use. That is more data on the wire after
> the proposed optimization than we have now.
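
As a minimal sketch of the counting argument above, assuming a port entry
shaped like the <mtu, subnet, lid, apm_lid, cpc> tuple (the field names
and widths are illustrative, not the actual trunk layout):

#include <stdio.h>
#include <stdint.h>

struct port_entry {          /* one openib port description */
    uint32_t mtu;
    uint64_t subnet;
    uint16_t lid;
    uint16_t apm_lid;
    char     cpc[8];         /* connect pseudo-component name */
};

int main(void)
{
    size_t entry = sizeof(struct port_entry);
    size_t index = 1;  /* one byte suffices to index a node's ports */

    /* two procs on one node, each using a different HCA */
    size_t today    = 2 * entry;              /* each proc publishes its port */
    size_t proposed = 2 * entry + 2 * index;  /* both ports in node info, plus
                                                 a per-proc index in proc info */
    printf("today: %zu bytes, proposed: %zu bytes\n", today, proposed);
    return 0;
}

Whenever every proc on a node uses a distinct port, the proposed scheme
carries everything sent today plus the per-proc indexes.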
>
>
>> (see the spreadsheet that I sent last week).
> I've looked at it, but I could not decipher it :( I don't understand
> where all these numbers come from.
>
>>
>> Indeed, we can even put in the optimization that if there's only one
>> process on a host, it can publish only the ports that it will use (and
>> therefore there's no need for the proc data).
> More special cases :(
>
>>
>>>> 3. heterogeneous include/exclude, no carto: need the user to tell us
>>>> that this situation exists (e.g., via another MCA param), but then it
>>>> is the same as #2
>>>> 4. heterogeneous include/exclude, carto is used: same as #3
>>>>
>>>> Right?
>>>>
>>> Looks like it. FWIW, I don't like the idea of coding all those special
>>> cases. The way it works now, I can be pretty sure that any crazy setup
>>> I come up with will work.
>>
>> And so it will with the new scheme. The only place it won't work is
>> if the user specifies a heterogeneous include/exclude (i.e., we'll
>> require that the user tells us when they do that), which nobody does.
>>
>> I guess I don't see the problem...?
> I like things to be simple. The KISS principle, I guess. And I do care
> about heterogeneous include/exclude too.
>
> BTW I looked at how we do modex now on the trunk. For the OOB case, more
> than half the data we send for each proc is garbage.
>
>>
>>> By the way, how much data is moved during the modex stage? What if
>>> modex used compression?
>>
>>
>> The spreadsheet I listed was just the openib part of the modex, and it
>> was fairly hefty. I have no idea how well (or not) it would compress.
>>
> I looked at what kind of data we send during the openib modex, and I
> created a file with 10000 openib modex messages. The mtu, subnet id,
> and cpc list were the same in each message, but lid/apm_lid were
> different; this is a pretty close approximation of the data that is
> sent from the HN to each process. The uncompressed file size is 489K;
> the compressed file size is 43K. More than 10 times smaller.
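
A rough, self-contained way to reproduce a measurement like this with
zlib (link with -lz); the message layout below is an assumption that
mirrors the <mtu, subnet, lid, apm_lid, cpc> tuple, not the actual wire
format:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <zlib.h>

#define NMSG 10000

int main(void)
{
    /* assumed layout: mtu(4) subnet(8) lid(2) apm_lid(2) cpc list(32) */
    size_t msg_len = 48;
    size_t raw_len = NMSG * msg_len;
    unsigned char *raw = calloc(NMSG, msg_len);

    for (int i = 0; i < NMSG; i++) {
        unsigned char *m = raw + (size_t)i * msg_len;
        uint32_t mtu = 2048;                      /* same in every message */
        uint64_t subnet = 0xfe80000000000000ULL;  /* same in every message */
        uint16_t lid = (uint16_t)(i + 1);         /* varies per message */
        uint16_t apm_lid = (uint16_t)(i + 2);     /* varies per message */
        memcpy(m,      &mtu,     4);
        memcpy(m + 4,  &subnet,  8);
        memcpy(m + 12, &lid,     2);
        memcpy(m + 14, &apm_lid, 2);
        strcpy((char *)(m + 16), "oob");          /* same cpc list throughout */
    }

    uLongf zlen = compressBound(raw_len);
    unsigned char *z = malloc(zlen);
    if (compress(z, &zlen, raw, raw_len) != Z_OK) {
        fprintf(stderr, "compress failed\n");
        return 1;
    }
    printf("raw: %zu bytes, compressed: %lu bytes (%.1fx smaller)\n",
           raw_len, (unsigned long)zlen, (double)raw_len / (double)zlen);
    free(raw);
    free(z);
    return 0;
}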
>
> --
> Gleb.
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel