Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] PML selection logic
From: Ralph H Castain (rhc_at_[hidden])
Date: 2008-06-26 09:26:29


Just to complete this thread...

Brian raised a very good point, so we identified it on the weekly telecon as
a subject that really should be discussed at next week's technical meeting.
I think we can find a reasonable answer, but there are several ways it can
be done. So rather than doing our usual piecemeal approach to the solution,
it makes sense to begin talking about a more holistic design for
accommodating both needs.

Thanks Brian for pointing out the bigger picture.
Ralph

On 6/24/08 8:22 AM, "Brian W. Barrett" <brbarret_at_[hidden]> wrote:

> yeah, that could be a problem, but it's such a minority case and we've got
> to draw the line somewhere.
>
> Of course, it seems like this is a never ending battle between two
> opposing forces... The desire to do the "right thing" all the time at
> small and medium scale and the desire to scale out to the "big thing".
> It seems like in the quest to kill off the modex, we've run into these
> pretty often.
>
> The modex doesn't hurt us at small scale (indeed, we're probably ok with
> the routed communication pattern up to 512 nodes or so if we don't do
> anything stupid, maybe further). Is it time to admit defeat in this
> argument and have a configure option that turns off the modex (at the cost
> of some of these correctness checks) for the large machines, but keeps
> things simple for the common case? I'm sure there are other things where
> this will come up, so perhaps a --enable-large-scale? Maybe it's a dumb
> idea, but it seems like we've made a lot of compromises lately around
> this, where no one ends up really happy with the solution :/.
>
> Brian
>
>
> On Tue, 24 Jun 2008, George Bosilca wrote:
>
>> Brian hinted at a possible bug in one of his replies. How does this work in
>> the case of dynamic processes? We can envision several scenarios, but let's
>> take a simple one: 2 jobs that get connected with connect/accept. One might
>> publish the PML name (simply because the -mca argument was on) and one might
>> not?
>>
>> george.
>>
>> On Jun 24, 2008, at 8:28 AM, Jeff Squyres wrote:
>>
>>> Also sounds good to me.
>>>
>>> Note that the most difficult part of the forward-looking plan is that we
>>> usually can't tell the difference between "something failed to initialize"
>>> and "you don't have support for feature X".
>>>
>>> I like the general philosophy of: running out of the box always works just
>>> fine, but if you/the sysadmin is smart, you can get performance
>>> improvements.
>>>
>>>
>>> On Jun 23, 2008, at 4:18 PM, Shipman, Galen M. wrote:
>>>
>>>> I concur
>>>> - galen
>>>>
>>>> On Jun 23, 2008, at 3:44 PM, Brian W. Barrett wrote:
>>>>
>>>>> That sounds like a reasonable plan to me.
>>>>>
>>>>> Brian
>>>>>
>>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>>>
>>>>>> Okay, so let's explore an alternative that preserves the support you are
>>>>>> seeking for the "ignorant user", but doesn't penalize everyone else.
>>>>>> What we could do is simply set things up so that:
>>>>>>
>>>>>> 1. if -mca pml xyz is provided, then no modex data is added
>>>>>>
>>>>>> 2. if it is not provided, then only rank=0 inserts the data. All other
>>>>>> procs simply check their own selection against the one given by rank=0
>>>>>>
>>>>>> Now, if a knowledgeable user or sys admin specifies what to use for
>>>>>> their system, we won't penalize their startup time. A user who doesn't
>>>>>> know what to do gets to run, albeit less scalably on startup.
>>>>>>
>>>>>> Looking forward from there, we can look to a day where failing to
>>>>>> initialize
>>>>>> something that exists on the system could be detected in some other
>>>>>> fashion,
>>>>>> letting the local proc abort since it would know that other procs that
>>>>>> detected similar capabilities may well have selected that PML. For now,
>>>>>> though, this would solve the problem.
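
[Editor's sketch, not part of the original message.] The two-case proposal above can be modelled in a few lines of Python. This is only a toy simulation, not Open MPI code: the function name `rank0_pml_check`, the `user_specified_pml` argument, and the dict standing in for the modex are all assumptions made for illustration.

```python
def rank0_pml_check(selections, user_specified_pml=None):
    """Toy model of the proposed scheme.

    `selections` maps rank -> name of the PML that rank selected.
    Returns (modex_entries, ranks_that_abort).
    """
    if user_specified_pml is not None:
        # Case 1: the user named the PML, so no modex data is added
        # and there is nothing for the other procs to compare against.
        return {}, []

    # Case 2: only rank 0 inserts its selection into the (simulated) modex.
    modex = {0: selections[0]}
    reference = modex[0]

    # Every proc checks its own selection against the one given by rank 0.
    aborting = [rank for rank in sorted(selections)
                if selections[rank] != reference]
    return modex, aborting


if __name__ == "__main__":
    # Rank 2 failed to initialize its MTL and fell back to ob1.
    print(rank0_pml_check({0: "cm", 1: "cm", 2: "ob1"}))
    # The user passed the PML explicitly: nothing is published.
    print(rank0_pml_check({0: "cm", 1: "cm"}, user_specified_pml="cm"))
```

In this model the modex carries at most one entry regardless of job size, which is the startup saving the proposal is after.
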
>>>>>>
>>>>>> Make sense?
>>>>>> Ralph
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 6/23/08 1:31 PM, "Brian W. Barrett" <brbarret_at_[hidden]> wrote:
>>>>>>
>>>>>>> The problem is that we default to OB1, but that's not the right choice
>>>>>>> for some platforms (like Pathscale / PSM), where there's a huge
>>>>>>> performance hit for using OB1. So we run into a situation where a user
>>>>>>> installs Open MPI, starts running, gets horrible performance,
>>>>>>> bad-mouths Open MPI, and now we're in that game again. Yeah, the sys
>>>>>>> admin should know what to do, but it doesn't always work that way.
>>>>>>>
>>>>>>> Brian
>>>>>>>
>>>>>>>
>>>>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>>>>>
>>>>>>>> My fault - I should be more precise in my language. ;-/
>>>>>>>>
>>>>>>>> #1 is not adequate, IMHO, as it forces us to -always- do a modex. It
>>>>>>>> seems
>>>>>>>> to me that a simpler solution to what you describe is for the user to
>>>>>>>> specify -mca pml ob1, or -mca pml cm. If the latter, then you could
>>>>>>>> deal
>>>>>>>> with the failed-to-initialize problem cleanly by having the proc
>>>>>>>> directly
>>>>>>>> abort.
>>>>>>>>
>>>>>>>> Again, sometimes I think we attempt to automate too many things. This
>>>>>>>> seems
>>>>>>>> like a pretty clear case where you know what you want - the sys admin,
>>>>>>>> if
>>>>>>>> nobody else, can certainly set that mca param in the default param
>>>>>>>> file!
>>>>>>>>
>>>>>>>> Otherwise, it seems to me that you are relying on the modex to detect
>>>>>>>> that
>>>>>>>> your proc failed to init the correct subsystem. I hate to force a
>>>>>>>> modex just
>>>>>>>> for that - if so, then perhaps this could again be a settable option
>>>>>>>> to
>>>>>>>> avoid requiring non-scalable behavior for those of us who want
>>>>>>>> scalability?
>>>>>>>>
>>>>>>>>
>>>>>>>> On 6/23/08 1:21 PM, "Brian W. Barrett" <brbarret_at_[hidden]> wrote:
>>>>>>>>
>>>>>>>>> The selection code was added because frequently high speed
>>>>>>>>> interconnects fail to initialize properly due to random stuff
>>>>>>>>> happening (yes, that's a horrible statement, but true). We ran into
>>>>>>>>> a situation with some really flaky machines where most of the
>>>>>>>>> processes would choose CM, but a couple would fail to initialize the
>>>>>>>>> MTL and therefore choose OB1. This led to a hang situation, which is
>>>>>>>>> the worst of the worst.
>>>>>>>>>
>>>>>>>>> I think #1 is adequate, although it doesn't handle spawn particularly
>>>>>>>>> well. And spawn is generally used in environments where such network
>>>>>>>>> mismatches are most likely to occur.
>>>>>>>>>
>>>>>>>>> Brian
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>>>>>>>
>>>>>>>>>> Since my goal is to eliminate the modex completely for managed
>>>>>>>>>> installations, could you give me a brief understanding of this
>>>>>>>>>> eventual PML
>>>>>>>>>> selection logic? It would help to hear an example of how and why
>>>>>>>>>> different
>>>>>>>>>> procs could get different answers - and why we would want to allow
>>>>>>>>>> them to
>>>>>>>>>> do so.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Ralph
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 6/23/08 11:59 AM, "Aurélien Bouteiller" <bouteill_at_[hidden]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> The first approach sounds fair enough to me. We should avoid 2 and 3,
>>>>>>>>>>> as the PML selection mechanism used to be more complex before we
>>>>>>>>>>> reduced it to accommodate a major design bug in the BTL selection
>>>>>>>>>>> process. When using the complete PML selection, the BTLs would be
>>>>>>>>>>> initialized several times, leading to a variety of bugs. Eventually
>>>>>>>>>>> the PML selection should return to its old self, when the BTL bug
>>>>>>>>>>> gets fixed.
>>>>>>>>>>>
>>>>>>>>>>> Aurelien
>>>>>>>>>>>
>>>>>>>>>>> On 23 June 2008, at 12:36, Ralph H Castain wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yo all
>>>>>>>>>>>>
>>>>>>>>>>>> I've been doing further research into the modex and came across
>>>>>>>>>>>> something I
>>>>>>>>>>>> don't fully understand. It seems we have each process insert into
>>>>>>>>>>>> the modex
>>>>>>>>>>>> the name of the PML module that it selected. Once the modex has
>>>>>>>>>>>> exchanged
>>>>>>>>>>>> that info, it then loops across all procs in the job to check
>>>>>>>>>>>> their
>>>>>>>>>>>> selection, and aborts if any proc picked a different PML module.
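
[Editor's sketch, not part of the original message.] For readers who want the check described above in concrete form, here is a minimal Python simulation. It is not the Open MPI implementation: the function name `all_to_all_pml_check` and the dict used in place of the modex are hypothetical.

```python
def all_to_all_pml_check(selections):
    """Toy model of the existing check.

    `selections` maps rank -> name of the PML that rank selected.
    Returns (number_of_modex_entries, ranks_that_abort).
    """
    # Every process inserts the name of its selected PML into the modex.
    modex = dict(selections)

    aborting = []
    for rank, mine in sorted(selections.items()):
        # After the exchange, each proc loops across all procs in the job
        # and aborts if any of them picked a different PML.
        if any(name != mine for name in modex.values()):
            aborting.append(rank)
    return len(modex), aborting


if __name__ == "__main__":
    print(all_to_all_pml_check({0: "cm", 1: "cm", 2: "ob1"}))
```

Note that the number of published entries grows with the job size, which is the cost this thread is trying to avoid.
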
>>>>>>>>>>>>
>>>>>>>>>>>> All well and good...assuming that procs actually -can- choose
>>>>>>>>>>>> different PML modules and hence create an "abort" scenario. However,
>>>>>>>>>>>> if I look inside the PMLs at their selection logic, I find that a
>>>>>>>>>>>> proc can ONLY pick a module other than ob1 if:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. the user specifies the module to use via -mca pml xyz or by
>>>>>>>>>>>> using a
>>>>>>>>>>>> module specific mca param to adjust its priority. In this case,
>>>>>>>>>>>> since the
>>>>>>>>>>>> mca param is propagated, ALL procs have no choice but to pick that
>>>>>>>>>>>> same
>>>>>>>>>>>> module, so that can't cause us to abort (we will have already
>>>>>>>>>>>> returned an
>>>>>>>>>>>> error and aborted if the specified module can't run).
>>>>>>>>>>>>
>>>>>>>>>>>> 2. the pml/cm module detects that an MTL module was selected, and
>>>>>>>>>>>> that it is
>>>>>>>>>>>> other than "psm". In this case, the CM module will be selected
>>>>>>>>>>>> because its
>>>>>>>>>>>> default priority is higher than that of OB1.
>>>>>>>>>>>>
>>>>>>>>>>>> In looking deeper into the MTL selection logic, it appears to me
>>>>>>>>>>>> that you
>>>>>>>>>>>> either have the required capability or you don't. I can see that
>>>>>>>>>>>> in
>>>>>>>>>>>> some
>>>>>>>>>>>> environments (e.g., rsh across unmanaged collections of machines),
>>>>>>>>>>>> it might
>>>>>>>>>>>> be possible for someone to launch across a set of machines where
>>>>>>>>>>>> some do and
>>>>>>>>>>>> some don't have the required support. However, in all other cases,
>>>>>>>>>>>> this will
>>>>>>>>>>>> be homogeneous across the system.
>>>>>>>>>>>>
>>>>>>>>>>>> Given this analysis (and someone more familiar with the PML should
>>>>>>>>>>>> feel free
>>>>>>>>>>>> to confirm or correct it), it seems to me that this could be
>>>>>>>>>>>> streamlined via
>>>>>>>>>>>> one or more means:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. at the most, we could have rank=0 add the PML module name to
>>>>>>>>>>>> the
>>>>>>>>>>>> modex,
>>>>>>>>>>>> and other procs simply check it against their own and return an
>>>>>>>>>>>> error if
>>>>>>>>>>>> they differ. This accomplishes the identical functionality to what
>>>>>>>>>>>> we have
>>>>>>>>>>>> today, but with much less info in the modex.
>>>>>>>>>>>>
>>>>>>>>>>>> 2. we could eliminate this info from the modex altogether by
>>>>>>>>>>>> requiring the
>>>>>>>>>>>> user to specify the PML module if they want something other than
>>>>>>>>>>>> the
>>>>>>>>>>>> default
>>>>>>>>>>>> OB1. In this case, there can be no confusion over what each proc
>>>>>>>>>>>> is
>>>>>>>>>>>> to use.
>>>>>>>>>>>> The CM module will attempt to init the MTL - if it cannot do so,
>>>>>>>>>>>> then the
>>>>>>>>>>>> job will return the correct error and tell the user that CM/MTL
>>>>>>>>>>>> support is
>>>>>>>>>>>> unavailable.
>>>>>>>>>>>>
>>>>>>>>>>>> 3. we could again eliminate the info by not inserting it into the
>>>>>>>>>>>> modex if
>>>>>>>>>>>> (a) the default PML module is selected, or (b) the user specified
>>>>>>>>>>>> the PML
>>>>>>>>>>>> module to be used. In the first case, each proc can simply check
>>>>>>>>>>>> to
>>>>>>>>>>>> see if
>>>>>>>>>>>> they picked the default - if not, then we can insert the info to
>>>>>>>>>>>> indicate
>>>>>>>>>>>> the difference. Thus, in the "standard" case, no info will be
>>>>>>>>>>>> inserted.
>>>>>>>>>>>>
>>>>>>>>>>>> In the second case, we will already get an error if the specified
>>>>>>>>>>>> PML module
>>>>>>>>>>>> could not be used. Hence, the modex check provides no additional
>>>>>>>>>>>> info or
>>>>>>>>>>>> value.
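
[Editor's sketch, not part of the original message.] Option 3 above amounts to a rule for deciding which procs publish anything at all. A minimal Python model of that rule follows; `DEFAULT_PML`, `entries_to_publish`, and the `user_specified` flag are names invented here for illustration and do not correspond to Open MPI symbols.

```python
DEFAULT_PML = "ob1"  # the default module named in the message


def entries_to_publish(selections, user_specified=False):
    """Toy model of option 3.

    `selections` maps rank -> name of the PML that rank selected.
    Returns the modex entries that would be inserted.
    """
    if user_specified:
        # (b) The user named the PML: a proc that cannot use it has
        # already errored out, so the modex check adds nothing.
        return {}

    # (a) Procs that picked the default stay silent; only a proc that
    # picked something else inserts its choice to flag the difference.
    return {rank: name for rank, name in sorted(selections.items())
            if name != DEFAULT_PML}


if __name__ == "__main__":
    print(entries_to_publish({0: "ob1", 1: "ob1"}))
    print(entries_to_publish({0: "ob1", 1: "cm"}))
```

In the "standard" case where every proc picks the default, the returned dict is empty, matching the claim that no info is inserted.
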
>>>>>>>>>>>>
>>>>>>>>>>>> I understand the motivation to support automation. However, in
>>>>>>>>>>>> this
>>>>>>>>>>>> case,
>>>>>>>>>>>> the automation actually doesn't seem to buy us very much, and it
>>>>>>>>>>>> isn't
>>>>>>>>>>>> coming "free". So perhaps some change in how this is done would be
>>>>>>>>>>>> in order?
>>>>>>>>>>>>
>>>>>>>>>>>> Ralph
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> devel mailing list
>>>>>>>>>>>> devel_at_[hidden]
>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>> --
>>> Jeff Squyres
>>> Cisco Systems