Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] PML selection logic
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-06-24 08:28:54


Also sounds good to me.

Note that the most difficult part of the forward-looking plan is that
we usually can't tell the difference between "something failed to
initialize" and "you don't have support for feature X".

I like the general philosophy of: running out of the box always works
just fine, but if you/the sysadmin is smart, you can get performance
improvements.

On Jun 23, 2008, at 4:18 PM, Shipman, Galen M. wrote:

> I concur
> - galen
>
> On Jun 23, 2008, at 3:44 PM, Brian W. Barrett wrote:
>
>> That sounds like a reasonable plan to me.
>>
>> Brian
>>
>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>
>>> Okay, so let's explore an alternative that preserves the support
>>> you are
>>> seeking for the "ignorant user", but doesn't penalize everyone
>>> else. What we
>>> could do is simply set things up so that:
>>>
>>> 1. if -mca pml xyz is provided, then no modex data is added
>>>
>>> 2. if it is not provided, then only rank=0 inserts the data. All
>>> other procs
>>> simply check their own selection against the one given by rank=0
>>>
>>> Now, if a knowledgeable user or sys admin specifies what to use
>>> for their
>>> system, we won't penalize their startup time. A user who doesn't
>>> know what
>>> to do gets to run, albeit less scalably on startup.
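
The two rules above can be sketched roughly as follows. This is a minimal illustration in Python rather than Open MPI code: the dictionary standing in for the modex, the key name, and both function names are assumptions made for the example.

```python
from typing import Dict, Optional

MODEX_KEY = "pml"  # hypothetical key name, not Open MPI's actual modex key


def publish_selection(rank: int, selected: str,
                      user_specified: Optional[str],
                      modex: Dict[str, str]) -> None:
    """Rule 1: a user-specified PML adds nothing to the modex.
    Rule 2: otherwise only rank 0 inserts its selection."""
    if user_specified is None and rank == 0:
        modex[MODEX_KEY] = selected


def selection_ok(selected: str, user_specified: Optional[str],
                 modex: Dict[str, str]) -> bool:
    """Each proc compares its own selection with the one rank 0 published.
    With a user-specified PML there is nothing to compare."""
    if user_specified is not None:
        return True
    return modex.get(MODEX_KEY) == selected
```

In this sketch the modex carries at most one PML entry per job instead of one per process, and carries none at all when the PML is given on the command line or in a param file.
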
>>>
>>> Looking forward from there, we can look to a day where failing to
>>> initialize
>>> something that exists on the system could be detected in some
>>> other fashion,
>>> letting the local proc abort since it would know that other procs
>>> that
>>> detected similar capabilities may well have selected that PML. For
>>> now,
>>> though, this would solve the problem.
>>>
>>> Make sense?
>>> Ralph
>>>
>>>
>>>
>>> On 6/23/08 1:31 PM, "Brian W. Barrett" <brbarret_at_[hidden]>
>>> wrote:
>>>
>>>> The problem is that we default to OB1, but that's not the right
>>>> choice for
>>>> some platforms (like Pathscale / PSM), where there's a huge
>>>> performance
>>>> hit for using OB1. So we run into a situation where a user
>>>> installs Open
>>>> MPI, starts running, gets horrible performance, bad-mouths Open
>>>> MPI, and
>>>> now we're in that game again. Yeah, the sys admin should know
>>>> what to do,
>>>> but it doesn't always work that way.
>>>>
>>>> Brian
>>>>
>>>>
>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>>
>>>>> My fault - I should be more precise in my language. ;-/
>>>>>
>>>>> #1 is not adequate, IMHO, as it forces us to -always- do a
>>>>> modex. It seems
>>>>> to me that a simpler solution to what you describe is for the
>>>>> user to
>>>>> specify -mca pml ob1, or -mca pml cm. If the latter, then you
>>>>> could deal
>>>>> with the failed-to-initialize problem cleanly by having the proc
>>>>> directly
>>>>> abort.
>>>>>
>>>>> Again, sometimes I think we attempt to automate too many things.
>>>>> This seems
>>>>> like a pretty clear case where you know what you want - the sys
>>>>> admin, if
>>>>> nobody else, can certainly set that mca param in the default
>>>>> param file!
>>>>>
>>>>> Otherwise, it seems to me that you are relying on the modex to
>>>>> detect that
>>>>> your proc failed to init the correct subsystem. I hate to force
>>>>> a modex just
>>>>> for that - if so, then perhaps this could again be a settable
>>>>> option to
>>>>> avoid requiring non-scalable behavior for those of us who want
>>>>> scalability?
>>>>>
>>>>>
>>>>> On 6/23/08 1:21 PM, "Brian W. Barrett" <brbarret_at_[hidden]>
>>>>> wrote:
>>>>>
>>>>>> The selection code was added because frequently high speed
>>>>>> interconnects
>>>>>> fail to initialize properly due to random stuff happening (yes,
>>>>>> that's a
>>>>>> horrible statement, but true). We ran into a situation with
>>>>>> some really
>>>>>> flaky machines where most of the processes would choose CM, but
>>>>>> a couple
>>>>>> would fail to initialize the MTL and therefore choose OB1. This
>>>>>> led to a
>>>>>> hang situation, which is the worst of the worst.
>>>>>>
>>>>>> I think #1 is adequate, although it doesn't handle spawn
>>>>>> particularly
>>>>>> well. And spawn is generally used in environments where such
>>>>>> network
>>>>>> mismatches are most likely to occur.
>>>>>>
>>>>>> Brian
>>>>>>
>>>>>>
>>>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>>>>
>>>>>>> Since my goal is to eliminate the modex completely for managed
>>>>>>> installations, could you give me a brief understanding of this
>>>>>>> eventual PML
>>>>>>> selection logic? It would help to hear an example of how and
>>>>>>> why different
>>>>>>> procs could get different answers - and why we would want to
>>>>>>> allow them to
>>>>>>> do so.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Ralph
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 6/23/08 11:59 AM, "Aurélien Bouteiller" <bouteill_at_[hidden]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> The first approach sounds fair enough to me. We should avoid
>>>>>>>> 2 and 3
>>>>>>>> as the pml selection mechanism used to be
>>>>>>>> more complex before we reduced it to accommodate a major
>>>>>>>> design bug in
>>>>>>>> the BTL selection process. When using the complete PML
>>>>>>>> selection, BTL
>>>>>>>> would be initialized several times, leading to a variety of
>>>>>>>> bugs.
>>>>>>>> Eventually the PML selection should return to its old self,
>>>>>>>> when the
>>>>>>>> BTL bug gets fixed.
>>>>>>>>
>>>>>>>> Aurelien
>>>>>>>>
>>>>>>>> On 23 June 08 at 12:36, Ralph H Castain wrote:
>>>>>>>>
>>>>>>>>> Yo all
>>>>>>>>>
>>>>>>>>> I've been doing further research into the modex and came
>>>>>>>>> across
>>>>>>>>> something I
>>>>>>>>> don't fully understand. It seems we have each process insert
>>>>>>>>> into
>>>>>>>>> the modex
>>>>>>>>> the name of the PML module that it selected. Once the modex
>>>>>>>>> has
>>>>>>>>> exchanged
>>>>>>>>> that info, it then loops across all procs in the job to
>>>>>>>>> check their
>>>>>>>>> selection, and aborts if any proc picked a different PML
>>>>>>>>> module.
>>>>>>>>>
>>>>>>>>> All well and good...assuming that procs actually -can- choose
>>>>>>>>> different PML
>>>>>>>>> modules and hence create an "abort" scenario. However, if I
>>>>>>>>> look
>>>>>>>>> inside the
>>>>>>>>> PMLs at their selection logic, I find that a proc can ONLY
>>>>>>>>> pick a
>>>>>>>>> module
>>>>>>>>> other than ob1 if:
>>>>>>>>>
>>>>>>>>> 1. the user specifies the module to use via -mca pml xyz or
>>>>>>>>> by using a
>>>>>>>>> module specific mca param to adjust its priority. In this
>>>>>>>>> case,
>>>>>>>>> since the
>>>>>>>>> mca param is propagated, ALL procs have no choice but to
>>>>>>>>> pick that
>>>>>>>>> same
>>>>>>>>> module, so that can't cause us to abort (we will have already
>>>>>>>>> returned an
>>>>>>>>> error and aborted if the specified module can't run).
>>>>>>>>>
>>>>>>>>> 2. the pml/cm module detects that an MTL module was
>>>>>>>>> selected, and
>>>>>>>>> that it is
>>>>>>>>> other than "psm". In this case, the CM module will be selected
>>>>>>>>> because its
>>>>>>>>> default priority is higher than that of OB1.
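
A rough model of the two cases above, written in Python for illustration only. The priority numbers are made up for the example and the function is hypothetical; the only properties taken from the description are that a user-specified module is the sole candidate and that CM outranks OB1 by default when it is available.

```python
from typing import List, Optional, Tuple


def select_pml(available: List[Tuple[str, int]],
               forced: Optional[str] = None) -> str:
    """available holds (name, priority) pairs for the PML modules that
    initialized on this proc; forced models -mca pml <name>."""
    if forced is not None:
        # Case 1: the user named the module, so nothing else may be chosen.
        if any(name == forced for name, _ in available):
            return forced
        raise RuntimeError("requested PML %r cannot run" % forced)
    if not available:
        raise RuntimeError("no PML module available")
    # Case 2: the highest priority among the usable modules wins.
    return max(available, key=lambda entry: entry[1])[0]
```

With illustrative priorities such as `[("ob1", 20), ("cm", 30)]` the sketch picks `cm`; if the MTL is unavailable on a proc and only `("ob1", 20)` is usable, the same call picks `ob1`.
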
>>>>>>>>>
>>>>>>>>> In looking deeper into the MTL selection logic, it appears
>>>>>>>>> to me
>>>>>>>>> that you
>>>>>>>>> either have the required capability or you don't. I can see
>>>>>>>>> that in
>>>>>>>>> some
>>>>>>>>> environments (e.g., rsh across unmanaged collections of
>>>>>>>>> machines),
>>>>>>>>> it might
>>>>>>>>> be possible for someone to launch across a set of machines
>>>>>>>>> where
>>>>>>>>> some do and
>>>>>>>>> some don't have the required support. However, in all other
>>>>>>>>> cases,
>>>>>>>>> this will
>>>>>>>>> be homogeneous across the system.
>>>>>>>>>
>>>>>>>>> Given this analysis (and someone more familiar with the PML
>>>>>>>>> should
>>>>>>>>> feel free
>>>>>>>>> to confirm or correct it), it seems to me that this could be
>>>>>>>>> streamlined via
>>>>>>>>> one or more means:
>>>>>>>>>
>>>>>>>>> 1. at the most, we could have rank=0 add the PML module name
>>>>>>>>> to the
>>>>>>>>> modex,
>>>>>>>>> and other procs simply check it against their own and return
>>>>>>>>> an
>>>>>>>>> error if
>>>>>>>>> they differ. This accomplishes the identical functionality
>>>>>>>>> to what
>>>>>>>>> we have
>>>>>>>>> today, but with much less info in the modex.
>>>>>>>>>
>>>>>>>>> 2. we could eliminate this info from the modex altogether by
>>>>>>>>> requiring the
>>>>>>>>> user to specify the PML module if they want something other
>>>>>>>>> than the
>>>>>>>>> default
>>>>>>>>> OB1. In this case, there can be no confusion over what each
>>>>>>>>> proc is
>>>>>>>>> to use.
>>>>>>>>> The CM module will attempt to init the MTL - if it cannot do so,
>>>>>>>>> then the job will return the correct error and tell the user that
>>>>>>>>> CM/MTL support is unavailable.
>>>>>>>>>
>>>>>>>>> 3. we could again eliminate the info by not inserting it
>>>>>>>>> into the
>>>>>>>>> modex if
>>>>>>>>> (a) the default PML module is selected, or (b) the user
>>>>>>>>> specified
>>>>>>>>> the PML
>>>>>>>>> module to be used. In the first case, each proc can simply
>>>>>>>>> check to
>>>>>>>>> see if
>>>>>>>>> they picked the default - if not, then we can insert the
>>>>>>>>> info to
>>>>>>>>> indicate
>>>>>>>>> the difference. Thus, in the "standard" case, no info will be
>>>>>>>>> inserted.
>>>>>>>>>
>>>>>>>>> In the second case, we will already get an error if the
>>>>>>>>> specified
>>>>>>>>> PML module
>>>>>>>>> could not be used. Hence, the modex check provides no
>>>>>>>>> additional
>>>>>>>>> info or
>>>>>>>>> value.
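
The insertion rule in option 3 can be sketched as a single decision function. This is an illustrative Python model with a hypothetical function name; it covers only the decision about whether to insert, since how the other procs consume the entry is not spelled out above.

```python
from typing import Optional


def modex_entry(selected: str, user_specified: Optional[str],
                default: str = "ob1") -> Optional[str]:
    """Return the PML name to insert into the modex, or None for no entry."""
    if user_specified is not None:
        # (b) The user chose the PML; a failure to use it is already an error.
        return None
    if selected == default:
        # (a) The default was picked, so there is nothing to report.
        return None
    # Only a deviation from the default is reported.
    return selected
```

Under these assumptions a job that runs entirely on the default PML, or with an explicitly specified one, adds no PML data to the modex.
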
>>>>>>>>>
>>>>>>>>> I understand the motivation to support automation. However,
>>>>>>>>> in this
>>>>>>>>> case,
>>>>>>>>> the automation actually doesn't seem to buy us very much,
>>>>>>>>> and it isn't
>>>>>>>>> coming "free". So perhaps some change in how this is done
>>>>>>>>> would be
>>>>>>>>> in order?
>>>>>>>>>
>>>>>>>>> Ralph
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> devel mailing list
>>>>>>>>> devel_at_[hidden]
>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Jeff Squyres
Cisco Systems