Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] PML selection logic
From: Shipman, Galen M. (gshipman_at_[hidden])
Date: 2008-06-23 16:18:47


I concur
- galen

On Jun 23, 2008, at 3:44 PM, Brian W. Barrett wrote:

> That sounds like a reasonable plan to me.
>
> Brian
>
> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>
>> Okay, so let's explore an alternative that preserves the support you
>> are seeking for the "ignorant user", but doesn't penalize everyone
>> else. What we could do is simply set things up so that:
>>
>> 1. if -mca pml xyz is provided, then no modex data is added
>>
>> 2. if it is not provided, then only rank=0 inserts the data. All other
>> procs simply check their own selection against the one given by rank=0
>>
>> Now, if a knowledgeable user or sys admin specifies what to use for
>> their system, we won't penalize their startup time. A user who doesn't
>> know what to do gets to run, albeit less scalably on startup.
>>
>> Looking forward from there, we can look to a day where failing to
>> initialize something that exists on the system could be detected in
>> some other fashion, letting the local proc abort since it would know
>> that other procs that detected similar capabilities may well have
>> selected that PML. For now, though, this would solve the problem.
>>
>> Make sense?
>> Ralph
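
The two-case plan above can be sketched as a small simulation. This is only an illustration, assuming a dict as a stand-in for the modex; the names modex_entries and check_against_rank0 are hypothetical and are not Open MPI functions.

```python
def modex_entries(user_pml, selections):
    """Return the PML entries that would be placed in the (simulated) modex.

    user_pml:   PML name given via "-mca pml xyz", or None if not provided.
    selections: dict mapping rank -> PML name that proc selected.
    """
    if user_pml is not None:
        return {}                    # case 1: nothing is added to the modex
    return {0: selections[0]}        # case 2: only rank=0 inserts its choice


def check_against_rank0(user_pml, selections):
    """Each proc compares its own selection with the one given by rank=0.

    Returns the number of PML entries that were exchanged.
    """
    modex = modex_entries(user_pml, selections)
    if not modex:
        return 0                     # user picked the PML, no check needed
    reference = modex[0]
    for rank, mine in selections.items():
        if mine != reference:
            raise RuntimeError(
                f"rank {rank} selected {mine!r} but rank 0 selected {reference!r}"
            )
    return len(modex)
```

With this shape, a job that passes -mca pml exchanges no PML data at all, and a job that does not exchanges a single entry regardless of the number of procs.
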
>>
>>
>>
>> On 6/23/08 1:31 PM, "Brian W. Barrett" <brbarret_at_[hidden]> wrote:
>>
>>> The problem is that we default to OB1, but that's not the right
>>> choice for some platforms (like Pathscale / PSM), where there's a
>>> huge performance hit for using OB1. So we run into a situation where
>>> a user installs Open MPI, starts running, gets horrible performance,
>>> bad-mouths Open MPI, and now we're in that game again. Yeah, the sys
>>> admin should know what to do, but it doesn't always work that way.
>>>
>>> Brian
>>>
>>>
>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>
>>>> My fault - I should be more precise in my language. ;-/
>>>>
>>>> #1 is not adequate, IMHO, as it forces us to -always- do a modex.
>>>> It seems to me that a simpler solution to what you describe is for
>>>> the user to specify -mca pml ob1, or -mca pml cm. If the latter,
>>>> then you could deal with the failed-to-initialize problem cleanly
>>>> by having the proc directly abort.
>>>>
>>>> Again, sometimes I think we attempt to automate too many things.
>>>> This seems like a pretty clear case where you know what you want -
>>>> the sys admin, if nobody else, can certainly set that mca param in
>>>> the default param file!
>>>>
>>>> Otherwise, it seems to me that you are relying on the modex to
>>>> detect that your proc failed to init the correct subsystem. I hate
>>>> to force a modex just for that - if so, then perhaps this could
>>>> again be a settable option to avoid requiring non-scalable behavior
>>>> for those of us who want scalability?
>>>>
>>>>
>>>> On 6/23/08 1:21 PM, "Brian W. Barrett" <brbarret_at_[hidden]> wrote:
>>>>
>>>>> The selection code was added because frequently high-speed
>>>>> interconnects fail to initialize properly due to random stuff
>>>>> happening (yes, that's a horrible statement, but true). We ran
>>>>> into a situation with some really flaky machines where most of the
>>>>> processes would choose CM, but a couple would fail to initialize
>>>>> the MTL and therefore choose OB1. This led to a hang situation,
>>>>> which is the worst of the worst.
>>>>>
>>>>> I think #1 is adequate, although it doesn't handle spawn
>>>>> particularly well. And spawn is generally used in environments
>>>>> where such network mismatches are most likely to occur.
>>>>>
>>>>> Brian
>>>>>
>>>>>
>>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>>>
>>>>>> Since my goal is to eliminate the modex completely for managed
>>>>>> installations, could you give me a brief understanding of this
>>>>>> eventual PML selection logic? It would help to hear an example of
>>>>>> how and why different procs could get different answers - and why
>>>>>> we would want to allow them to do so.
>>>>>>
>>>>>> Thanks
>>>>>> Ralph
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 6/23/08 11:59 AM, "Aurélien Bouteiller" <bouteill_at_[hidden]> wrote:
>>>>>>
>>>>>>> The first approach sounds fair enough to me. We should avoid 2
>>>>>>> and 3, as the PML selection mechanism used to be more complex
>>>>>>> before we reduced it to accommodate a major design bug in the
>>>>>>> BTL selection process. When using the complete PML selection,
>>>>>>> BTL would be initialized several times, leading to a variety of
>>>>>>> bugs. Eventually the PML selection should return to its old
>>>>>>> self, when the BTL bug gets fixed.
>>>>>>>
>>>>>>> Aurelien
>>>>>>>
>>>>>>> On 23 June 08 at 12:36, Ralph H Castain wrote:
>>>>>>>
>>>>>>>> Yo all
>>>>>>>>
>>>>>>>> I've been doing further research into the modex and came across
>>>>>>>> something I don't fully understand. It seems we have each
>>>>>>>> process insert into the modex the name of the PML module that
>>>>>>>> it selected. Once the modex has exchanged that info, it then
>>>>>>>> loops across all procs in the job to check their selection, and
>>>>>>>> aborts if any proc picked a different PML module.
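
For illustration, the behavior described in the paragraph above can be modeled roughly as follows. This is a simplified simulation, not the actual Open MPI code: the dict standing in for the modex and the function name check_all_to_all are assumptions made for the sketch.

```python
def check_all_to_all(selections):
    """Simulate every proc publishing its PML name and checking every peer.

    selections: dict mapping rank -> PML name that proc selected
                (a hypothetical stand-in for the real modex contents).
    Returns the number of PML entries that had to be exchanged.
    """
    modex = dict(selections)         # every proc inserts one entry
    for rank, mine in selections.items():
        # each proc loops across all procs in the job
        for peer, theirs in modex.items():
            if theirs != mine:
                raise RuntimeError(
                    f"rank {rank} selected {mine!r} but rank {peer} selected {theirs!r}"
                )
    return len(modex)
```

The point of the sketch is the cost: the number of exchanged entries grows with the number of procs, even though in the common case every entry is identical.
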
>>>>>>>>
>>>>>>>> All well and good...assuming that procs actually -can- choose
>>>>>>>> different PML modules and hence create an "abort" scenario.
>>>>>>>> However, if I look inside the PMLs at their selection logic, I
>>>>>>>> find that a proc can ONLY pick a module other than ob1 if:
>>>>>>>>
>>>>>>>> 1. the user specifies the module to use via -mca pml xyz or by
>>>>>>>> using a module specific mca param to adjust its priority. In
>>>>>>>> this case, since the mca param is propagated, ALL procs have no
>>>>>>>> choice but to pick that same module, so that can't cause us to
>>>>>>>> abort (we will have already returned an error and aborted if
>>>>>>>> the specified module can't run).
>>>>>>>>
>>>>>>>> 2. the pml/cm module detects that an MTL module was selected,
>>>>>>>> and that it is other than "psm". In this case, the CM module
>>>>>>>> will be selected because its default priority is higher than
>>>>>>>> that of OB1.
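
The two conditions above could be sketched like this. It is a rough model under stated assumptions: the priority values are made up for the example (the thread only says that CM's default priority is higher than OB1's), select_pml is a hypothetical name, and the per-module priority params mentioned in item 1 are left out.

```python
# Hypothetical priorities; only their ordering (CM above OB1) comes from
# the description above.
OB1_PRIORITY = 20
CM_PRIORITY = 30


def select_pml(user_pml=None, mtl=None):
    """Pick a PML name for one proc.

    user_pml: value of "-mca pml xyz" if the user gave one, else None.
    mtl:      name of the MTL module that initialized on this proc, or
              None if no MTL could be initialized.
    """
    if user_pml is not None:
        return user_pml              # item 1: the user's choice wins
    candidates = {"ob1": OB1_PRIORITY}
    if mtl is not None and mtl != "psm":
        candidates["cm"] = CM_PRIORITY   # item 2: CM becomes eligible
    return max(candidates, key=candidates.get)
```

Under this model, two procs in the same job can only disagree when an MTL initializes on one and fails on the other, for example select_pml(mtl="mx") versus select_pml(mtl=None), where "mx" is just an example MTL name.
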
>>>>>>>>
>>>>>>>> In looking deeper into the MTL selection logic, it appears to
>>>>>>>> me that you either have the required capability or you don't. I
>>>>>>>> can see that in some environments (e.g., rsh across unmanaged
>>>>>>>> collections of machines), it might be possible for someone to
>>>>>>>> launch across a set of machines where some do and some don't
>>>>>>>> have the required support. However, in all other cases, this
>>>>>>>> will be homogeneous across the system.
>>>>>>>>
>>>>>>>> Given this analysis (and someone more familiar with the PML
>>>>>>>> should feel free to confirm or correct it), it seems to me that
>>>>>>>> this could be streamlined via one or more means:
>>>>>>>>
>>>>>>>> 1. at the most, we could have rank=0 add the PML module name to
>>>>>>>> the modex, and other procs simply check it against their own
>>>>>>>> and return an error if they differ. This accomplishes the
>>>>>>>> identical functionality to what we have today, but with much
>>>>>>>> less info in the modex.
>>>>>>>>
>>>>>>>> 2. we could eliminate this info from the modex altogether by
>>>>>>>> requiring the user to specify the PML module if they want
>>>>>>>> something other than the default OB1. In this case, there can
>>>>>>>> be no confusion over what each proc is to use. The CM module
>>>>>>>> will attempt to init the MTL - if it cannot do so, then the job
>>>>>>>> will return the correct error and tell the user that CM/MTL
>>>>>>>> support is unavailable.
>>>>>>>>
>>>>>>>> 3. we could again eliminate the info by not inserting it into
>>>>>>>> the modex if (a) the default PML module is selected, or (b) the
>>>>>>>> user specified the PML module to be used. In the first case,
>>>>>>>> each proc can simply check to see if they picked the default -
>>>>>>>> if not, then we can insert the info to indicate the difference.
>>>>>>>> Thus, in the "standard" case, no info will be inserted.
>>>>>>>>
>>>>>>>> In the second case, we will already get an error if the
>>>>>>>> specified PML module could not be used. Hence, the modex check
>>>>>>>> provides no additional info or value.
>>>>>>>>
>>>>>>>> I understand the motivation to support automation. However, in
>>>>>>>> this case, the automation actually doesn't seem to buy us very
>>>>>>>> much, and it isn't coming "free". So perhaps some change in how
>>>>>>>> this is done would be in order?
>>>>>>>>
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> devel mailing list
>>>>>>>> devel_at_[hidden]
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel