Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] PML selection logic
From: George Bosilca (bosilca_at_[hidden])
Date: 2008-06-24 10:16:10


Brian hinted at a possible bug in one of his replies. How does this work
in the case of dynamic processes? We can envision several scenarios,
but let's take a simple one: two jobs that get connected with connect/accept.
One might publish the PML name (simply because the -mca argument was
on) and one might not?
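
To make the scenario concrete, here is a minimal sketch in Python. It assumes the rule from Ralph's plan quoted below (a job launched with -mca pml publishes nothing; otherwise rank 0 publishes its selection). The function names and the tuple layout are hypothetical, chosen only for illustration, and are not Open MPI APIs.

```python
def published_pml(selected, user_specified):
    """What a job would leave in its modex under the rank-0-only plan.

    Assumption: a job launched with -mca pml publishes nothing (None).
    """
    return None if user_specified else selected


def can_verify_peer(job_a, job_b):
    """Return 'match', 'mismatch', or 'unknown' for two connecting jobs.

    Each job is a (selected_pml, user_specified) tuple (hypothetical layout).
    """
    a = published_pml(*job_a)
    b = published_pml(*job_b)
    if a is None or b is None:
        return "unknown"  # at least one side published nothing to compare
    return "match" if a == b else "mismatch"


# Job A was launched with -mca pml cm, job B took the default ob1.
print(can_verify_peer(("cm", True), ("ob1", False)))
```

In this sketch the two jobs really do differ, yet the comparison can only report "unknown" because one side has no published name.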

   george.

On Jun 24, 2008, at 8:28 AM, Jeff Squyres wrote:

> Also sounds good to me.
>
> Note that the most difficult part of the forward-looking plan is
> that we usually can't tell the difference between "something failed
> to initialize" and "you don't have support for feature X".
>
> I like the general philosophy of: running out of the box always
> works just fine, but if you/the sysadmin is smart, you can get
> performance improvements.
>
>
> On Jun 23, 2008, at 4:18 PM, Shipman, Galen M. wrote:
>
>> I concur
>> - galen
>>
>> On Jun 23, 2008, at 3:44 PM, Brian W. Barrett wrote:
>>
>>> That sounds like a reasonable plan to me.
>>>
>>> Brian
>>>
>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>
>>>> Okay, so let's explore an alternative that preserves the support
>>>> you are
>>>> seeking for the "ignorant user", but doesn't penalize everyone
>>>> else. What we
>>>> could do is simply set things up so that:
>>>>
>>>> 1. if -mca pml xyz is provided, then no modex data is added
>>>>
>>>> 2. if it is not provided, then only rank=0 inserts the data. All
>>>> other procs
>>>> simply check their own selection against the one given by rank=0
>>>>
>>>> Now, if a knowledgeable user or sys admin specifies what to use
>>>> for their
>>>> system, we won't penalize their startup time. A user who doesn't
>>>> know what
>>>> to do gets to run, albeit less scalably on startup.
>>>>
>>>> Looking forward from there, we can look to a day where failing to
>>>> initialize
>>>> something that exists on the system could be detected in some
>>>> other fashion,
>>>> letting the local proc abort since it would know that other procs
>>>> that
>>>> detected similar capabilities may well have selected that PML.
>>>> For now,
>>>> though, this would solve the problem.
>>>>
>>>> Make sense?
>>>> Ralph
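
The two rules in the plan quoted above can be sketched roughly as follows. This is a toy simulation under stated assumptions, not Open MPI code: the modex is modeled as a plain dict, and `publish_pml`, `check_pml`, `run_job`, and `PMLMismatch` are hypothetical names.

```python
class PMLMismatch(RuntimeError):
    """Raised where the real code would return an error or abort."""


def publish_pml(rank, selected, user_specified, modex):
    """Add modex data only when the proposed rules require it."""
    if user_specified:
        return  # rule 1: -mca pml was given, nothing goes into the modex
    if rank == 0:
        modex["pml"] = selected  # rule 2: only rank 0 inserts its choice


def check_pml(rank, selected, user_specified, modex):
    """Each proc compares its own selection with the one rank 0 published."""
    if user_specified:
        return  # a forced PML already fails locally if it cannot run
    if selected != modex["pml"]:
        raise PMLMismatch(
            f"rank {rank} picked {selected!r}, rank 0 picked {modex['pml']!r}"
        )


def run_job(selections, user_specified):
    """Simulate one job; selections[i] is rank i's PML. Returns the modex."""
    modex = {}
    for rank, pml in enumerate(selections):
        publish_pml(rank, pml, user_specified, modex)
    # The modex exchange itself would happen here.
    for rank, pml in enumerate(selections):
        check_pml(rank, pml, user_specified, modex)
    return modex


print(len(run_job(["cm"] * 4, user_specified=True)))    # entries with -mca pml
print(len(run_job(["ob1"] * 4, user_specified=False)))  # entries by default
```

The point of the sketch is the size of the modex: it stays empty when the PML is specified and holds a single entry otherwise, instead of one entry per proc.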
>>>>
>>>>
>>>>
>>>> On 6/23/08 1:31 PM, "Brian W. Barrett" <brbarret_at_[hidden]>
>>>> wrote:
>>>>
>>>>> The problem is that we default to OB1, but that's not the right
>>>>> choice for
>>>>> some platforms (like Pathscale / PSM), where there's a huge
>>>>> performance
>>>>> hit for using OB1. So we run into a situation where a user
>>>>> installs Open
>>>>> MPI, starts running, gets horrible performance, bad-mouths Open
>>>>> MPI, and
>>>>> now we're in that game again. Yeah, the sys admin should know
>>>>> what to do,
>>>>> but it doesn't always work that way.
>>>>>
>>>>> Brian
>>>>>
>>>>>
>>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>>>
>>>>>> My fault - I should be more precise in my language. ;-/
>>>>>>
>>>>>> #1 is not adequate, IMHO, as it forces us to -always- do a
>>>>>> modex. It seems
>>>>>> to me that a simpler solution to what you describe is for the
>>>>>> user to
>>>>>> specify -mca pml ob1, or -mca pml cm. If the latter, then you
>>>>>> could deal
>>>>>> with the failed-to-initialize problem cleanly by having the
>>>>>> proc directly
>>>>>> abort.
>>>>>>
>>>>>> Again, sometimes I think we attempt to automate too many
>>>>>> things. This seems
>>>>>> like a pretty clear case where you know what you want - the sys
>>>>>> admin, if
>>>>>> nobody else, can certainly set that mca param in the default
>>>>>> param file!
>>>>>>
>>>>>> Otherwise, it seems to me that you are relying on the modex to
>>>>>> detect that
>>>>>> your proc failed to init the correct subsystem. I hate to force
>>>>>> a modex just
>>>>>> for that - if so, then perhaps this could again be a settable
>>>>>> option to
>>>>>> avoid requiring non-scalable behavior for those of us who want
>>>>>> scalability?
>>>>>>
>>>>>>
>>>>>> On 6/23/08 1:21 PM, "Brian W. Barrett" <brbarret_at_[hidden]>
>>>>>> wrote:
>>>>>>
>>>>>>> The selection code was added because frequently high speed
>>>>>>> interconnects
>>>>>>> fail to initialize properly due to random stuff happening
>>>>>>> (yes, that's a
>>>>>>> horrible statement, but true). We ran into a situation with
>>>>>>> some really
>>>>>>> flaky machines where most of the processes would choose CM, but
>>>>>>> a couple
>>>>>>> would fail to initialize the MTL and therefore choose OB1.
>>>>>>> This led to a
>>>>>>> hang situation, which is the worst of the worst.
>>>>>>>
>>>>>>> I think #1 is adequate, although it doesn't handle spawn
>>>>>>> particularly
>>>>>>> well. And spawn is generally used in environments where such
>>>>>>> network
>>>>>>> mismatches are most likely to occur.
>>>>>>>
>>>>>>> Brian
>>>>>>>
>>>>>>>
>>>>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>>>>>
>>>>>>>> Since my goal is to eliminate the modex completely for managed
>>>>>>>> installations, could you give me a brief understanding of
>>>>>>>> this eventual PML
>>>>>>>> selection logic? It would help to hear an example of how and
>>>>>>>> why different
>>>>>>>> procs could get different answers - and why we would want to
>>>>>>>> allow them to
>>>>>>>> do so.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 6/23/08 11:59 AM, "Aurélien Bouteiller" <bouteill_at_[hidden]> wrote:
>>>>>>>>
>>>>>>>>> The first approach sounds fair enough to me. We should avoid
>>>>>>>>> 2 and 3
>>>>>>>>> as the pml selection mechanism used to be
>>>>>>>>> more complex before we reduced it to accommodate a major
>>>>>>>>> design bug in
>>>>>>>>> the BTL selection process. When using the complete PML
>>>>>>>>> selection, BTL
>>>>>>>>> would be initialized several times, leading to a variety of
>>>>>>>>> bugs.
>>>>>>>>> Eventually the PML selection should return to its old self,
>>>>>>>>> when the
>>>>>>>>> BTL bug gets fixed.
>>>>>>>>>
>>>>>>>>> Aurelien
>>>>>>>>>
>>>>>>>>> On Jun 23, 2008, at 12:36, Ralph H Castain wrote:
>>>>>>>>>
>>>>>>>>>> Yo all
>>>>>>>>>>
>>>>>>>>>> I've been doing further research into the modex and came
>>>>>>>>>> across
>>>>>>>>>> something I
>>>>>>>>>> don't fully understand. It seems we have each process
>>>>>>>>>> insert into
>>>>>>>>>> the modex
>>>>>>>>>> the name of the PML module that it selected. Once the modex
>>>>>>>>>> has
>>>>>>>>>> exchanged
>>>>>>>>>> that info, it then loops across all procs in the job to
>>>>>>>>>> check their
>>>>>>>>>> selection, and aborts if any proc picked a different PML
>>>>>>>>>> module.
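
As a rough illustration of the check described in the paragraph above, here is a small Python sketch. It assumes the modex can be modeled as a dict from rank to PML name; `full_modex_check` and `PMLMismatch` are hypothetical names and do not correspond to functions in the Open MPI source.

```python
class PMLMismatch(RuntimeError):
    """Raised where the real code would abort the job."""


def full_modex_check(my_rank, modex):
    """Compare this proc's PML against every entry exchanged in the modex.

    `modex` maps rank -> PML name; every proc contributed one entry.
    """
    mine = modex[my_rank]
    for rank in sorted(modex):
        if modex[rank] != mine:
            raise PMLMismatch(
                f"rank {my_rank} picked {mine!r}, "
                f"rank {rank} picked {modex[rank]!r}"
            )
    return mine


# The flaky-interconnect case Brian describes in this thread:
# one proc failed to initialize the MTL and fell back to ob1.
modex = {0: "cm", 1: "cm", 2: "ob1"}
try:
    full_modex_check(0, modex)
except PMLMismatch as err:
    print("abort:", err)
```

Note that every proc both contributes an entry and scans all entries, which is the per-proc cost the message goes on to question.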
>>>>>>>>>>
>>>>>>>>>> All well and good...assuming that procs actually -can- choose
>>>>>>>>>> different PML
>>>>>>>>>> modules and hence create an "abort" scenario. However, if I
>>>>>>>>>> look
>>>>>>>>>> inside the
>>>>>>>>>> PMLs at their selection logic, I find that a proc can ONLY
>>>>>>>>>> pick a
>>>>>>>>>> module
>>>>>>>>>> other than ob1 if:
>>>>>>>>>>
>>>>>>>>>> 1. the user specifies the module to use via -mca pml xyz or
>>>>>>>>>> by using a
>>>>>>>>>> module specific mca param to adjust its priority. In this
>>>>>>>>>> case,
>>>>>>>>>> since the
>>>>>>>>>> mca param is propagated, ALL procs have no choice but to
>>>>>>>>>> pick that
>>>>>>>>>> same
>>>>>>>>>> module, so that can't cause us to abort (we will have already
>>>>>>>>>> returned an
>>>>>>>>>> error and aborted if the specified module can't run).
>>>>>>>>>>
>>>>>>>>>> 2. the pml/cm module detects that an MTL module was
>>>>>>>>>> selected, and
>>>>>>>>>> that it is
>>>>>>>>>> other than "psm". In this case, the CM module will be
>>>>>>>>>> selected
>>>>>>>>>> because its
>>>>>>>>>> default priority is higher than that of OB1.
>>>>>>>>>>
>>>>>>>>>> In looking deeper into the MTL selection logic, it appears
>>>>>>>>>> to me
>>>>>>>>>> that you
>>>>>>>>>> either have the required capability or you don't. I can see
>>>>>>>>>> that in
>>>>>>>>>> some
>>>>>>>>>> environments (e.g., rsh across unmanaged collections of
>>>>>>>>>> machines),
>>>>>>>>>> it might
>>>>>>>>>> be possible for someone to launch across a set of machines
>>>>>>>>>> where
>>>>>>>>>> some do and
>>>>>>>>>> some don't have the required support. However, in all other
>>>>>>>>>> cases,
>>>>>>>>>> this will
>>>>>>>>>> be homogeneous across the system.
>>>>>>>>>>
>>>>>>>>>> Given this analysis (and someone more familiar with the PML
>>>>>>>>>> should
>>>>>>>>>> feel free
>>>>>>>>>> to confirm or correct it), it seems to me that this could be
>>>>>>>>>> streamlined via
>>>>>>>>>> one or more means:
>>>>>>>>>>
>>>>>>>>>> 1. at the most, we could have rank=0 add the PML module
>>>>>>>>>> name to the
>>>>>>>>>> modex,
>>>>>>>>>> and other procs simply check it against their own and
>>>>>>>>>> return an
>>>>>>>>>> error if
>>>>>>>>>> they differ. This accomplishes the identical functionality
>>>>>>>>>> to what
>>>>>>>>>> we have
>>>>>>>>>> today, but with much less info in the modex.
>>>>>>>>>>
>>>>>>>>>> 2. we could eliminate this info from the modex altogether by
>>>>>>>>>> requiring the
>>>>>>>>>> user to specify the PML module if they want something other
>>>>>>>>>> than the
>>>>>>>>>> default
>>>>>>>>>> OB1. In this case, there can be no confusion over what each
>>>>>>>>>> proc is
>>>>>>>>>> to use.
>>>>>>>>>> The CM module will attempt to init the MTL - if it cannot
>>>>>>>>>> do so,
>>>>>>>>>> then the
>>>>>>>>>> job will return the correct error and tell the user that CM/
>>>>>>>>>> MTL
>>>>>>>>>> support is
>>>>>>>>>> unavailable.
>>>>>>>>>>
>>>>>>>>>> 3. we could again eliminate the info by not inserting it
>>>>>>>>>> into the
>>>>>>>>>> modex if
>>>>>>>>>> (a) the default PML module is selected, or (b) the user
>>>>>>>>>> specified
>>>>>>>>>> the PML
>>>>>>>>>> module to be used. In the first case, each proc can simply
>>>>>>>>>> check to
>>>>>>>>>> see if
>>>>>>>>>> they picked the default - if not, then we can insert the
>>>>>>>>>> info to
>>>>>>>>>> indicate
>>>>>>>>>> the difference. Thus, in the "standard" case, no info will be
>>>>>>>>>> inserted.
>>>>>>>>>>
>>>>>>>>>> In the second case, we will already get an error if the
>>>>>>>>>> specified
>>>>>>>>>> PML module
>>>>>>>>>> could not be used. Hence, the modex check provides no
>>>>>>>>>> additional
>>>>>>>>>> info or
>>>>>>>>>> value.
>>>>>>>>>>
>>>>>>>>>> I understand the motivation to support automation. However,
>>>>>>>>>> in this
>>>>>>>>>> case,
>>>>>>>>>> the automation actually doesn't seem to buy us very much,
>>>>>>>>>> and it isn't
>>>>>>>>>> coming "free". So perhaps some change in how this is done
>>>>>>>>>> would be
>>>>>>>>>> in order?
>>>>>>>>>>
>>>>>>>>>> Ralph
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> devel mailing list
>>>>>>>>>> devel_at_[hidden]
>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> --
> Jeff Squyres
> Cisco Systems