Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] PML selection logic
From: Lenny Verkhovsky (lenny.verkhovsky_at_[hidden])
Date: 2008-06-29 04:44:43


We can also make a few different param files for typical setups (large
cluster / minimum latency / max bandwidth, etc.).
The desired param file can be chosen by a configure flag and placed in
$prefix/etc/openmpi-mca-params.conf
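
A rough sketch of how a configure-time profile could be turned into the
contents of a default param file. The profile names, the values assigned to
them, and the function name are made up for illustration; only the
"name = value" line format and the "pml" parameter come from Open MPI itself.

```python
# Hypothetical profiles: names and values are illustrative assumptions,
# not settings recommended by Open MPI.
PROFILES = {
    "default": {},
    "large-cluster": {"pml": "cm"},
    "commodity": {"pml": "ob1"},
}


def render_param_file(profile: str) -> str:
    """Return the text of an MCA param file for the given profile."""
    params = PROFILES[profile]
    lines = [f"# profile: {profile}"]
    # One "name = value" line per MCA parameter, in a stable order.
    lines += [f"{name} = {value}" for name, value in sorted(params.items())]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_param_file("large-cluster"), end="")
```

A sysadmin could still edit the generated file afterwards, which keeps the
"defaults, not hard-wired code" property discussed below.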

On Sat, Jun 28, 2008 at 3:55 PM, Jeff Squyres <jsquyres_at_[hidden]> wrote:

> Agreed. I have a few ideas in this direction as well (random thoughts that
> might as well be transcribed somewhere):
>
> - some kind of configure --enable-large-system (whatever) option is a Good
> Thing
>
> - it would be good if the configure option simply set [MCA parameter?]
> defaults wherever possible (vs. #if-selecting code). I think one of the
> biggest lessons learned from Open MPI is that everyone's setup is different
> -- having the ability to mix and match various run-time options, while not
> widely used, is absolutely critical in some scenarios. So it might be good
> if --enable-large-system sets a bunch of default parameters that some
> sysadmins may still want/need to override.
>
> - decision to run the modex: I haven't seen all of Ralph's work in this
> area, but I wonder if it's similar to the MPI handle parameter checks: it
> could be a multi-value MCA parameter, such as: "never", "always",
> "when-ompi-determines-its-necessary", etc., where the last value can use
> multiple criteria to know if it's necessary to do a modex (e.g., job size,
> when spawn occurs, whether the "pml" [or other critical] MCA param[s] were
> specified, ...etc.).
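
A minimal sketch of such a multi-value setting, assuming a three-way policy
where "auto" stands in for "when-ompi-determines-its-necessary". The function
name, the way the criteria are combined, and the 512 threshold (borrowed from
Brian's rough estimate further down the thread) are all assumptions, not
existing Open MPI behavior.

```python
def should_run_modex(policy: str, job_size: int, spawn_used: bool,
                     pml_specified: bool, size_threshold: int = 512) -> bool:
    """Decide whether to run the modex under a hypothetical policy setting."""
    if policy == "never":
        return False
    if policy == "always":
        return True
    if policy == "auto":
        # Assumption: dynamic processes always need the exchange.
        if spawn_used:
            return True
        # Assumption: an explicitly chosen PML makes the check redundant.
        if pml_specified:
            return False
        # Assumption: small jobs can afford the modex, large ones skip it.
        return job_size <= size_threshold
    raise ValueError(f"unknown modex policy: {policy!r}")


if __name__ == "__main__":
    print(should_run_modex("auto", job_size=64, spawn_used=False,
                           pml_specified=False))
```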
>
>
>
> On Jun 26, 2008, at 9:26 AM, Ralph H Castain wrote:
>
> Just to complete this thread...
>>
>> Brian raised a very good point, so we identified it on the weekly telecon
>> as
>> a subject that really should be discussed at next week's technical
>> meeting.
>> I think we can find a reasonable answer, but there are several ways it can
>> be done. So rather than doing our usual piecemeal approach to the
>> solution,
>> it makes sense to begin talking about a more holistic design for
>> accommodating both needs.
>>
>> Thanks Brian for pointing out the bigger picture.
>> Ralph
>>
>>
>>
>> On 6/24/08 8:22 AM, "Brian W. Barrett" <brbarret_at_[hidden]> wrote:
>>
>> yeah, that could be a problem, but it's such a minority case and we've got
>>> to draw the line somewhere.
>>>
>>> Of course, it seems like this is a never ending battle between two
>>> opposing forces... The desire to do the "right thing" all the time at
>>> small and medium scale and the desire to scale out to the "big thing".
>>> It seems like in the quest to kill off the modex, we've run into these
>>> pretty often.
>>>
>>> The modex doesn't hurt us at small scale (indeed, we're probably ok with
>>> the routed communication pattern up to 512 nodes or so if we don't do
>>> anything stupid, maybe further). Is it time to admit defeat in this
>>> argument and have a configure option that turns off the modex (at the
>>> cost
>>> of some of these correctness checks) for the large machines, but keeps
>>> things simple for the common case? I'm sure there are other things where
>>> this will come up, so perhaps a --enable-large-scale? Maybe it's a dumb
>>> idea, but it seems like we've made a lot of compromises lately around
>>> this, where no one ends up really happy with the solution :/.
>>>
>>> Brian
>>>
>>>
>>> On Tue, 24 Jun 2008, George Bosilca wrote:
>>>
>>> Brian hinted at a possible bug in one of his replies. How does this work in
>>>> the case of dynamic processes? We can envision several scenarios, but let's
>>>> take a simple one: two jobs that get connected with connect/accept. One
>>>> might publish the PML name (simply because the -mca argument was on) and
>>>> one might not?
>>>>
>>>> george.
>>>>
>>>> On Jun 24, 2008, at 8:28 AM, Jeff Squyres wrote:
>>>>
>>>> Also sounds good to me.
>>>>>
>>>>> Note that the most difficult part of the forward-looking plan is that
>>>>> we
>>>>> usually can't tell the difference between "something failed to
>>>>> initialize"
>>>>> and "you don't have support for feature X".
>>>>>
>>>>> I like the general philosophy of: running out of the box always works
>>>>> just
>>>>> fine, but if you/the sysadmin is smart, you can get performance
>>>>> improvements.
>>>>>
>>>>>
>>>>> On Jun 23, 2008, at 4:18 PM, Shipman, Galen M. wrote:
>>>>>
>>>>> I concur
>>>>>> - galen
>>>>>>
>>>>>> On Jun 23, 2008, at 3:44 PM, Brian W. Barrett wrote:
>>>>>>
>>>>>> That sounds like a reasonable plan to me.
>>>>>>>
>>>>>>> Brian
>>>>>>>
>>>>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>>>>>
>>>>>>> Okay, so let's explore an alternative that preserves the support you
>>>>>>>> are
>>>>>>>> seeking for the "ignorant user", but doesn't penalize everyone else.
>>>>>>>> What we
>>>>>>>> could do is simply set things up so that:
>>>>>>>>
>>>>>>>> 1. if -mca pml xyz is provided, then no modex data is added
>>>>>>>>
>>>>>>>> 2. if it is not provided, then only rank=0 inserts the data. All
>>>>>>>> other
>>>>>>>> procs
>>>>>>>> simply check their own selection against the one given by rank=0
>>>>>>>>
>>>>>>>> Now, if a knowledgeable user or sys admin specifies what to use for
>>>>>>>> their
>>>>>>>> system, we won't penalize their startup time. A user who doesn't
>>>>>>>> know
>>>>>>>> what
>>>>>>>> to do gets to run, albeit less scalably on startup.
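
The two rules above can be sketched as a small simulation. This is only a
model under stated assumptions: the modex is represented as a plain dict, and
the function names are hypothetical rather than taken from the Open MPI code
base.

```python
from typing import Dict, List, Optional


def modex_entries(selections: Dict[int, str],
                  user_pml: Optional[str]) -> Dict[int, str]:
    """Return what would be published: nothing if the user named a PML,
    otherwise only rank 0's selection."""
    if user_pml is not None:
        return {}
    return {0: selections[0]}


def mismatched_ranks(selections: Dict[int, str],
                     user_pml: Optional[str]) -> List[int]:
    """Return the ranks whose own selection differs from rank 0's entry."""
    published = modex_entries(selections, user_pml)
    if not published:
        # Nothing was published, so there is nothing to compare against.
        return []
    reference = published[0]
    return [rank for rank, pml in sorted(selections.items())
            if pml != reference]


if __name__ == "__main__":
    # Rank 2 failed to bring up its MTL and fell back to ob1.
    print(mismatched_ranks({0: "cm", 1: "cm", 2: "ob1"}, user_pml=None))
```

In this model the published data is one entry regardless of job size, and
each proc performs a single comparison instead of looping over all peers.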
>>>>>>>>
>>>>>>>> Looking forward from there, we can look to a day where failing to
>>>>>>>> initialize
>>>>>>>> something that exists on the system could be detected in some other
>>>>>>>> fashion,
>>>>>>>> letting the local proc abort since it would know that other procs
>>>>>>>> that
>>>>>>>> detected similar capabilities may well have selected that PML. For
>>>>>>>> now,
>>>>>>>> though, this would solve the problem.
>>>>>>>>
>>>>>>>> Make sense?
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 6/23/08 1:31 PM, "Brian W. Barrett" <brbarret_at_[hidden]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> The problem is that we default to OB1, but that's not the right
>>>>>>>>> choice
>>>>>>>>> for
>>>>>>>>> some platforms (like Pathscale / PSM), where there's a huge
>>>>>>>>> performance
>>>>>>>>> hit for using OB1. So we run into a situation where user installs
>>>>>>>>> Open
>>>>>>>>> MPI, starts running, gets horrible performance, bad mouths Open
>>>>>>>>> MPI,
>>>>>>>>> and
>>>>>>>>> now we're in that game again. Yeah, the sys admin should know what
>>>>>>>>> to
>>>>>>>>> do,
>>>>>>>>> but it doesn't always work that way.
>>>>>>>>>
>>>>>>>>> Brian
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>>>>>>>
>>>>>>>>> My fault - I should be more precise in my language. ;-/
>>>>>>>>>>
>>>>>>>>>> #1 is not adequate, IMHO, as it forces us to -always- do a modex.
>>>>>>>>>> It
>>>>>>>>>> seems
>>>>>>>>>> to me that a simpler solution to what you describe is for the user
>>>>>>>>>> to
>>>>>>>>>> specify -mca pml ob1, or -mca pml cm. If the latter, then you
>>>>>>>>>> could
>>>>>>>>>> deal
>>>>>>>>>> with the failed-to-initialize problem cleanly by having the proc
>>>>>>>>>> directly
>>>>>>>>>> abort.
>>>>>>>>>>
>>>>>>>>>> Again, sometimes I think we attempt to automate too many things.
>>>>>>>>>> This
>>>>>>>>>> seems
>>>>>>>>>> like a pretty clear case where you know what you want - the sys
>>>>>>>>>> admin,
>>>>>>>>>> if
>>>>>>>>>> nobody else, can certainly set that mca param in the default param
>>>>>>>>>> file!
>>>>>>>>>>
>>>>>>>>>> Otherwise, it seems to me that you are relying on the modex to
>>>>>>>>>> detect
>>>>>>>>>> that
>>>>>>>>>> your proc failed to init the correct subsystem. I hate to force a
>>>>>>>>>> modex just
>>>>>>>>>> for that - if so, then perhaps this could again be a settable
>>>>>>>>>> option
>>>>>>>>>> to
>>>>>>>>>> avoid requiring non-scalable behavior for those of us who want
>>>>>>>>>> scalability?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 6/23/08 1:21 PM, "Brian W. Barrett" <brbarret_at_[hidden]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> The selection code was added because frequently high speed
>>>>>>>>>>> interconnects
>>>>>>>>>>> fail to initialize properly due to random stuff happening (yes,
>>>>>>>>>>> that's a
>>>>>>>>>>> horrible statement, but true). We ran into a situation with some
>>>>>>>>>>> really
>>>>>>>>>>> flaky machines where most of the processes would choose CM, but a
>>>>>>>>>>> couple would fail to initialize the MTL and therefore choose OB1.
>>>>>>>>>>> This led to a hang situation, which is the worst of the worst.
>>>>>>>>>>>
>>>>>>>>>>> I think #1 is adequate, although it doesn't handle spawn
>>>>>>>>>>> particularly
>>>>>>>>>>> well. And spawn is generally used in environments where such
>>>>>>>>>>> network
>>>>>>>>>>> mismatches are most likely to occur.
>>>>>>>>>>>
>>>>>>>>>>> Brian
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>>>>>>>>>
>>>>>>>>>>> Since my goal is to eliminate the modex completely for managed
>>>>>>>>>>>> installations, could you give me a brief understanding of this
>>>>>>>>>>>> eventual PML
>>>>>>>>>>>> selection logic? It would help to hear an example of how and why
>>>>>>>>>>>> different
>>>>>>>>>>>> procs could get different answers - and why we would want to
>>>>>>>>>>>> allow
>>>>>>>>>>>> them to
>>>>>>>>>>>> do so.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Ralph
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 6/23/08 11:59 AM, "Aurélien Bouteiller" <
>>>>>>>>>>>> bouteill_at_[hidden]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> The first approach sounds fair enough to me. We should avoid 2
>>>>>>>>>>>>> and
>>>>>>>>>>>>> 3
>>>>>>>>>>>>> as the pml selection mechanism used to be
>>>>>>>>>>>>> more complex before we reduced it to accommodate a major design
>>>>>>>>>>>>> bug
>>>>>>>>>>>>> in
>>>>>>>>>>>>> the BTL selection process. When using the complete PML
>>>>>>>>>>>>> selection,
>>>>>>>>>>>>> BTL
>>>>>>>>>>>>> would be initialized several times, leading to a variety of
>>>>>>>>>>>>> bugs.
>>>>>>>>>>>>> Eventually the PML selection should return to its old self,
>>>>>>>>>>>>> when
>>>>>>>>>>>>> the
>>>>>>>>>>>>> BTL bug gets fixed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Aurelien
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 23 June 2008 at 12:36, Ralph H Castain wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yo all
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've been doing further research into the modex and came
>>>>>>>>>>>>>> across
>>>>>>>>>>>>>> something I
>>>>>>>>>>>>>> don't fully understand. It seems we have each process insert
>>>>>>>>>>>>>> into
>>>>>>>>>>>>>> the modex
>>>>>>>>>>>>>> the name of the PML module that it selected. Once the modex
>>>>>>>>>>>>>> has
>>>>>>>>>>>>>> exchanged
>>>>>>>>>>>>>> that info, it then loops across all procs in the job to check
>>>>>>>>>>>>>> their
>>>>>>>>>>>>>> selection, and aborts if any proc picked a different PML
>>>>>>>>>>>>>> module.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> All well and good...assuming that procs actually -can- choose
>>>>>>>>>>>>>> different PML modules and hence create an "abort" scenario.
>>>>>>>>>>>>>> However, if I look inside the PMLs at their selection logic, I
>>>>>>>>>>>>>> find that a proc can ONLY pick a module other than ob1 if:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. the user specifies the module to use via -mca pml xyz or by
>>>>>>>>>>>>>> using a
>>>>>>>>>>>>>> module specific mca param to adjust its priority. In this
>>>>>>>>>>>>>> case,
>>>>>>>>>>>>>> since the
>>>>>>>>>>>>>> mca param is propagated, ALL procs have no choice but to pick
>>>>>>>>>>>>>> that
>>>>>>>>>>>>>> same
>>>>>>>>>>>>>> module, so that can't cause us to abort (we will have already
>>>>>>>>>>>>>> returned an
>>>>>>>>>>>>>> error and aborted if the specified module can't run).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2. the pml/cm module detects that an MTL module was selected,
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>> that it is
>>>>>>>>>>>>>> other than "psm". In this case, the CM module will be selected
>>>>>>>>>>>>>> because its
>>>>>>>>>>>>>> default priority is higher than that of OB1.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In looking deeper into the MTL selection logic, it appears to
>>>>>>>>>>>>>> me
>>>>>>>>>>>>>> that you
>>>>>>>>>>>>>> either have the required capability or you don't. I can see
>>>>>>>>>>>>>> that
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>> some
>>>>>>>>>>>>>> environments (e.g., rsh across unmanaged collections of
>>>>>>>>>>>>>> machines),
>>>>>>>>>>>>>> it might
>>>>>>>>>>>>>> be possible for someone to launch across a set of machines
>>>>>>>>>>>>>> where
>>>>>>>>>>>>>> some do and
>>>>>>>>>>>>>> some don't have the required support. However, in all other
>>>>>>>>>>>>>> cases,
>>>>>>>>>>>>>> this will
>>>>>>>>>>>>>> be homogeneous across the system.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Given this analysis (and someone more familiar with the PML
>>>>>>>>>>>>>> should
>>>>>>>>>>>>>> feel free
>>>>>>>>>>>>>> to confirm or correct it), it seems to me that this could be
>>>>>>>>>>>>>> streamlined via
>>>>>>>>>>>>>> one or more means:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. at the most, we could have rank=0 add the PML module name
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> modex,
>>>>>>>>>>>>>> and other procs simply check it against their own and return
>>>>>>>>>>>>>> an
>>>>>>>>>>>>>> error if
>>>>>>>>>>>>>> they differ. This accomplishes the identical functionality to
>>>>>>>>>>>>>> what
>>>>>>>>>>>>>> we have
>>>>>>>>>>>>>> today, but with much less info in the modex.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2. we could eliminate this info from the modex altogether by
>>>>>>>>>>>>>> requiring the
>>>>>>>>>>>>>> user to specify the PML module if they want something other
>>>>>>>>>>>>>> than
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> default
>>>>>>>>>>>>>> OB1. In this case, there can be no confusion over what each
>>>>>>>>>>>>>> proc
>>>>>>>>>>>>>> is
>>>>>>>>>>>>>> to use.
>>>>>>>>>>>>>> The CM module will attempt to init the MTL - if it cannot do
>>>>>>>>>>>>>> so,
>>>>>>>>>>>>>> then the
>>>>>>>>>>>>>> job will return the correct error and tell the user that
>>>>>>>>>>>>>> CM/MTL
>>>>>>>>>>>>>> support is
>>>>>>>>>>>>>> unavailable.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 3. we could again eliminate the info by not inserting it into
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> modex if
>>>>>>>>>>>>>> (a) the default PML module is selected, or (b) the user
>>>>>>>>>>>>>> specified
>>>>>>>>>>>>>> the PML
>>>>>>>>>>>>>> module to be used. In the first case, each proc can simply
>>>>>>>>>>>>>> check
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>> see if
>>>>>>>>>>>>>> they picked the default - if not, then we can insert the info
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>> indicate
>>>>>>>>>>>>>> the difference. Thus, in the "standard" case, no info will be
>>>>>>>>>>>>>> inserted.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In the second case, we will already get an error if the
>>>>>>>>>>>>>> specified
>>>>>>>>>>>>>> PML module
>>>>>>>>>>>>>> could not be used. Hence, the modex check provides no
>>>>>>>>>>>>>> additional
>>>>>>>>>>>>>> info or
>>>>>>>>>>>>>> value.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I understand the motivation to support automation. However, in
>>>>>>>>>>>>>> this
>>>>>>>>>>>>>> case,
>>>>>>>>>>>>>> the automation actually doesn't seem to buy us very much, and
>>>>>>>>>>>>>> it
>>>>>>>>>>>>>> isn't
>>>>>>>>>>>>>> coming "free". So perhaps some change in how this is done
>>>>>>>>>>>>>> would be
>>>>>>>>>>>>>> in order?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> devel mailing list
>>>>>>>>>>>>>> devel_at_[hidden]
>>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>>>>>>>
>>>>> --
>>>>> Jeff Squyres
>>>>> Cisco Systems
>
>
> --
> Jeff Squyres
> Cisco Systems