Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] PML selection logic
From: Lenny Verkhovsky (lenny.verkhovsky_at_[hidden])
Date: 2008-06-29 04:44:43


We could also make a few different param files for typical setups (large
cluster / minimum latency / max bandwidth, etc.). The desired param file
could be chosen by a configure flag and placed in
$prefix/etc/openmpi-mca-params.conf.
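
For illustration, a hypothetical "large cluster" profile in that file might
look something like this ("pml" and "btl" are real MCA parameter names; the
values are only an example, not a vetted tuning set):

    # openmpi-mca-params.conf -- example "large cluster" profile
    # pin the PML and restrict which BTLs are even considered
    pml = ob1
    btl = self,sm,openib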

On Sat, Jun 28, 2008 at 3:55 PM, Jeff Squyres <jsquyres_at_[hidden]> wrote:

> Agreed. I have a few ideas in this direction as well (random thoughts that
> might as well be transcribed somewhere):
>
> - some kind of configure --enable-large-system (whatever) option is a Good
> Thing
>
> - it would be good if the configure option simply set [MCA parameter?]
> defaults wherever possible (vs. #if-selecting code). I think one of the
> biggest lessons learned from Open MPI is that everyone's setup is different
> -- having the ability to mix and match various run-time options, while not
> widely used, is absolutely critical in some scenarios. So it might be good
> if --enable-large-system sets a bunch of default parameters that some
> sysadmins may still want/need to override.
>
> - decision to run the modex: I haven't seen all of Ralph's work in this
> area, but I wonder if it's similar to the MPI handle parameter checks: it
> could be a multi-value MCA parameter, such as: "never", "always",
> "when-ompi-determines-its-necessary", etc., where the last value can use
> multiple criteria to know if it's necessary to do a modex (e.g., job size,
> when spawn occurs, whether the "pml" [or other critical] MCA param[s] were
> specified, ...etc.).
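>
> For example (purely illustrative -- no such parameter exists today, and the
> name here is made up), the default param file could then carry something
> like:
>
>     # hypothetical knob controlling when the modex is performed
>     # values: always / never / when-ompi-determines-its-necessary
>     modex_policy = when-ompi-determines-its-necessary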
>
>
>
> On Jun 26, 2008, at 9:26 AM, Ralph H Castain wrote:
>
>> Just to complete this thread...
>>
>> Brian raised a very good point, so we identified it on the weekly telecon
>> as a subject that really should be discussed at next week's technical
>> meeting. I think we can find a reasonable answer, but there are several
>> ways it can be done. So rather than doing our usual piecemeal approach to
>> the solution, it makes sense to begin talking about a more holistic design
>> for accommodating both needs.
>>
>> Thanks Brian for pointing out the bigger picture.
>> Ralph
>>
>>
>>
>> On 6/24/08 8:22 AM, "Brian W. Barrett" <brbarret_at_[hidden]> wrote:
>>
>>> Yeah, that could be a problem, but it's such a minority case and we've got
>>> to draw the line somewhere.
>>>
>>> Of course, it seems like this is a never-ending battle between two
>>> opposing forces... the desire to do the "right thing" all the time at
>>> small and medium scale, and the desire to scale out to the "big thing".
>>> It seems like in the quest to kill off the modex, we've run into these
>>> pretty often.
>>>
>>> The modex doesn't hurt us at small scale (indeed, we're probably ok with
>>> the routed communication pattern up to 512 nodes or so if we don't do
>>> anything stupid, maybe further). Is it time to admit defeat in this
>>> argument and have a configure option that turns off the modex (at the cost
>>> of some of these correctness checks) for the large machines, but keeps
>>> things simple for the common case? I'm sure there are other things where
>>> this will come up, so perhaps a --enable-large-scale? Maybe it's a dumb
>>> idea, but it seems like we've made a lot of compromises lately around
>>> this, where no one ends up really happy with the solution :/.
>>>
>>> Brian
>>>
>>>
>>> On Tue, 24 Jun 2008, George Bosilca wrote:
>>>
>>>> Brian hinted at a possible bug in one of his replies. How does this work
>>>> in the case of dynamic processes? We can envision several scenarios, but
>>>> let's take a simple one: 2 jobs that get connected with connect/accept.
>>>> One might publish the PML name (simply because the -mca argument was on)
>>>> and one might not?
>>>>
>>>> george.
>>>>
>>>> On Jun 24, 2008, at 8:28 AM, Jeff Squyres wrote:
>>>>
>>>>> Also sounds good to me.
>>>>>
>>>>> Note that the most difficult part of the forward-looking plan is that we
>>>>> usually can't tell the difference between "something failed to
>>>>> initialize" and "you don't have support for feature X".
>>>>>
>>>>> I like the general philosophy of: running out of the box always works
>>>>> just fine, but if you/the sysadmin is smart, you can get performance
>>>>> improvements.
>>>>>
>>>>>
>>>>> On Jun 23, 2008, at 4:18 PM, Shipman, Galen M. wrote:
>>>>>
>>>>> I concur
>>>>>> - galen
>>>>>>
>>>>>> On Jun 23, 2008, at 3:44 PM, Brian W. Barrett wrote:
>>>>>>
>>>>>> That sounds like a reasonable plan to me.
>>>>>>>
>>>>>>> Brian
>>>>>>>
>>>>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>>>>>
>>>>>>>> Okay, so let's explore an alternative that preserves the support you
>>>>>>>> are seeking for the "ignorant user", but doesn't penalize everyone
>>>>>>>> else. What we could do is simply set things up so that:
>>>>>>>>
>>>>>>>> 1. if -mca pml xyz is provided, then no modex data is added
>>>>>>>>
>>>>>>>> 2. if it is not provided, then only rank=0 inserts the data. All other
>>>>>>>> procs simply check their own selection against the one given by rank=0
>>>>>>>> Now, if a knowledgeable user or sys admin specifies what to use for
>>>>>>>> their system, we won't penalize their startup time. A user who doesn't
>>>>>>>> know what to do gets to run, albeit less scalably on startup.
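>>>>>>>>
>>>>>>>> (Very rough sketch of the idea -- this is NOT actual Open MPI code;
>>>>>>>> the "publish" step is stubbed with a plain variable just so the
>>>>>>>> snippet is self-contained:)
>>>>>>>>
>>>>>>>>   #include <string.h>
>>>>>>>>
>>>>>>>>   static char rank0_pml[64];  /* stand-in for the real modex storage */
>>>>>>>>
>>>>>>>>   /* Return 0 if this proc's PML selection is usable, -1 on mismatch. */
>>>>>>>>   static int check_pml_selection(int my_rank, const char *my_pml,
>>>>>>>>                                  int user_forced_pml)
>>>>>>>>   {
>>>>>>>>       if (user_forced_pml) {
>>>>>>>>           /* -mca pml xyz was given: every proc picked the same module */
>>>>>>>>           return 0;
>>>>>>>>       }
>>>>>>>>       if (0 == my_rank) {
>>>>>>>>           /* only rank 0 publishes its selection */
>>>>>>>>           strncpy(rank0_pml, my_pml, sizeof(rank0_pml) - 1);
>>>>>>>>           return 0;
>>>>>>>>       }
>>>>>>>>       /* everyone else just compares against rank 0's selection */
>>>>>>>>       return (0 == strcmp(my_pml, rank0_pml)) ? 0 : -1;
>>>>>>>>   }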
>>>>>>>>
>>>>>>>> Looking forward from there, we can look to a day where failing to
>>>>>>>> initialize something that exists on the system could be detected in
>>>>>>>> some other fashion, letting the local proc abort since it would know
>>>>>>>> that other procs that detected similar capabilities may well have
>>>>>>>> selected that PML. For now, though, this would solve the problem.
>>>>>>>>
>>>>>>>> Make sense?
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 6/23/08 1:31 PM, "Brian W. Barrett" <brbarret_at_[hidden]> wrote:
>>>>>>>>
>>>>>>>>> The problem is that we default to OB1, but that's not the right
>>>>>>>>> choice for some platforms (like PathScale / PSM), where there's a
>>>>>>>>> huge performance hit for using OB1. So we run into a situation where
>>>>>>>>> a user installs Open MPI, starts running, gets horrible performance,
>>>>>>>>> bad-mouths Open MPI, and now we're in that game again. Yeah, the sys
>>>>>>>>> admin should know what to do, but it doesn't always work that way.
>>>>>>>>>
>>>>>>>>> Brian
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>>>>>>>
>>>>>>>>>> My fault - I should be more precise in my language. ;-/
>>>>>>>>>>
>>>>>>>>>> #1 is not adequate, IMHO, as it forces us to -always- do a modex. It
>>>>>>>>>> seems to me that a simpler solution to what you describe is for the
>>>>>>>>>> user to specify -mca pml ob1, or -mca pml cm. If the latter, then
>>>>>>>>>> you could deal with the failed-to-initialize problem cleanly by
>>>>>>>>>> having the proc directly abort.
>>>>>>>>>>
>>>>>>>>>> Again, sometimes I think we attempt to automate too many things.
>>>>>>>>>> This seems like a pretty clear case where you know what you want -
>>>>>>>>>> the sys admin, if nobody else, can certainly set that mca param in
>>>>>>>>>> the default param file!
>>>>>>>>>>
>>>>>>>>>> Otherwise, it seems to me that you are relying on the modex to
>>>>>>>>>> detect that your proc failed to init the correct subsystem. I hate
>>>>>>>>>> to force a modex just for that - if so, then perhaps this could
>>>>>>>>>> again be a settable option to avoid requiring non-scalable behavior
>>>>>>>>>> for those of us who want scalability?
>>>>>>>>>>
>>>>>>>>>> On 6/23/08 1:21 PM, "Brian W. Barrett" <brbarret_at_[hidden]> wrote:
>>>>>>>>>>
>>>>>>>>>>> The selection code was added because high speed interconnects
>>>>>>>>>>> frequently fail to initialize properly due to random stuff
>>>>>>>>>>> happening (yes, that's a horrible statement, but true). We ran into
>>>>>>>>>>> a situation with some really flaky machines where most of the
>>>>>>>>>>> processes would choose CM, but a couple would fail to initialize
>>>>>>>>>>> the MTL and therefore choose OB1. This led to a hang situation,
>>>>>>>>>>> which is the worst of the worst.
>>>>>>>>>>>
>>>>>>>>>>> I think #1 is adequate, although it doesn't handle spawn
>>>>>>>>>>> particularly well. And spawn is generally used in environments
>>>>>>>>>>> where such network mismatches are most likely to occur.
>>>>>>>>>>>
>>>>>>>>>>> Brian
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Since my goal is to eliminate the modex completely for managed
>>>>>>>>>>>> installations, could you give me a brief understanding of this
>>>>>>>>>>>> eventual PML selection logic? It would help to hear an example of
>>>>>>>>>>>> how and why different procs could get different answers - and why
>>>>>>>>>>>> we would want to allow them to do so.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Ralph
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 6/23/08 11:59 AM, "Aurélien Bouteiller" <bouteill_at_[hidden]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> The first approach sounds fair enough to me. We should avoid 2
>>>>>>>>>>>>> and 3, as the PML selection mechanism used to be more complex
>>>>>>>>>>>>> before we reduced it to accommodate a major design bug in the BTL
>>>>>>>>>>>>> selection process. When using the complete PML selection, the
>>>>>>>>>>>>> BTLs would be initialized several times, leading to a variety of
>>>>>>>>>>>>> bugs. Eventually the PML selection should return to its old self,
>>>>>>>>>>>>> when the BTL bug gets fixed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Aurelien
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Jun 23, 2008, at 12:36 PM, Ralph H Castain wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yo all
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've been doing further research into the modex and came across
>>>>>>>>>>>>>> something I don't fully understand. It seems we have each
>>>>>>>>>>>>>> process insert into the modex the name of the PML module that it
>>>>>>>>>>>>>> selected. Once the modex has exchanged that info, it then loops
>>>>>>>>>>>>>> across all procs in the job to check their selection, and aborts
>>>>>>>>>>>>>> if any proc picked a different PML module.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> All well and good...assuming that procs actually -can- choose
>>>>>>>>>>>>>> different PML modules and hence create an "abort" scenario.
>>>>>>>>>>>>>> However, if I look inside the PMLs at their selection logic, I
>>>>>>>>>>>>>> find that a proc can ONLY pick a module other than ob1 if:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. the user specifies the module to use via -mca pml xyz or by
>>>>>>>>>>>>>> using a module-specific mca param to adjust its priority. In
>>>>>>>>>>>>>> this case, since the mca param is propagated, ALL procs have no
>>>>>>>>>>>>>> choice but to pick that same module, so that can't cause us to
>>>>>>>>>>>>>> abort (we will have already returned an error and aborted if the
>>>>>>>>>>>>>> specified module can't run).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2. the pml/cm module detects that an MTL module was selected,
>>>>>>>>>>>>>> and that it is other than "psm". In this case, the CM module
>>>>>>>>>>>>>> will be selected because its default priority is higher than
>>>>>>>>>>>>>> that of OB1.
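>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (As an aside, for anyone who wants to see or force this by hand
>>>>>>>>>>>>>> -- these are standard commands; the process count and executable
>>>>>>>>>>>>>> are just placeholders:)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   # list the pml components and their MCA params (incl. priorities)
>>>>>>>>>>>>>>   ompi_info --param pml all
>>>>>>>>>>>>>>   # force a particular PML for a single run (case 1 above)
>>>>>>>>>>>>>>   mpirun -np 4 -mca pml cm ./a.out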
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In looking deeper into the MTL selection logic, it appears to
>>>>>>>>>>>>>> me that you either have the required capability or you don't. I
>>>>>>>>>>>>>> can see that in some environments (e.g., rsh across unmanaged
>>>>>>>>>>>>>> collections of machines), it might be possible for someone to
>>>>>>>>>>>>>> launch across a set of machines where some do and some don't
>>>>>>>>>>>>>> have the required support. However, in all other cases, this
>>>>>>>>>>>>>> will be homogeneous across the system.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Given this analysis (and someone more familiar with the PML
>>>>>>>>>>>>>> should feel free to confirm or correct it), it seems to me that
>>>>>>>>>>>>>> this could be streamlined via one or more means:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. at the most, we could have rank=0 add the PML module name to
>>>>>>>>>>>>>> the modex, and other procs simply check it against their own and
>>>>>>>>>>>>>> return an error if they differ. This accomplishes the identical
>>>>>>>>>>>>>> functionality to what we have today, but with much less info in
>>>>>>>>>>>>>> the modex.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2. we could eliminate this info from the modex altogether by
>>>>>>>>>>>>>> requiring the user to specify the PML module if they want
>>>>>>>>>>>>>> something other than the default OB1. In this case, there can be
>>>>>>>>>>>>>> no confusion over what each proc is to use. The CM module will
>>>>>>>>>>>>>> attempt to init the MTL - if it cannot do so, then the job will
>>>>>>>>>>>>>> return the correct error and tell the user that CM/MTL support
>>>>>>>>>>>>>> is unavailable.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 3. we could again eliminate the info by not inserting it into
>>>>>>>>>>>>>> the modex if (a) the default PML module is selected, or (b) the
>>>>>>>>>>>>>> user specified the PML module to be used. In the first case,
>>>>>>>>>>>>>> each proc can simply check to see if they picked the default -
>>>>>>>>>>>>>> if not, then we can insert the info to indicate the difference.
>>>>>>>>>>>>>> Thus, in the "standard" case, no info will be inserted.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In the second case, we will already get an error if the
>>>>>>>>>>>>>> specified PML module could not be used. Hence, the modex check
>>>>>>>>>>>>>> provides no additional info or value.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I understand the motivation to support automation. However, in
>>>>>>>>>>>>>> this case, the automation actually doesn't seem to buy us very
>>>>>>>>>>>>>> much, and it isn't coming "free". So perhaps some change in how
>>>>>>>>>>>>>> this is done would be in order?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
> --
> Jeff Squyres
> Cisco Systems
>
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>