Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] PML selection logic
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-06-28 08:55:37


Agreed. I have a few ideas in this direction as well (random thoughts
that might as well be transcribed somewhere):

- some kind of configure --enable-large-system (whatever) option is a
Good Thing

- it would be good if the configure option simply set [MCA parameter?]
defaults wherever possible (vs. #if-selecting code). I think one of
the biggest lessons learned from Open MPI is that everyone's setup is
different -- having the ability to mix and match various run-time
options, while not widely used, is absolutely critical in some
scenarios. So it might be good if --enable-large-system sets a bunch
of default parameters that some sysadmins may still want/need to
override.

- decision to run the modex: I haven't seen all of Ralph's work in
this area, but I wonder if it's similar to the MPI handle parameter
checks: it could be a multi-value MCA parameter, such as: "never",
"always", "when-ompi-determines-its-necessary", etc., where the last
value can use multiple criteria to know if it's necessary to do a
modex (e.g., job size, when spawn occurs, whether the "pml" [or other
critical] MCA param[s] were specified, ...etc.).
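
The multi-value parameter idea could look roughly like the sketch below. It is in Python for brevity (Open MPI itself is C). The names `ModexPolicy` and `should_run_modex`, the "auto" spelling, and the 512-proc threshold (borrowed from Brian's estimate further down the thread) are assumptions for illustration, not existing MCA parameters.

```python
from enum import Enum


class ModexPolicy(Enum):
    """Hypothetical values for a multi-value 'run the modex?' parameter."""
    NEVER = "never"
    ALWAYS = "always"
    AUTO = "auto"  # stand-in for "when-ompi-determines-its-necessary"


def should_run_modex(policy, job_size, is_spawn, pml_specified,
                     size_threshold=512):
    """Decide whether to run the modex under the given policy.

    size_threshold is an assumed cutoff below which the modex is
    considered cheap enough to always run.
    """
    if policy is ModexPolicy.NEVER:
        return False
    if policy is ModexPolicy.ALWAYS:
        return True
    # AUTO: combine several criteria.
    if job_size <= size_threshold:
        return True   # small job: the modex does not hurt
    if is_spawn:
        return True   # dynamic processes may not share settings
    # Large static job: only needed if nothing pins the PML choice.
    return not pml_specified
```

With this shape, a sysadmin could still override whatever default a configure option such as --enable-large-system picked.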

On Jun 26, 2008, at 9:26 AM, Ralph H Castain wrote:

> Just to complete this thread...
>
> Brian raised a very good point, so we identified it on the weekly
> telecon as a subject that really should be discussed at next week's
> technical meeting. I think we can find a reasonable answer, but there
> are several ways it can be done. So rather than doing our usual
> piecemeal approach to the solution, it makes sense to begin talking
> about a more holistic design for accommodating both needs.
>
> Thanks Brian for pointing out the bigger picture.
> Ralph
>
>
>
> On 6/24/08 8:22 AM, "Brian W. Barrett" <brbarret_at_[hidden]> wrote:
>
>> yeah, that could be a problem, but it's such a minority case and we've
>> got to draw the line somewhere.
>>
>> Of course, it seems like this is a never ending battle between two
>> opposing forces... The desire to do the "right thing" all the time at
>> small and medium scale and the desire to scale out to the "big thing".
>> It seems like in the quest to kill off the modex, we've run into these
>> pretty often.
>>
>> The modex doesn't hurt us at small scale (indeed, we're probably ok
>> with the routed communication pattern up to 512 nodes or so if we
>> don't do anything stupid, maybe further). Is it time to admit defeat
>> in this argument and have a configure option that turns off the modex
>> (at the cost of some of these correctness checks) for the large
>> machines, but keeps things simple for the common case? I'm sure there
>> are other things where this will come up, so perhaps a
>> --enable-large-scale? Maybe it's a dumb idea, but it seems like we've
>> made a lot of compromises lately around this, where no one ends up
>> really happy with the solution :/.
>>
>> Brian
>>
>>
>> On Tue, 24 Jun 2008, George Bosilca wrote:
>>
>>> Brian hinted at a possible bug in one of his replies. How does this
>>> work in the case of dynamic processes? We can envision several
>>> scenarios, but let's take a simple one: 2 jobs that get connected with
>>> connect/accept. One might publish the PML name (simply because the
>>> -mca argument was on) and one might not?
>>>
>>> george.
>>>
>>> On Jun 24, 2008, at 8:28 AM, Jeff Squyres wrote:
>>>
>>>> Also sounds good to me.
>>>>
>>>> Note that the most difficult part of the forward-looking plan is that
>>>> we usually can't tell the difference between "something failed to
>>>> initialize" and "you don't have support for feature X".
>>>>
>>>> I like the general philosophy of: running out of the box always works
>>>> just fine, but if you/the sysadmin is smart, you can get performance
>>>> improvements.
>>>>
>>>>
>>>> On Jun 23, 2008, at 4:18 PM, Shipman, Galen M. wrote:
>>>>
>>>>> I concur
>>>>> - galen
>>>>>
>>>>> On Jun 23, 2008, at 3:44 PM, Brian W. Barrett wrote:
>>>>>
>>>>>> That sounds like a reasonable plan to me.
>>>>>>
>>>>>> Brian
>>>>>>
>>>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>>>>
>>>>>>> Okay, so let's explore an alternative that preserves the support you
>>>>>>> are seeking for the "ignorant user", but doesn't penalize everyone
>>>>>>> else. What we could do is simply set things up so that:
>>>>>>>
>>>>>>> 1. if -mca pml xyz is provided, then no modex data is added
>>>>>>>
>>>>>>> 2. if it is not provided, then only rank=0 inserts the data. All
>>>>>>> other procs simply check their own selection against the one given
>>>>>>> by rank=0
>>>>>>> Now, if a knowledgeable user or sys admin specifies what to
>>>>>>> use for
>>>>>>> their
>>>>>>> system, we won't penalize their startup time. A user who
>>>>>>> doesn't know
>>>>>>> what
>>>>>>> to do gets to run, albeit less scalably on startup.
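
The two rules in this proposal can be sketched as follows. This is an illustration in Python, not Open MPI code; `modex_entry` and `check_selection` are made-up names, and the real modex carries more than a single string.

```python
def modex_entry(rank, selected_pml, user_specified):
    """Return what this proc would add to the modex, or None for nothing.

    Rule 1: if the user named the PML (-mca pml xyz), nobody publishes.
    Rule 2: otherwise only rank 0 publishes its selection.
    """
    if user_specified:
        return None
    return selected_pml if rank == 0 else None


def check_selection(selected_pml, rank0_entry):
    """Return True if this proc's PML agrees with rank 0's published one.

    When rank 0 published nothing (user-specified case) there is
    nothing to compare against, so the check passes trivially.
    """
    return rank0_entry is None or rank0_entry == selected_pml
```

Under these assumptions the modex holds at most one PML name per job, however many procs there are.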
>>>>>>>
>>>>>>> Looking forward from there, we can look to a day where failing to
>>>>>>> initialize something that exists on the system could be detected in
>>>>>>> some other fashion, letting the local proc abort since it would know
>>>>>>> that other procs that detected similar capabilities may well have
>>>>>>> selected that PML. For now, though, this would solve the problem.
>>>>>>>
>>>>>>> Make sense?
>>>>>>> Ralph
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 6/23/08 1:31 PM, "Brian W. Barrett" <brbarret_at_[hidden]> wrote:
>>>>>>>
>>>>>>>> The problem is that we default to OB1, but that's not the right
>>>>>>>> choice for some platforms (like Pathscale / PSM), where there's a
>>>>>>>> huge performance hit for using OB1. So we run into a situation where
>>>>>>>> user installs Open MPI, starts running, gets horrible performance,
>>>>>>>> bad mouths Open MPI, and now we're in that game again. Yeah, the sys
>>>>>>>> admin should know what to do, but it doesn't always work that way.
>>>>>>>>
>>>>>>>> Brian
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>>>>>>
>>>>>>>>> My fault - I should be more precise in my language. ;-/
>>>>>>>>>
>>>>>>>>> #1 is not adequate, IMHO, as it forces us to -always- do a modex. It
>>>>>>>>> seems to me that a simpler solution to what you describe is for the
>>>>>>>>> user to specify -mca pml ob1, or -mca pml cm. If the latter, then
>>>>>>>>> you could deal with the failed-to-initialize problem cleanly by
>>>>>>>>> having the proc directly abort.
>>>>>>>>>
>>>>>>>>> Again, sometimes I think we attempt to automate too many things.
>>>>>>>>> This seems like a pretty clear case where you know what you want -
>>>>>>>>> the sys admin, if nobody else, can certainly set that mca param in
>>>>>>>>> the default param file!
>>>>>>>>>
>>>>>>>>> Otherwise, it seems to me that you are relying on the modex to
>>>>>>>>> detect that your proc failed to init the correct subsystem. I hate
>>>>>>>>> to force a modex just for that - if so, then perhaps this could
>>>>>>>>> again be a settable option to avoid requiring non-scalable behavior
>>>>>>>>> for those of us who want scalability?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 6/23/08 1:21 PM, "Brian W. Barrett" <brbarret_at_open-mpi.org> wrote:
>>>>>>>>>
>>>>>>>>>> The selection code was added because frequently high speed
>>>>>>>>>> interconnects fail to initialize properly due to random stuff
>>>>>>>>>> happening (yes, that's a horrible statement, but true). We ran into
>>>>>>>>>> a situation with some really flaky machines where most of the
>>>>>>>>>> processes would choose CM, but a couple would fail to initialize
>>>>>>>>>> the MTL and therefore choose OB1. This led to a hang situation,
>>>>>>>>>> which is the worst of the worst.
>>>>>>>>>>
>>>>>>>>>> I think #1 is adequate, although it doesn't handle spawn
>>>>>>>>>> particularly well. And spawn is generally used in environments
>>>>>>>>>> where such network mismatches are most likely to occur.
>>>>>>>>>>
>>>>>>>>>> Brian
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>>>>>>>>
>>>>>>>>>>> Since my goal is to eliminate the modex completely for managed
>>>>>>>>>>> installations, could you give me a brief understanding of this
>>>>>>>>>>> eventual PML selection logic? It would help to hear an example of
>>>>>>>>>>> how and why different procs could get different answers - and why
>>>>>>>>>>> we would want to allow them to do so.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Ralph
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 6/23/08 11:59 AM, "Aurélien Bouteiller" <bouteill_at_[hidden]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The first approach sounds fair enough to me. We should avoid 2 and 3
>>>>>>>>>>>> as the pml selection mechanism used to be more complex before we
>>>>>>>>>>>> reduced it to accommodate a major design bug in the BTL selection
>>>>>>>>>>>> process. When using the complete PML selection, BTL would be
>>>>>>>>>>>> initialized several times, leading to a variety of bugs. Eventually
>>>>>>>>>>>> the PML selection should return to its old self, when the BTL bug
>>>>>>>>>>>> gets fixed.
>>>>>>>>>>>>
>>>>>>>>>>>> Aurelien
>>>>>>>>>>>>
>>>>>>>>>>>> On 23 June 08 at 12:36, Ralph H Castain wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Yo all
>>>>>>>>>>>>>
>>>>>>>>>>>>> I've been doing further research into the modex and came
>>>>>>>>>>>>> across
>>>>>>>>>>>>> something I
>>>>>>>>>>>>> don't fully understand. It seems we have each process
>>>>>>>>>>>>> insert into
>>>>>>>>>>>>> the modex
>>>>>>>>>>>>> the name of the PML module that it selected. Once the
>>>>>>>>>>>>> modex has
>>>>>>>>>>>>> exchanged
>>>>>>>>>>>>> that info, it then loops across all procs in the job to
>>>>>>>>>>>>> check
>>>>>>>>>>>>> their
>>>>>>>>>>>>> selection, and aborts if any proc picked a different PML
>>>>>>>>>>>>> module.
>>>>>>>>>>>>>
>>>>>>>>>>>>> All well and good...assuming that procs actually -can-
>>>>>>>>>>>>> choose
>>>>>>>>>>>>> different PML
>>>>>>>>>>>>> modules and hence create an "abort" scenario. However,
>>>>>>>>>>>>> if I look
>>>>>>>>>>>>> inside the
>>>>>>>>>>>>> PMLs at their selection logic, I find that a proc can
>>>>>>>>>>>>> ONLY pick a
>>>>>>>>>>>>> module
>>>>>>>>>>>>> other than ob1 if:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. the user specifies the module to use via -mca pml xyz
>>>>>>>>>>>>> or by
>>>>>>>>>>>>> using a
>>>>>>>>>>>>> module specific mca param to adjust its priority. In
>>>>>>>>>>>>> this case,
>>>>>>>>>>>>> since the
>>>>>>>>>>>>> mca param is propagated, ALL procs have no choice but to
>>>>>>>>>>>>> pick that
>>>>>>>>>>>>> same
>>>>>>>>>>>>> module, so that can't cause us to abort (we will have
>>>>>>>>>>>>> already
>>>>>>>>>>>>> returned an
>>>>>>>>>>>>> error and aborted if the specified module can't run).
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2. the pml/cm module detects that an MTL module was
>>>>>>>>>>>>> selected, and
>>>>>>>>>>>>> that it is
>>>>>>>>>>>>> other than "psm". In this case, the CM module will be
>>>>>>>>>>>>> selected
>>>>>>>>>>>>> because its
>>>>>>>>>>>>> default priority is higher than that of OB1.
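
The two selection paths above can be modeled in a few lines. This is a Python sketch under assumptions: `select_pml` is a made-up name, the priority numbers in the usage note are invented, and the real MCA selection logic has more to it. It does show how a proc whose MTL fails to initialize would quietly fall back to OB1.

```python
def select_pml(candidates, user_choice=None):
    """Pick a PML name from (name, priority, usable) tuples.

    If the user named a module, it is the only acceptable answer and
    an unusable module is an error. Otherwise the usable module with
    the highest priority wins.
    """
    if user_choice is not None:
        for name, _priority, usable in candidates:
            if name == user_choice:
                if not usable:
                    raise RuntimeError(f"requested PML {name!r} cannot run")
                return name
        raise RuntimeError(f"unknown PML {user_choice!r}")
    usable = [(priority, name) for name, priority, ok in candidates if ok]
    if not usable:
        raise RuntimeError("no usable PML module")
    return max(usable)[1]
```

For example, with invented priorities `[("ob1", 20, True), ("cm", 30, True)]` the function returns "cm"; mark "cm" unusable (MTL failed to init) and the same call returns "ob1", which is how two procs in one job could disagree.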
>>>>>>>>>>>>>
>>>>>>>>>>>>> In looking deeper into the MTL selection logic, it
>>>>>>>>>>>>> appears to me
>>>>>>>>>>>>> that you
>>>>>>>>>>>>> either have the required capability or you don't. I can
>>>>>>>>>>>>> see that
>>>>>>>>>>>>> in
>>>>>>>>>>>>> some
>>>>>>>>>>>>> environments (e.g., rsh across unmanaged collections of
>>>>>>>>>>>>> machines),
>>>>>>>>>>>>> it might
>>>>>>>>>>>>> be possible for someone to launch across a set of
>>>>>>>>>>>>> machines where
>>>>>>>>>>>>> some do and
>>>>>>>>>>>>> some don't have the required support. However, in all
>>>>>>>>>>>>> other cases,
>>>>>>>>>>>>> this will
>>>>>>>>>>>>> be homogeneous across the system.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Given this analysis (and someone more familiar with the
>>>>>>>>>>>>> PML should
>>>>>>>>>>>>> feel free
>>>>>>>>>>>>> to confirm or correct it), it seems to me that this
>>>>>>>>>>>>> could be
>>>>>>>>>>>>> streamlined via
>>>>>>>>>>>>> one or more means:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. at the most, we could have rank=0 add the PML module
>>>>>>>>>>>>> name to
>>>>>>>>>>>>> the
>>>>>>>>>>>>> modex,
>>>>>>>>>>>>> and other procs simply check it against their own and
>>>>>>>>>>>>> return an
>>>>>>>>>>>>> error if
>>>>>>>>>>>>> they differ. This accomplishes the identical
>>>>>>>>>>>>> functionality to what
>>>>>>>>>>>>> we have
>>>>>>>>>>>>> today, but with much less info in the modex.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2. we could eliminate this info from the modex
>>>>>>>>>>>>> altogether by
>>>>>>>>>>>>> requiring the
>>>>>>>>>>>>> user to specify the PML module if they want something
>>>>>>>>>>>>> other than
>>>>>>>>>>>>> the
>>>>>>>>>>>>> default
>>>>>>>>>>>>> OB1. In this case, there can be no confusion over what
>>>>>>>>>>>>> each proc
>>>>>>>>>>>>> is
>>>>>>>>>>>>> to use.
>>>>>>>>>>>>> The CM module will attempt to init the MTL - if it
>>>>>>>>>>>>> cannot do so,
>>>>>>>>>>>>> then the
>>>>>>>>>>>>> job will return the correct error and tell the user that
>>>>>>>>>>>>> CM/MTL
>>>>>>>>>>>>> support is
>>>>>>>>>>>>> unavailable.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 3. we could again eliminate the info by not inserting it
>>>>>>>>>>>>> into the
>>>>>>>>>>>>> modex if
>>>>>>>>>>>>> (a) the default PML module is selected, or (b) the user
>>>>>>>>>>>>> specified
>>>>>>>>>>>>> the PML
>>>>>>>>>>>>> module to be used. In the first case, each proc can
>>>>>>>>>>>>> simply check
>>>>>>>>>>>>> to
>>>>>>>>>>>>> see if
>>>>>>>>>>>>> they picked the default - if not, then we can insert the
>>>>>>>>>>>>> info to
>>>>>>>>>>>>> indicate
>>>>>>>>>>>>> the difference. Thus, in the "standard" case, no info
>>>>>>>>>>>>> will be
>>>>>>>>>>>>> inserted.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In the second case, we will already get an error if the
>>>>>>>>>>>>> specified
>>>>>>>>>>>>> PML module
>>>>>>>>>>>>> could not be used. Hence, the modex check provides no
>>>>>>>>>>>>> additional
>>>>>>>>>>>>> info or
>>>>>>>>>>>>> value.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I understand the motivation to support automation.
>>>>>>>>>>>>> However, in
>>>>>>>>>>>>> this
>>>>>>>>>>>>> case,
>>>>>>>>>>>>> the automation actually doesn't seem to buy us very
>>>>>>>>>>>>> much, and it
>>>>>>>>>>>>> isn't
>>>>>>>>>>>>> coming "free". So perhaps some change in how this is
>>>>>>>>>>>>> done would be
>>>>>>>>>>>>> in order?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> devel mailing list
>>>>>>>>>>>>> devel_at_[hidden]
>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> --
>>>> Jeff Squyres
>>>> Cisco Systems
>>>>

-- 
Jeff Squyres
Cisco Systems