Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] PML selection logic
From: Brian W. Barrett (brbarret_at_[hidden])
Date: 2008-06-23 15:44:32


That sounds like a reasonable plan to me.

Brian

On Mon, 23 Jun 2008, Ralph H Castain wrote:

> Okay, so let's explore an alternative that preserves the support you are
> seeking for the "ignorant user", but doesn't penalize everyone else. What we
> could do is simply set things up so that:
>
> 1. if -mca pml xyz is provided, then no modex data is added
>
> 2. if it is not provided, then only rank=0 inserts the data. All other procs
> simply check their own selection against the one given by rank=0
>
> Now, if a knowledgeable user or sys admin specifies what to use for their
> system, we won't penalize their startup time. A user who doesn't know what
> to do gets to run, albeit less scalably on startup.
>
> Looking forward from there, we can look to a day where failing to initialize
> something that exists on the system could be detected in some other fashion,
> letting the local proc abort since it would know that other procs that
> detected similar capabilities may well have selected that PML. For now,
> though, this would solve the problem.
>
> Make sense?
> Ralph
>
>
>
> On 6/23/08 1:31 PM, "Brian W. Barrett" <brbarret_at_[hidden]> wrote:
>
>> The problem is that we default to OB1, but that's not the right choice for
>> some platforms (like Pathscale / PSM), where there's a huge performance
>> hit for using OB1. So we run into a situation where a user installs Open
>> MPI, starts running, gets horrible performance, bad-mouths Open MPI, and
>> now we're in that game again. Yeah, the sys admin should know what to do,
>> but it doesn't always work that way.
>>
>> Brian
>>
>>
>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>
>>> My fault - I should be more precise in my language. ;-/
>>>
>>> #1 is not adequate, IMHO, as it forces us to -always- do a modex. It seems
>>> to me that a simpler solution to what you describe is for the user to
>>> specify -mca pml ob1, or -mca pml cm. If the latter, then you could deal
>>> with the failed-to-initialize problem cleanly by having the proc directly
>>> abort.
>>>
>>> Again, sometimes I think we attempt to automate too many things. This seems
>>> like a pretty clear case where you know what you want - the sys admin, if
>>> nobody else, can certainly set that mca param in the default param file!
>>>
>>> Otherwise, it seems to me that you are relying on the modex to detect that
>>> your proc failed to init the correct subsystem. I hate to force a modex just
>>> for that - if so, then perhaps this could again be a settable option to
>>> avoid requiring non-scalable behavior for those of us who want scalability?
>>>
>>>
>>> On 6/23/08 1:21 PM, "Brian W. Barrett" <brbarret_at_[hidden]> wrote:
>>>
>>>> The selection code was added because frequently high speed interconnects
>>>> fail to initialize properly due to random stuff happening (yes, that's a
>>>> horrible statement, but true). We ran into a situation with some really
>>>> flaky machines where most of the processes would choose CM, but a couple
>>>> would fail to initialize the MTL and therefore choose OB1. This led to a
>>>> hang situation, which is the worst of the worst.
>>>>
>>>> I think #1 is adequate, although it doesn't handle spawn particularly
>>>> well. And spawn is generally used in environments where such network
>>>> mismatches are most likely to occur.
>>>>
>>>> Brian
>>>>
>>>>
>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>>
>>>>> Since my goal is to eliminate the modex completely for managed
>>>>> installations, could you give me a brief understanding of this eventual PML
>>>>> selection logic? It would help to hear an example of how and why different
>>>>> procs could get different answers - and why we would want to allow them to
>>>>> do so.
>>>>>
>>>>> Thanks
>>>>> Ralph
>>>>>
>>>>>
>>>>>
>>>>> On 6/23/08 11:59 AM, "Aurélien Bouteiller" <bouteill_at_[hidden]> wrote:
>>>>>
>>>>>> The first approach sounds fair enough to me. We should avoid 2 and 3,
>>>>>> as the PML selection mechanism used to be more complex before we
>>>>>> reduced it to accommodate a major design bug in the BTL selection
>>>>>> process. When using the complete PML selection, BTL would be
>>>>>> initialized several times, leading to a variety of bugs. Eventually
>>>>>> the PML selection should return to its old self, when the BTL bug
>>>>>> gets fixed.
>>>>>>
>>>>>> Aurelien
>>>>>>
>>>>>> On 23 June 08, at 12:36, Ralph H Castain wrote:
>>>>>>
>>>>>>> Yo all
>>>>>>>
>>>>>>> I've been doing further research into the modex and came across
>>>>>>> something I don't fully understand. It seems we have each process
>>>>>>> insert into the modex the name of the PML module that it selected.
>>>>>>> Once the modex has exchanged that info, it then loops across all
>>>>>>> procs in the job to check their selection, and aborts if any proc
>>>>>>> picked a different PML module.
>>>>>>>
>>>>>>> All well and good...assuming that procs actually -can- choose
>>>>>>> different PML modules and hence create an "abort" scenario. However,
>>>>>>> if I look inside the PMLs at their selection logic, I find that a
>>>>>>> proc can ONLY pick a module other than ob1 if:
>>>>>>>
>>>>>>> 1. the user specifies the module to use via -mca pml xyz or by using
>>>>>>> a module specific mca param to adjust its priority. In this case,
>>>>>>> since the mca param is propagated, ALL procs have no choice but to
>>>>>>> pick that same module, so that can't cause us to abort (we will have
>>>>>>> already returned an error and aborted if the specified module can't
>>>>>>> run).
>>>>>>>
>>>>>>> 2. the pml/cm module detects that an MTL module was selected, and
>>>>>>> that it is other than "psm". In this case, the CM module will be
>>>>>>> selected because its default priority is higher than that of OB1.
>>>>>>>
>>>>>>> In looking deeper into the MTL selection logic, it appears to me
>>>>>>> that you either have the required capability or you don't. I can see
>>>>>>> that in some environments (e.g., rsh across unmanaged collections of
>>>>>>> machines), it might be possible for someone to launch across a set
>>>>>>> of machines where some do and some don't have the required support.
>>>>>>> However, in all other cases, this will be homogeneous across the
>>>>>>> system.
>>>>>>>
>>>>>>> Given this analysis (and someone more familiar with the PML should
>>>>>>> feel free to confirm or correct it), it seems to me that this could
>>>>>>> be streamlined via one or more means:
>>>>>>>
>>>>>>> 1. at the most, we could have rank=0 add the PML module name to the
>>>>>>> modex, and other procs simply check it against their own and return
>>>>>>> an error if they differ. This accomplishes the identical
>>>>>>> functionality to what we have today, but with much less info in the
>>>>>>> modex.
>>>>>>>
>>>>>>> 2. we could eliminate this info from the modex altogether by
>>>>>>> requiring the user to specify the PML module if they want something
>>>>>>> other than the default OB1. In this case, there can be no confusion
>>>>>>> over what each proc is to use. The CM module will attempt to init
>>>>>>> the MTL - if it cannot do so, then the job will return the correct
>>>>>>> error and tell the user that CM/MTL support is unavailable.
>>>>>>>
>>>>>>> 3. we could again eliminate the info by not inserting it into the
>>>>>>> modex if (a) the default PML module is selected, or (b) the user
>>>>>>> specified the PML module to be used. In the first case, each proc
>>>>>>> can simply check to see if they picked the default - if not, then we
>>>>>>> can insert the info to indicate the difference. Thus, in the
>>>>>>> "standard" case, no info will be inserted.
>>>>>>>
>>>>>>> In the second case, we will already get an error if the specified
>>>>>>> PML module could not be used. Hence, the modex check provides no
>>>>>>> additional info or value.
>>>>>>>
>>>>>>> I understand the motivation to support automation. However, in this
>>>>>>> case, the automation actually doesn't seem to buy us very much, and
>>>>>>> it isn't coming "free". So perhaps some change in how this is done
>>>>>>> would be in order?
>>>>>>>
>>>>>>> Ralph
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> devel mailing list
>>>>>>> devel_at_[hidden]
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
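
The plan agreed at the top of this thread (no modex data when the PML is given via -mca pml, otherwise only rank=0 inserts its selection and every other proc compares against it) can be sketched roughly as below. This is a toy Python model for illustration only, not Open MPI code: the function names, the dict standing in for the modex, and the return values are all assumptions made up for this sketch.

```python
from typing import Dict, List, Optional, Tuple


def modex_entry(rank: int, selected_pml: str,
                user_specified_pml: Optional[str]) -> Optional[str]:
    """Return the PML name this proc would publish, or None for nothing.

    Hypothetical helper: models the two rules of the plan, not a real API.
    """
    if user_specified_pml is not None:
        # Rule 1: the user (or sys admin) named the PML, so nobody publishes.
        return None
    # Rule 2: only rank 0 publishes its selection.
    return selected_pml if rank == 0 else None


def selection_ok(selected_pml: str, user_specified_pml: Optional[str],
                 rank0_entry: Optional[str]) -> bool:
    """Check a proc's own selection against what rank 0 published."""
    if user_specified_pml is not None:
        # Nothing to compare: a proc that could not run the requested
        # module is assumed to have already returned an error.
        return True
    return selected_pml == rank0_entry


def run_check(selections: Dict[int, str],
              user_specified_pml: Optional[str] = None
              ) -> Tuple[int, List[int]]:
    """Return (number of modex entries, ranks that would abort)."""
    published = {}
    for rank, pml in selections.items():
        entry = modex_entry(rank, pml, user_specified_pml)
        if entry is not None:
            published[rank] = entry
    rank0_entry = published.get(0)
    mismatched = [rank for rank, pml in sorted(selections.items())
                  if not selection_ok(pml, user_specified_pml, rank0_entry)]
    return len(published), mismatched


if __name__ == "__main__":
    # Rank 2 failed to initialize the MTL and fell back to ob1.
    print(run_check({0: "cm", 1: "cm", 2: "ob1"}))
    # The PML was specified explicitly: no modex entries at all.
    print(run_check({0: "cm", 1: "cm"}, user_specified_pml="cm"))
```

In this model the default case costs a single modex entry regardless of job size, and the explicit case costs none, which is the trade-off described in the plan: the user who specifies nothing still gets the mismatch detected, while a configured system avoids the exchange.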