Open MPI Development Mailing List Archives

From: Ralph H Castain (rhc_at_[hidden])
Date: 2007-07-10 13:26:31


I think that is quite accurate and would be helpful in resolving the
problem...

On 7/10/07 10:32 AM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:

> Point taken.
>
> Is this an accurate summary?
>
> 1. "Best practices" should be documented, to include sysadmins
> specifically itemizing what components should be used on their
> systems (e.g., in an environment variable or the system-wide MCA
> parameters file).
>
> 2. It may be useful to have some high-level parameters to specify a
> specific run-time environment, since ORTE has multiple, related
> frameworks (e.g., RAS and PLS). E.g., "orte_base_launcher=tm", or
> somesuch.
>
>
> On Jul 10, 2007, at 9:08 AM, Ralph H Castain wrote:
>
>> Actually, I was talking specifically about configuration at build time.
>> I realize there are trade-offs here, and suspect we can find a common
>> ground. The problem with using the options Jeff described is that they
>> require knowledge on the part of the builder as to what environments
>> have had their include files/libraries installed on the file system of
>> this particular machine. And unfortunately, not every component is
>> protected by these "sentinel" variables, nor does it appear possible to
>> do so in a "guaranteed safe" manner.
>>
>> Note that I didn't say "installed on their machine". In most cases,
>> these alternative environments are not currently installed at all - they
>> are stale files, or were placed on the file system by someone who wanted
>> to look at their documentation, or whatever. The problem is that Open
>> MPI blindly picks them up and attempts to use them, with sometimes
>> disastrous and frequently unpredictable results.
>>
>> Hence, the user can be "astonished" to find that an application that
>> worked perfectly yesterday suddenly segfaults today - because someone
>> decided one day, for example, to un-tar the bproc files in a public
>> place where we pick them up, and then someone else (perhaps a sys admin
>> or the user themselves) at some later time rebuilt Open MPI to bring in
>> an update.
>>
>> Now imagine being a software provider who gets the call about a problem
>> with Open MPI and has to figure out what the heck happened....
>>
>> My suggested solution may not be the best, which is why I put it out
>> there for discussion. One alternative might be for us to instruct sys
>> admins to put MCA params in their default param file that force
>> selection of the proper components for each framework. Thus, someone
>> with an lsf system would enter: pls=lsf ras=lsf sds=lsf in their config
>> file to ensure that only lsf was used.
>>
>> The negative to that approach is that we would have to warn everyone any
>> time that list changed (e.g., a new component for a new framework).
>> Another option to help that problem, of course, would be to set one mca
>> param (say something like "enviro=lsf") that we would use internal to
>> Open MPI to set the individual components correctly - i.e., we would
>> hold the list of relevant frameworks internally since (hopefully) we
>> know what they should be for a given environment.
>>
>> Anyway, I'm glad people are looking at this and suggesting solutions. It
>> is a problem that seems to be biting us recently and may become a bigger
>> issue as the user community grows.
>>
>> Ralph
>>
>>
>> On 7/10/07 6:12 AM, "Bogdan Costescu"
>> <Bogdan.Costescu_at_[hidden]> wrote:
>>
>>> On Tue, 10 Jul 2007, Jeff Squyres wrote:
>>>
>>>> Do either of these work for you?
>>>
>>> Will report back in a bit, I'm now in the middle of an OS upgrade on
>>> the cluster.
>>>
>>> But my question was more like: is this a configuration that should
>>> theoretically work? Or in other words, are there known dependencies
>>> on rsh that would make an rsh-less build not work or work with reduced
>>> functionality?
>>>
>>>> Most batch systems today set a sentinel environment variable that we
>>>> check for.
>>>
>>> I think we are talking about slightly different things - my impression
>>> was that the OP was asking about detection at config time, while your
>>> statements make perfect sense to me if they refer to detection at
>>> run-time. If the OP was indeed asking about run-time detection, then I
>>> apologize for the time you wasted on reading and replying to my
>>> questions...
>>>
>>>> That's what the compile-time vs. run-time detection and selection is
>>>> supposed to be for.
>>>
>>> Yes, I understand that, it's the same type of mechanism as in LAM/MPI,
>>> which is not that foreign to me ;-)
>>
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
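
The single high-level parameter floated in the thread ("enviro=lsf", or
"orte_base_launcher=tm") could work roughly as sketched below: one
environment name is expanded into a component selection for each related
framework. This is only an illustration of the idea, not Open MPI code. The
lookup table, the function names, and the "name = value" output layout are
all assumptions made for the example; only the lsf entry (pls, ras, sds)
comes from the discussion above, and the tm entry is a guess based on the
mention of RAS and PLS.

```python
# Hypothetical table mapping a run-time environment name to the component
# that each related framework should select. Only the "lsf" row is taken
# from the thread (pls=lsf ras=lsf sds=lsf); the "tm" row is an assumption.
ENVIRONMENT_COMPONENTS = {
    "lsf": {"pls": "lsf", "ras": "lsf", "sds": "lsf"},
    "tm": {"pls": "tm", "ras": "tm"},
}


def expand_environment(enviro):
    """Return the per-framework component selections for one environment."""
    try:
        # Copy the entry so callers cannot modify the shared table.
        return dict(ENVIRONMENT_COMPONENTS[enviro])
    except KeyError:
        raise ValueError(f"unknown run-time environment: {enviro!r}") from None


def render_param_lines(selections):
    """Render selections as 'name = value' lines, sorted by framework name."""
    return [
        f"{framework} = {component}"
        for framework, component in sorted(selections.items())
    ]


if __name__ == "__main__":
    # Print the individual settings implied by the one high-level name.
    for line in render_param_lines(expand_environment("lsf")):
        print(line)
```

Keeping the table in one place is the point of the proposal: when a new
framework is added, only the table changes, and sites that set the single
high-level name do not have to update their own parameter files.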