Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Multiword MCA parameter values broken
From: Ralph H Castain (rhc_at_[hidden])
Date: 2007-11-19 08:39:06


I'm not sure it is really necessary - the problem is solely within
opal_cmd_line_parse and (if someone can parse that code ;-)) is truly simple
to fix. The overly long cmd line issue is due to a bug that Josh was going
to look at (may already have done so while I was out of touch).

Ralph

On 11/9/07 5:10 AM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:

> Should there be another option for passing MCA parameters between
> processes, such as via stdin (or any file descriptor)? I.e., during
> the command line parsing to check for command line MCA params, perhaps
> a new argument could be introduced: -mcauri <uri>, where <uri> could
> be a few different forms:
>
> - file://stdin: (note the 2 //, not 3, so "stdin" would never conflict
> with a real file named /stdin) Read the parameters in off stdin.
>
> - rml://...rml contact info...: read in the MCA params via the RML
> (although I assume that reading via the RML would be *waaaay* too late
> during the MCA setup process -- I mentioned this option for
> completeness, even though I don't think it'll work)
>
> - ip://ipaddress:port: open a socket back and read the MCA params in
> over a socket. This could have some scalability issues...? But who
> knows; it could be tied into the hierarchical startup such that we
> wouldn't have to have an all-to-one connection scheme. Certainly it
> would cause scalability problems when paired with today's all-to-one
> RML connection scheme for the OOB.
>
> I'm not sure that the rml: and ip: schemes are worthwhile. Maybe a
> file://stdin kind of approach could work? Or perhaps some other kind
> of URI/IPC...? (I really haven't thought through the issues -- this
> is off the top of my head)
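
For illustration only, here is a minimal Python sketch of how the URI forms proposed above might be told apart. The -mcauri argument exists only as a proposal in this thread, the function name classify_mcauri is hypothetical, and none of this is Open MPI code. The rml: form is left out because the proposal itself doubts it would work.

```python
from urllib.parse import urlsplit


def classify_mcauri(uri):
    """Classify a hypothetical -mcauri argument (assumed name, not an OMPI API)."""
    parts = urlsplit(uri)
    if parts.scheme == "file":
        # With two slashes, "stdin" lands in the authority component, so it
        # cannot collide with a real file, which would show up in the path.
        if parts.netloc == "stdin" and parts.path == "":
            return ("stdin", None)
        return ("file", parts.path)
    if parts.scheme == "ip":
        # ip://ipaddress:port -> host and port for a socket connection.
        return ("socket", (parts.hostname, parts.port))
    raise ValueError("unsupported scheme: %r" % parts.scheme)


print(classify_mcauri("file://stdin"))
print(classify_mcauri("file:///stdin"))
print(classify_mcauri("ip://192.0.2.10:5000"))
```

The second call shows why the two-slash form matters: with three slashes the same name is an ordinary path, /stdin.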
>
>
>
> On Nov 8, 2007, at 2:36 PM, Ralph H Castain wrote:
>
>> Might I suggest:
>>
>> https://svn.open-mpi.org/trac/ompi/ticket/1073
>>
>> It deals with some of these issues and explains the boundaries of the
>> problem. As for what a string param can contain, I have no opinion. I only
>> note that it must handle special characters such as ';', '/', etc. that are
>> typically found in URIs. I cannot think of any reason it should have a
>> quote in it.
>>
>> Ralph
>>
>>
>>
>> On 11/8/07 12:25 PM, "Tim Prins" <tprins_at_[hidden]> wrote:
>>
>>> The alias option you presented does not work. I think we do some weird
>>> things to find the absolute path for ssh, instead of just issuing the
>>> command.
>>>
>>> I would spend some time fixing this, but I don't want to do it wrong. We
>>> could quote all the param values, and change the parser to remove the
>>> quotes, but this assumes that an mca param does not contain quotes.
>>>
>>> So I guess there are 2 questions that need to be answered before a fix
>>> is made:
>>>
>>> 1. What exactly can a string mca param contain? Can it have quotes,
>>> spaces, or anything else?
>>>
>>> 2. Which mca parameters should be forwarded? Should it be just the ones
>>> from the command line? From the environment? From config files?
>>>
>>> Tim
>>>
>>> Ralph Castain wrote:
>>>> What changed is that we never passed mca params to the orted before - they
>>>> always went to the app, but it's the orted that has the issue. There is a
>>>> bug ticket thread on this subject - I forget the number offhand.
>>>>
>>>> Basically, the problem was that we cannot generally pass the local
>>>> environment to the orteds when we launch them. However, people needed
>>>> various mca params to get to the orteds to control their behavior. The only
>>>> way to resolve that problem was to pass the params via the command line,
>>>> which is what was done.
>>>>
>>>> Except for a very few cases, all of our mca params are single values that do
>>>> not include spaces, so this is not a problem that is causing widespread
>>>> issues. As I said, I already had to deal with one special case that didn't
>>>> involve spaces, but did have special characters that required quoting, which
>>>> identified the larger problem of dealing with quoted strings.
>>>>
>>>> I have no objection to a more general fix. Like I said in my note, though,
>>>> the general fix will take a larger effort. If someone is willing to do so,
>>>> that is fine with me - I was only offering solutions that would fill the
>>>> interim time as I haven't heard anyone step up to say they would fix it
>>>> anytime soon.
>>>>
>>>> Please feel free to jump in and volunteer! ;-) I'm willing to put the quotes
>>>> around things if you will fix the mca cmd line parser to cleanly remove them
>>>> on the other end.
>>>>
>>>> Ralph
>>>>
>>>>
>>>>
>>>> On 11/7/07 5:50 PM, "Tim Prins" <tprins_at_[hidden]> wrote:
>>>>
>>>>> I'm curious what changed to make this a problem. How were we passing mca
>>>>> params from the base to the app before, and why did it change?
>>>>>
>>>>> I think that options 1 & 2 below are no good, since we, in general, allow
>>>>> string mca params to have spaces (as far as I understand it). So a more
>>>>> general approach is needed.
>>>>>
>>>>> Tim
>>>>>
>>>>> On Wednesday 07 November 2007 10:40:45 am Ralph H Castain wrote:
>>>>>> Sorry for delay - wasn't ignoring the issue.
>>>>>>
>>>>>> There are several fixes to this problem - ranging in order from least to
>>>>>> most work:
>>>>>>
>>>>>> 1. just alias "ssh" to be "ssh -Y" and run without setting the mca param.
>>>>>> It won't affect anything on the backend because the daemon/procs don't use
>>>>>> ssh.
>>>>>>
>>>>>> 2. include "pls_rsh_agent" in the array of mca params not to be passed to
>>>>>> the orted in orte/mca/pls/base/pls_base_general_support_fns.c, the
>>>>>> orte_pls_base_orted_append_basic_args function. This would fix the specific
>>>>>> problem cited here, but I admit that listing every such param by name would
>>>>>> get tedious.
>>>>>>
>>>>>> 3. we could easily detect that a "problem" character was in the mca param
>>>>>> value when we add it to the orted's argv, and then put "" around it. The
>>>>>> problem, however, is that the mca param parser on the far end doesn't
>>>>>> remove those "" from the resulting string. At least, I spent over a day
>>>>>> fighting with a problem only to discover that was happening. Could be an
>>>>>> error in the way I was doing things, or could be a real characteristic of
>>>>>> the parser. Anyway, we would have to ensure that the parser removes any
>>>>>> surrounding "" before passing along the param value or this won't work.
>>>>>>
>>>>>> Ralph
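
As a rough illustration of option 3 above, the two halves (wrapping on the sending side, unwrapping in the parser) can be sketched in Python. This is not the Open MPI parser: the function names and the PROBLEM_CHARS set are assumptions made up for this sketch, since the thread never defines which characters count as a "problem".

```python
# Hypothetical set of "problem" characters; the thread does not define one.
PROBLEM_CHARS = set(" \t;&|<>$'\"")


def wrap_value(value):
    """Sender side: put "" around a value containing a problem character."""
    if any(ch in PROBLEM_CHARS for ch in value):
        return '"' + value + '"'
    return value


def unwrap_value(value):
    """Receiver side: remove exactly one pair of surrounding "" if present."""
    if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        return value[1:-1]
    return value


# Round trip for a multiword value, a plain value, and a URI-like value.
for original in ["ssh -Y", "rsh", "0.0;tcp://129.79.240.100:40907"]:
    assert unwrap_value(wrap_value(original)) == original
```

In this model the quote character is itself treated as a problem character, so a value that already carries quotes is wrapped once more and still survives the round trip. Whether a real string mca param may contain quotes at all is the open question raised elsewhere in this thread.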
>>>>>>
>>>>>> On 11/5/07 12:10 PM, "Tim Prins" <tprins_at_[hidden]> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Commit 16364 broke things when using multiword mca param values. For
>>>>>>> instance:
>>>>>>>
>>>>>>> mpirun --debug-daemons -mca orte_debug 1 -mca pls rsh -mca pls_rsh_agent "ssh -Y" xterm
>>>>>>>
>>>>>>> Will crash and burn, because the value "ssh -Y" is being stored into the
>>>>>>> argv orted_cmd_line in orterun.c:1506. This is then added to the launch
>>>>>>> command for the orted:
>>>>>>>
>>>>>>> /usr/bin/ssh -Y odin004 PATH=/san/homedirs/tprins/usr/rsl/bin:$PATH ;
>>>>>>> export PATH ;
>>>>>>> LD_LIBRARY_PATH=/san/homedirs/tprins/usr/rsl/lib:$LD_LIBRARY_PATH ;
>>>>>>> export LD_LIBRARY_PATH ; /san/homedirs/tprins/usr/rsl/bin/orted --debug
>>>>>>> --debug-daemons --name 0.1 --num_procs 2 --vpid_start 0 --nodename
>>>>>>> odin004 --universe tprins_at_[hidden]:default-universe-27872
>>>>>>> --nsreplica
>>>>>>> "0.0;tcp://129.79.240.100:40907;tcp6://2001:18e8:2:240:2e0:81ff:fe2d:21a0:40908"
>>>>>>> --gprreplica
>>>>>>> "0.0;tcp://129.79.240.100:40907;tcp6://2001:18e8:2:240:2e0:81ff:fe2d:21a0:40908"
>>>>>>> -mca orte_debug 1 -mca pls_rsh_agent ssh -Y -mca mca_base_param_file_path
>>>>>>> /u/tprins/usr/rsl/share/openmpi/amca-param-sets:/san/homedirs/tprins/rsl/examples
>>>>>>> -mca mca_base_param_file_path_force /san/homedirs/tprins/rsl/examples
>>>>>>>
>>>>>>> Notice that in this command we now have "-mca pls_rsh_agent ssh -Y". So
>>>>>>> the quotes have been lost, and we die a horrible death.
>>>>>>>
>>>>>>> So we need to add the quotes back in somehow, or pass these options
>>>>>>> differently. I'm not sure what the best way to fix this is.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Tim
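
To make the failure in the original report concrete, here is a small Python sketch of the general mechanism: an argv element that contains a space loses its boundaries once the argv is flattened into a single command string and split again by a shell. It uses Python's shlex module as a stand-in for POSIX shell word splitting; it is not a model of orterun.c, and the variable names are made up for this sketch.

```python
import shlex

# argv as the launcher sees it: the local shell has already removed the
# quotes, so "ssh -Y" is ONE element that happens to contain a space.
mca_args = ["-mca", "pls_rsh_agent", "ssh -Y"]

# Flattening with plain spaces, then letting a shell split the string again,
# turns the single value into two tokens.
naive = " ".join(mca_args)
naive_tokens = shlex.split(naive)

# Quoting each element before flattening preserves the boundaries.
quoted = " ".join(shlex.quote(arg) for arg in mca_args)
quoted_tokens = shlex.split(quoted)

print(naive_tokens)   # ['-mca', 'pls_rsh_agent', 'ssh', '-Y']
print(quoted_tokens)  # ['-mca', 'pls_rsh_agent', 'ssh -Y']
```

Note that this shell-level quoting is consumed by the shell that re-splits the string, so the receiving parser would see the bare value with no quotes left to remove.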