Open MPI logo

Open MPI Development Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Development mailing list

Subject: Re: [OMPI devel] Multiworld MCA parameter values broken
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-11-19 10:19:58


On Nov 19, 2007, at 10:15 AM, Ralph H Castain wrote:

> Hmmm...okay, I get the message: if it is going to be fixed, then I
> have to
> fix it on my end. ;-) Which means it ain't happening anytime soon,
> so Tim P
> is the one left in the lurch.
>
> Sorry Tim! Perhaps you can come up with a compromise that re-enables
> what
> you want to do.

Yep -- sorry Tim. :-(

Is this only a problem for the rsh/ssh launcher? I *think* that
should be the only launcher affected -- all others use raw argv string
arrays, right? If so, we already have a local/non-local flag in the
rsh launcher, right? I agree it kinda sucks, but it may be necessary
to construct argv differently between local/non-local (which we
already do, right?).

> On 11/19/07 8:11 AM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>
>> On Nov 19, 2007, at 10:01 AM, Ralph H Castain wrote:
>>
>>> Unfortunately, it -is- a significant problem when passing the params
>>> on to
>>> the orteds, as Tim has eloquently pointed out in his original
>>> posting. I
>>> guess I don't see why it would be a significant problem to just have
>>> opal_cmd_line_parse do the "right thing" - if a string param is
>>> quoted, then
>>> just remove the quotes.
>>
>> How would opal_cmd_line_parse know when to remove quotes and when not
>> remove quotes? I.e., what if an MCA param (or some other value, such
>> as a user application argv) required quotes? This is not uncommon
>> for
>> the wrapper compilers, for example (passing -D's to the C
>> preprocessor
>> that take string arguments).
>>
>> AFAIK, opal_cmd_line_parse() does exactly what I intended it to. :-)
>> It performs no shell-like quoting because it expects to be given a
>> list of "final" string tokens.
>>
>>> Problem solved, and I wouldn't have to add a bunch of code to figure
>>> out
>>> when to assemble argv arrays in different ways.
>>
>> Er... I don't understand: I joined this thread because I thought
>> there
>> was still something to fix...?
>>
>> I would be against changing opal_cmd_line_parse() in the manner that
>> you have described because it will break other things (e.g., the
>> wrapper compilers).
>>
>>
>>
>>>
>>>
>>> On 11/19/07 7:52 AM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>>>
>>>> I guess I don't see why this is an opal_cmd_line_parse() problem.
>>>>
>>>> If you invoke executables through system(), then a shell is used
>>>> and
>>>> quoting is necessary to preserve the individual string tokens
>>>> (i.e.,
>>>> "my beloved string" would be passed to the application as one argv
>>>> token, without the quotes).
>>>>
>>>> But if you're building up an array of argv and calling some form of
>>>> exec(), then no shell is involved and quoting should not be
>>>> necessary. Specifically:
>>>>
>>>> opal_append_argv(&argc, &argv, "my beloved string");
>>>>
>>>> will be passed as one string token to the application.
>>>>
>>>> opal_cmd_line_parse() is passed an array of argv, meaning that the
>>>> command line have already been split into individual string
>>>> tokens. I
>>>> guess the question is whether these come in directly from the
>>>> arguments to main() or whether we are getting a single string and
>>>> breaking it up into tokens. If the latter is true, then we need to
>>>> re-
>>>> evaluate our break-into-tokens algorithm. I have a dim
>>>> recollection
>>>> that these come in from the arguments to main(), though.
>>>>
>>>> I guess I can see where this would get complicated for rsh/ssh
>>>> invocations, because *both* models are used. (i.e., you exec
>>>> locally
>>>> but it turns into a system-like invocation on the remote side). In
>>>> this case, I think you'll need to quote extended strings (e.g.,
>>>> those
>>>> containing spaces) for the non-local invocations not not quote it
>>>> for
>>>> local invocations.
>>>>
>>>>
>>>>
>>>>
>>>> On Nov 19, 2007, at 9:19 AM, Ralph H Castain wrote:
>>>>
>>>>> My, you are joining late! ;-)
>>>>>
>>>>> The problem is with MCA params that take string arguments. If we
>>>>> pass the following:
>>>>>
>>>>> -mca foo "my beloved string"
>>>>>
>>>>> on the command line of an orted, we get a value for foo that
>>>>> includes the quote marks. I verified this rather painfully when
>>>>> attempting to pass a command line mca param for a uri. I
>>>>> eventually
>>>>> had to add specific code to clean the paramater up.
>>>>>
>>>>> This appears to be somehow related to the precise method used to
>>>>> register the param. For example, the following deprecated method
>>>>> works fine:
>>>>>
>>>>> On the setup end:
>>>>>
>>>>> opal_argv_append(argc, argv, "--gprreplica");
>>>>> if (NULL != orte_process_info.gpr_replica_uri) {
>>>>> contact_info = strdup(orte_process_info.gpr_replica_uri);
>>>>> } else {
>>>>> contact_info = orte_rml.get_contact_info();
>>>>> }
>>>>> asprintf(&param, "\"%s\"", contact_info);
>>>>> opal_argv_append(argc, argv, param);
>>>>>
>>>>> And on the receiving end:
>>>>> id = mca_base_param_register_string("gpr", "replica", "uri",
>>>>> NULL, orte_process_info.gpr_replica_uri);
>>>>> mca_base_param_lookup_string(id,
>>>>> &(orte_process_info.gpr_replica_uri));
>>>>> mca_base_param_set_internal(id, true);
>>>>>
>>>>>
>>>>> However, the following does NOT work cleanly:
>>>>>
>>>>> On the setup end:
>>>>> rml_uri = orte_rml.get_contact_info();
>>>>> asprintf(&param, "\"%s\"", rml_uri);
>>>>> opal_argv_append(argc, argv, "--hnp-uri");
>>>>> opal_argv_append(argc, argv, param);
>>>>>
>>>>> On the receiving end:
>>>>> mca_base_param_reg_string_name("orte", "hnp_uri",
>>>>> "HNP contact info",
>>>>> true, false, NULL, &uri);
>>>>>
>>>>> Thereby necessitating the addition of the following code to
>>>>> clean it
>>>>> up:
>>>>> if (NULL != uri) {
>>>>> /* the uri value passed to us will have quote marks around
>>>>> it to protect
>>>>> * the value if passed on the command line. We must remove
>>>>> those
>>>>> * to have a correct uri string
>>>>> */
>>>>> if ('"' == uri[0]) {
>>>>> /* if the first char is a quote, then so will the last
>>>>> one be */
>>>>> ptr = &uri[1];
>>>>> len = strlen(ptr) - 1;
>>>>> } else {
>>>>> ptr = &uri[0];
>>>>> len = strlen(uri);
>>>>> }
>>>>>
>>>>> /* we have to copy the string by hand as strndup is a GNU
>>>>> extension
>>>>> * and may not be generally available
>>>>> */
>>>>> orte_process_info.my_hnp_uri = (char*)malloc(len+1);
>>>>> for (i=0; i < len; i++) {
>>>>> orte_process_info.my_hnp_uri[i] = ptr[i];
>>>>> }
>>>>> orte_process_info.my_hnp_uri[len] = '\0'; /* NULL terminate
>>>>> */
>>>>> free(uri);
>>>>>
>>>>> }
>>>>>
>>>>> It was my understanding that you wanted us to move away from the
>>>>> deprecated interface – hence my comment that we cannot just quote
>>>>> all the strings as we would have to add this code all over the
>>>>> place, or (better) fix opal_cmd_line_parse.
>>>>>
>>>>> Hope that helps
>>>>> Ralph
>>>>>
>>>>>
>>>>> On 11/19/07 7:01 AM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>>>>>
>>>>>> Sorry -- I'm just joining this conversation late: what's the
>>>>>> problem
>>>>>> with opal_cmd_line_parse?
>>>>>>
>>>>>> It should obey all quoting from shells, etc. I.e., it shouldn't
>>>>> care
>>>>>> about tokens with special characters (to include spaces) because
>>>>>> the
>>>>>> shell divides all of that stuff up -- it just gets a char*[] that
>>>>>> it
>>>>>> treats as discrete tokens.
>>>>>>
>>>>>> Is it doing something wrong?
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Nov 19, 2007, at 8:39 AM, Ralph H Castain wrote:
>>>>>>
>>>>>>> I'm not sure it is really necessary - the problem is solely
>>>>>>> within
>>>>>>> opal_cmd_line_parse and (if someone can parse that code ;-)) is
>>>>>>> truly simple
>>>>>>> to fix. The overly long cmd line issue is due to a bug that Josh
>>>>> was
>>>>>>> going
>>>>>>> to look at (may already have done so while I was out of touch).
>>>>>>>
>>>>>>> Ralph
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 11/9/07 5:10 AM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>>>>>>>
>>>>>>>> Should there be another option for passing MCA parameters
>>>>>>>> between
>>>>>>>> processes, such as via stdin (or any file descriptor)? I.e.,
>>>>> during
>>>>>>>> the command line parsing to check for command line MCA params,
>>>>>>>> perhaps
>>>>>>>> a new argument could be introduced: -mcauri <uri>, where <uri>
>>>>> could
>>>>>>>> be a few different forms:
>>>>>>>>
>>>>>>>> - file://stdin: (note the 2 //, not 3, so "stdin" would never
>>>>>>>> conflict
>>>>>>>> with a real file named /stdin) Read the parameters in off
>>>>>>>> stdin.
>>>>>>>>
>>>>>>>> - rml://...rml contact info...: read in the MCA params via the
>>>>>>>> RML
>>>>>>>> (although I assume that reading via the RML would be *waaaay*
>>>>>>>> to
>>>>> late
>>>>>>>> during the MCA setup process -- I mentioned this option for
>>>>>>>> completeness, even though I don't think it'll work)
>>>>>>>>
>>>>>>>> - ip://ipaddress:port: open a socket back and read the MCA
>>>>> params in
>>>>>>>> over a socket. This could have some scalability issues...?
>>>>>>>> But
>>>>> who
>>>>>>>> knows; it could be tied into the hierarchical startup such that
>>>>>>>> we
>>>>>>>> wouldn't have to have an all-to-one connection scheme.
>>>>> Certainly it
>>>>>>>> would cause scalability problems when paired with today's all-
>>>>>>>> to-
>>>>> one
>>>>>>>> RML connection scheme for the OOB.
>>>>>>>>
>>>>>>>> I'm not sure that the rml: and ip: schemes are worthwhile.
>>>>> Maybe a
>>>>>>>> file://stdin kind of approach could work? Or perhaps some
>>>>>>>> other
>>>>> kind
>>>>>>>> of URI/IPC...? (I really haven't thought through the issues --
>>>>> this
>>>>>>>> is off the top of my head)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Nov 8, 2007, at 2:36 PM, Ralph H Castain wrote:
>>>>>>>>
>>>>>>>>> Might I suggest:
>>>>>>>>>
>>>>>>>>> https://svn.open-mpi.org/trac/ompi/ticket/1073
>>>>>>>>>
>>>>>>>>> It deals with some of these issues and explains the boundaries
>>>>>>>>> of
>>>>>>>>> the
>>>>>>>>> problem. As for what a string param can contain, I have no
>>>>> opinion.
>>>>>>>>> I only
>>>>>>>>> note that it must handle special characters such as ';', '/',
>>>>> etc.
>>>>>>>>> that are
>>>>>>>>> typically found in uri's. I cannot think of any reason it
>>>>>>>>> should
>>>>>>>>> have a
>>>>>>>>> quote in it.
>>>>>>>>>
>>>>>>>>> Ralph
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 11/8/07 12:25 PM, "Tim Prins" <tprins_at_[hidden]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> The alias option you presented does not work. I think we do
>>>>>>>>>> some
>>>>>>>>>> weird
>>>>>>>>>> things to find the absolute path for ssh, instead of just
>>>>> issuing
>>>>>>>>>> the
>>>>>>>>>> command.
>>>>>>>>>>
>>>>>>>>>> I would spend some time fixing this, but I don't want to do
>>>>>>>>>> it
>>>>>>>>>> wrong. We
>>>>>>>>>> could quote all the param values, and change the parser to
>>>>> remove
>>>>>>>>>> the
>>>>>>>>>> quotes, but this is assuming that a mca param does not
>>>>>>>>>> contain
>>>>>>>>>> quotes.
>>>>>>>>>>
>>>>>>>>>> So I guess there are 2 questions that need to be answered
>>>>> before a
>>>>>>>>>> fix
>>>>>>>>>> is made:
>>>>>>>>>>
>>>>>>>>>> 1. What exactly can a string mca param contain? Can it have
>>>>>>>>>> quotes or
>>>>>>>>>> spaces or?
>>>>>>>>>>
>>>>>>>>>> 2. Which mca parameters should be forwarded? Should it be
>>>>>>>>>> just
>>>>> the
>>>>>>>>>> ones
>>>>>>>>>> from the command line? From the environment? From config
>>>>>>>>>> files?
>>>>>>>>>>
>>>>>>>>>> Tim
>>>>>>>>>>
>>>>>>>>>> Ralph Castain wrote:
>>>>>>>>>>> What changed is that we never passed mca params to the orted
>>>>>>>>>>> before - they
>>>>>>>>>>> always went to the app, but it's the orted that has the
>>>>>>>>>>> issue.
>>>>>>>>>>> There is a
>>>>>>>>>>> bug ticket thread on this subject - I forget the number
>>>>>>>>>>> immediately.
>>>>>>>>>>>
>>>>>>>>>>> Basically, the problem was that we cannot generally pass the
>>>>> local
>>>>>>>>>>> environment to the orteds when we launch them. However,
>>>>>>>>>>> people
>>>>>>>>>>> needed
>>>>>>>>>>> various mca params to get to the orteds to control their
>>>>> behavior.
>>>>>>>>>>> The only
>>>>>>>>>>> way to resolve that problem was to pass the params via the
>>>>> command
>>>>>>>>>>> line,
>>>>>>>>>>> which is what was done.
>>>>>>>>>>>
>>>>>>>>>>> Except for a very few cases, all of our mca params are
>>>>>>>>>>> single
>>>>>>>>>>> values that do
>>>>>>>>>>> not include spaces, so this is not a problem that is causing
>>>>>>>>>>> widespread
>>>>>>>>>>> issues. As I said, I already had to deal with one special
>>>>>>>>>>> case
>>>>>>>>>>> that didn't
>>>>>>>>>>> involve spaces, but did have special characters that
>>>>>>>>>>> required
>>>>>>>>>>> quoting, which
>>>>>>>>>>> identified the larger problem of dealing with quoted
>>>>>>>>>>> strings.
>>>>>>>>>>>
>>>>>>>>>>> I have no objection to a more general fix. Like I said in my
>>>>> note,
>>>>>>>>>>> though,
>>>>>>>>>>> the general fix will take a larger effort. If someone is
>>>>> willing
>>>>>>>>>>> to do so,
>>>>>>>>>>> that is fine with me - I was only offering solutions that
>>>>>>>>>>> would
>>>>>>>>>>> fill the
>>>>>>>>>>> interim time as I haven't heard anyone step up to say they
>>>>> would
>>>>>>>>>>> fix it
>>>>>>>>>>> anytime soon.
>>>>>>>>>>>
>>>>>>>>>>> Please feel free to jump in and volunteer! ;-) I'm willing
>>>>>>>>>>> to
>>>>> put
>>>>>>>>>>> the quotes
>>>>>>>>>>> around things if you will fix the mca cmd line parser to
>>>>> cleanly
>>>>>>>>>>> remove them
>>>>>>>>>>> on the other end.
>>>>>>>>>>>
>>>>>>>>>>> Ralph
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 11/7/07 5:50 PM, "Tim Prins" <tprins_at_[hidden]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I'm curious what changed to make this a problem. How were
>>>>>>>>>>>> we
>>>>>>>>>>>> passing mca
>>>>>>>>>>>> param
>>>>>>>>>>>> from the base to the app before, and why did it change?
>>>>>>>>>>>>
>>>>>>>>>>>> I think that options 1 & 2 below are no good, since we, in
>>>>>>>>>>>> general, allow
>>>>>>>>>>>> string mca params to have spaces (as far as I understand
>>>>> it). So
>>>>>>>>>>>> a more
>>>>>>>>>>>> general approach is needed.
>>>>>>>>>>>>
>>>>>>>>>>>> Tim
>>>>>>>>>>>>
>>>>>>>>>>>> On Wednesday 07 November 2007 10:40:45 am Ralph H Castain
>>>>> wrote:
>>>>>>>>>>>>> Sorry for delay - wasn't ignoring the issue.
>>>>>>>>>>>>>
>>>>>>>>>>>>> There are several fixes to this problem - ranging in order
>>>>> from
>>>>>>>>>>>>> least to
>>>>>>>>>>>>> most work:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. just alias "ssh" to be "ssh -Y" and run without setting
>>>>> the
>>>>>>>>>>>>> mca param.
>>>>>>>>>>>>> It won't affect anything on the backend because the
>>>>>>>>>>>>> daemon/
>>>>> procs
>>>>>>>>>>>>> don't use
>>>>>>>>>>>>> ssh.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2. include "pls_rsh_agent" in the array of mca params not
>>>>> to be
>>>>>>>>>>>>> passed to
>>>>>>>>>>>>> the orted in orte/mca/pls/base/
>>>>> pls_base_general_support_fns.c,
>>>>>>>>>>>>> the
>>>>>>>>>>>>> orte_pls_base_orted_append_basic_args function. This would
>>>>> fix
>>>>>>>>>>>>> the specific
>>>>>>>>>>>>> problem cited here, but I admit that listing every such
>>>>> param by
>>>>>>>>>>>>> name would
>>>>>>>>>>>>> get tedious.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 3. we could easily detect that a "problem" character was
>>>>>>>>>>>>> in
>>>>> the
>>>>>>>>>>>>> mca param
>>>>>>>>>>>>> value when we add it to the orted's argv, and then put ""
>>>>> around
>>>>>>>>>>>>> it. The
>>>>>>>>>>>>> problem, however, is that the mca param parser on the far
>>>>>>>>>>>>> end
>>>>>>>>>>>>> doesn't
>>>>>>>>>>>>> remove those "" from the resulting string. At least, I
>>>>>>>>>>>>> spent
>>>>>>>>>>>>> over a day
>>>>>>>>>>>>> fighting with a problem only to discover that was
>>>>>>>>>>>>> happening.
>>>>>>>>>>>>> Could be an
>>>>>>>>>>>>> error in the way I was doing things, or could be a real
>>>>>>>>>>>>> characteristic of
>>>>>>>>>>>>> the parser. Anyway, we would have to ensure that the
>>>>>>>>>>>>> parser
>>>>>>>>>>>>> removes any
>>>>>>>>>>>>> surrounding "" before passing along the param value or
>>>>>>>>>>>>> this
>>>>>>>>>>>>> won't work.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 11/5/07 12:10 PM, "Tim Prins" <tprins_at_[hidden]>
>>>>> wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Commit 16364 broke things when using multiword mca param
>>>>>>>>>>>>> values. For
>>>>>>>>>>>>> instance:
>>>>>>>>>>>>>
>>>>>>>>>>>>> mpirun --debug-daemons -mca orte_debug 1 -mca pls rsh -mca
>>>>>>>>>>>>> pls_rsh_agent
>>>>>>>>>>>>> "ssh -Y" xterm
>>>>>>>>>>>>>
>>>>>>>>>>>>> Will crash and burn, because the value "ssh -Y" is being
>>>>> stored
>>>>>>>>>>>>> into the
>>>>>>>>>>>>> argv orted_cmd_line in orterun.c:1506. This is then added
>>>>>>>>>>>>> to
>>>>>>>>>>>>> the launch
>>>>>>>>>>>>> command for the orted:
>>>>>>>>>>>>>
>>>>>>>>>>>>> /usr/bin/ssh -Y odin004 PATH=/san/homedirs/tprins/usr/
>>>>>>>>>>>>> rsl/
>>>>> bin:
>>>>>>>>>>>>> $PATH ;
>>>>>>>>>>>>> export PATH ;
>>>>>>>>>>>>> LD_LIBRARY_PATH=/san/homedirs/tprins/usr/rsl/lib:
>>>>>>>>>>>>> $LD_LIBRARY_PATH ;
>>>>>>>>>>>>> export LD_LIBRARY_PATH ; /san/homedirs/tprins/usr/rsl/bin/
>>>>> orted
>>>>>>>>>>>>> --debug
>>>>>>>>>>>>> --debug-daemons --name 0.1 --num_procs 2 --vpid_start 0 --
>>>>>>>>>>>>> nodename
>>>>>>>>>>>>> odin004 --universe tprins_at_[hidden]:default-
>>>>>>>>>>>>> universe-27872
>>>>>>>>>>>>> --nsreplica
>>>>>>>>>>>>> "0.0;tcp://
>>>>>>>>>>>>> 129.79.240.100:40907;tcp6://2001:18e8:2:240:2e0:81ff:fe2d:
>>>>> 21a0
>>>>>>>>>>>>> :4090 8"
>>>>>>>>>>>>> --gprreplica
>>>>>>>>>>>>> "0.0;tcp://
>>>>>>>>>>>>> 129.79.240.100:40907;tcp6://2001:18e8:2:240:2e0:81ff:fe2d:
>>>>> 21a0
>>>>>>>>>>>>> :4090 8"
>>>>>>>>>>>>> -mca orte_debug 1 -mca pls_rsh_agent ssh -Y -mca
>>>>>>>>>>>>> mca_base_param_file_path
>>>>>>>>>>>>> /u/tprins/usr/rsl/share/openmpi/amca-param-sets:/san/
>>>>> homedirs/
>>>>>>>>>>>>> tprins/rsl/
>>>>>>>>>>>>> examp les
>>>>>>>>>>>>> -mca mca_base_param_file_path_force /san/homedirs/tprins/
>>>>> rsl/
>>>>>>>>>>>>> examples
>>>>>>>>>>>>>
>>>>>>>>>>>>> Notice that in this command we now have "-mca
>>>>> pls_rsh_agent ssh
>>>>>>>>>>>>> -Y". So
>>>>>>>>>>>>> the quotes have been lost, as we die a horrible death.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So we need to add the quotes back in somehow, or pass
>>>>>>>>>>>>> these
>>>>>>>>>>>>> options
>>>>>>>>>>>>> differently. I'm not sure what the best way to fix this.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Tim
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> devel mailing list
>>>>>>>>> devel_at_[hidden]
>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>
>>>>>>
>>>>
>>

-- 
Jeff Squyres
Cisco Systems