Open MPI logo

Open MPI Development Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Development mailing list

Subject: Re: [OMPI devel] Multiworld MCA parameter values broken
From: Ralph H Castain (rhc_at_[hidden])
Date: 2007-11-19 09:19:17


My, you are joining late! ;-)

The problem is with MCA params that take string arguments. If we pass the
following:

-mca foo "my beloved string"

on the command line of an orted, we get a value for foo that includes the
quote marks. I verified this rather painfully when attempting to pass a
command line mca param for a uri. I eventually had to add specific code to
clean the paramater up.

This appears to be somehow related to the precise method used to register
the param. For example, the following deprecated method works fine:

On the setup end:

    opal_argv_append(argc, argv, "--gprreplica");
    if (NULL != orte_process_info.gpr_replica_uri) {
        contact_info = strdup(orte_process_info.gpr_replica_uri);
    } else {
        contact_info = orte_rml.get_contact_info();
    }
    asprintf(&param, "\"%s\"", contact_info);
    opal_argv_append(argc, argv, param);

And on the receiving end:
    id = mca_base_param_register_string("gpr", "replica", "uri", NULL,
orte_process_info.gpr_replica_uri);
    mca_base_param_lookup_string(id, &(orte_process_info.gpr_replica_uri));
    mca_base_param_set_internal(id, true);

 
However, the following does NOT work cleanly:

On the setup end:
    rml_uri = orte_rml.get_contact_info();
    asprintf(&param, "\"%s\"", rml_uri);
    opal_argv_append(argc, argv, "--hnp-uri");
    opal_argv_append(argc, argv, param);

On the receiving end:
    mca_base_param_reg_string_name("orte", "hnp_uri",
                                   "HNP contact info",
                                   true, false, NULL, &uri);

Thereby necessitating the addition of the following code to clean it up:
    if (NULL != uri) {
        /* the uri value passed to us will have quote marks around it to
protect
        * the value if passed on the command line. We must remove those
        * to have a correct uri string
        */
        if ('"' == uri[0]) {
            /* if the first char is a quote, then so will the last one be */
            ptr = &uri[1];
            len = strlen(ptr) - 1;
        } else {
            ptr = &uri[0];
            len = strlen(uri);
        }
        
        /* we have to copy the string by hand as strndup is a GNU extension
         * and may not be generally available
         */
        orte_process_info.my_hnp_uri = (char*)malloc(len+1);
        for (i=0; i < len; i++) {
            orte_process_info.my_hnp_uri[i] = ptr[i];
        }
        orte_process_info.my_hnp_uri[len] = '\0'; /* NULL terminate */
        free(uri);

    }

It was my understanding that you wanted us to move away from the deprecated
interface ­ hence my comment that we cannot just quote all the strings as we
would have to add this code all over the place, or (better) fix
opal_cmd_line_parse.

Hope that helps
Ralph

On 11/19/07 7:01 AM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:

> Sorry -- I'm just joining this conversation late: what's the problem
> with opal_cmd_line_parse?
>
> It should obey all quoting from shells, etc. I.e., it shouldn't care
> about tokens with special characters (to include spaces) because the
> shell divides all of that stuff up -- it just gets a char*[] that it
> treats as discrete tokens.
>
> Is it doing something wrong?
>
>
>
> On Nov 19, 2007, at 8:39 AM, Ralph H Castain wrote:
>
>> I'm not sure it is really necessary - the problem is solely within
>> opal_cmd_line_parse and (if someone can parse that code ;-)) is
>> truly simple
>> to fix. The overly long cmd line issue is due to a bug that Josh was
>> going
>> to look at (may already have done so while I was out of touch).
>>
>> Ralph
>>
>>
>>
>> On 11/9/07 5:10 AM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>>
>>> Should there be another option for passing MCA parameters between
>>> processes, such as via stdin (or any file descriptor)? I.e., during
>>> the command line parsing to check for command line MCA params,
>>> perhaps
>>> a new argument could be introduced: -mcauri <uri>, where <uri> could
>>> be a few different forms:
>>>
>>> - file://stdin: (note the 2 //, not 3, so "stdin" would never
>>> conflict
>>> with a real file named /stdin) Read the parameters in off stdin.
>>>
>>> - rml://...rml contact info...: read in the MCA params via the RML
>>> (although I assume that reading via the RML would be *waaaay* to late
>>> during the MCA setup process -- I mentioned this option for
>>> completeness, even though I don't think it'll work)
>>>
>>> - ip://ipaddress:port: open a socket back and read the MCA params in
>>> over a socket. This could have some scalability issues...? But who
>>> knows; it could be tied into the hierarchical startup such that we
>>> wouldn't have to have an all-to-one connection scheme. Certainly it
>>> would cause scalability problems when paired with today's all-to-one
>>> RML connection scheme for the OOB.
>>>
>>> I'm not sure that the rml: and ip: schemes are worthwhile. Maybe a
>>> file://stdin kind of approach could work? Or perhaps some other kind
>>> of URI/IPC...? (I really haven't thought through the issues -- this
>>> is off the top of my head)
>>>
>>>
>>>
>>> On Nov 8, 2007, at 2:36 PM, Ralph H Castain wrote:
>>>
>>>> Might I suggest:
>>>>
>>>> https://svn.open-mpi.org/trac/ompi/ticket/1073
>>>>
>>>> It deals with some of these issues and explains the boundaries of
>>>> the
>>>> problem. As for what a string param can contain, I have no opinion.
>>>> I only
>>>> note that it must handle special characters such as ';', '/', etc.
>>>> that are
>>>> typically found in uri's. I cannot think of any reason it should
>>>> have a
>>>> quote in it.
>>>>
>>>> Ralph
>>>>
>>>>
>>>>
>>>> On 11/8/07 12:25 PM, "Tim Prins" <tprins_at_[hidden]> wrote:
>>>>
>>>>> The alias option you presented does not work. I think we do some
>>>>> weird
>>>>> things to find the absolute path for ssh, instead of just issuing
>>>>> the
>>>>> command.
>>>>>
>>>>> I would spend some time fixing this, but I don't want to do it
>>>>> wrong. We
>>>>> could quote all the param values, and change the parser to remove
>>>>> the
>>>>> quotes, but this is assuming that a mca param does not contain
>>>>> quotes.
>>>>>
>>>>> So I guess there are 2 questions that need to be answered before a
>>>>> fix
>>>>> is made:
>>>>>
>>>>> 1. What exactly can a string mca param contain? Can it have
>>>>> quotes or
>>>>> spaces or?
>>>>>
>>>>> 2. Which mca parameters should be forwarded? Should it be just the
>>>>> ones
>>>>> from the command line? From the environment? From config files?
>>>>>
>>>>> Tim
>>>>>
>>>>> Ralph Castain wrote:
>>>>>> What changed is that we never passed mca params to the orted
>>>>>> before - they
>>>>>> always went to the app, but it's the orted that has the issue.
>>>>>> There is a
>>>>>> bug ticket thread on this subject - I forget the number
>>>>>> immediately.
>>>>>>
>>>>>> Basically, the problem was that we cannot generally pass the local
>>>>>> environment to the orteds when we launch them. However, people
>>>>>> needed
>>>>>> various mca params to get to the orteds to control their behavior.
>>>>>> The only
>>>>>> way to resolve that problem was to pass the params via the command
>>>>>> line,
>>>>>> which is what was done.
>>>>>>
>>>>>> Except for a very few cases, all of our mca params are single
>>>>>> values that do
>>>>>> not include spaces, so this is not a problem that is causing
>>>>>> widespread
>>>>>> issues. As I said, I already had to deal with one special case
>>>>>> that didn't
>>>>>> involve spaces, but did have special characters that required
>>>>>> quoting, which
>>>>>> identified the larger problem of dealing with quoted strings.
>>>>>>
>>>>>> I have no objection to a more general fix. Like I said in my note,
>>>>>> though,
>>>>>> the general fix will take a larger effort. If someone is willing
>>>>>> to do so,
>>>>>> that is fine with me - I was only offering solutions that would
>>>>>> fill the
>>>>>> interim time as I haven't heard anyone step up to say they would
>>>>>> fix it
>>>>>> anytime soon.
>>>>>>
>>>>>> Please feel free to jump in and volunteer! ;-) I'm willing to put
>>>>>> the quotes
>>>>>> around things if you will fix the mca cmd line parser to cleanly
>>>>>> remove them
>>>>>> on the other end.
>>>>>>
>>>>>> Ralph
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 11/7/07 5:50 PM, "Tim Prins" <tprins_at_[hidden]> wrote:
>>>>>>
>>>>>>> I'm curious what changed to make this a problem. How were we
>>>>>>> passing mca
>>>>>>> param
>>>>>>> from the base to the app before, and why did it change?
>>>>>>>
>>>>>>> I think that options 1 & 2 below are no good, since we, in
>>>>>>> general, allow
>>>>>>> string mca params to have spaces (as far as I understand it). So
>>>>>>> a more
>>>>>>> general approach is needed.
>>>>>>>
>>>>>>> Tim
>>>>>>>
>>>>>>> On Wednesday 07 November 2007 10:40:45 am Ralph H Castain wrote:
>>>>>>>> Sorry for delay - wasn't ignoring the issue.
>>>>>>>>
>>>>>>>> There are several fixes to this problem - ranging in order from
>>>>>>>> least to
>>>>>>>> most work:
>>>>>>>>
>>>>>>>> 1. just alias "ssh" to be "ssh -Y" and run without setting the
>>>>>>>> mca param.
>>>>>>>> It won't affect anything on the backend because the daemon/procs
>>>>>>>> don't use
>>>>>>>> ssh.
>>>>>>>>
>>>>>>>> 2. include "pls_rsh_agent" in the array of mca params not to be
>>>>>>>> passed to
>>>>>>>> the orted in orte/mca/pls/base/pls_base_general_support_fns.c,
>>>>>>>> the
>>>>>>>> orte_pls_base_orted_append_basic_args function. This would fix
>>>>>>>> the specific
>>>>>>>> problem cited here, but I admit that listing every such param by
>>>>>>>> name would
>>>>>>>> get tedious.
>>>>>>>>
>>>>>>>> 3. we could easily detect that a "problem" character was in the
>>>>>>>> mca param
>>>>>>>> value when we add it to the orted's argv, and then put "" around
>>>>>>>> it. The
>>>>>>>> problem, however, is that the mca param parser on the far end
>>>>>>>> doesn't
>>>>>>>> remove those "" from the resulting string. At least, I spent
>>>>>>>> over a day
>>>>>>>> fighting with a problem only to discover that was happening.
>>>>>>>> Could be an
>>>>>>>> error in the way I was doing things, or could be a real
>>>>>>>> characteristic of
>>>>>>>> the parser. Anyway, we would have to ensure that the parser
>>>>>>>> removes any
>>>>>>>> surrounding "" before passing along the param value or this
>>>>>>>> won't work.
>>>>>>>>
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>> On 11/5/07 12:10 PM, "Tim Prins" <tprins_at_[hidden]> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Commit 16364 broke things when using multiword mca param
>>>>>>>>> values. For
>>>>>>>>> instance:
>>>>>>>>>
>>>>>>>>> mpirun --debug-daemons -mca orte_debug 1 -mca pls rsh -mca
>>>>>>>>> pls_rsh_agent
>>>>>>>>> "ssh -Y" xterm
>>>>>>>>>
>>>>>>>>> Will crash and burn, because the value "ssh -Y" is being stored
>>>>>>>>> into the
>>>>>>>>> argv orted_cmd_line in orterun.c:1506. This is then added to
>>>>>>>>> the launch
>>>>>>>>> command for the orted:
>>>>>>>>>
>>>>>>>>> /usr/bin/ssh -Y odin004 PATH=/san/homedirs/tprins/usr/rsl/bin:
>>>>>>>>> $PATH ;
>>>>>>>>> export PATH ;
>>>>>>>>> LD_LIBRARY_PATH=/san/homedirs/tprins/usr/rsl/lib:
>>>>>>>>> $LD_LIBRARY_PATH ;
>>>>>>>>> export LD_LIBRARY_PATH ; /san/homedirs/tprins/usr/rsl/bin/orted
>>>>>>>>> --debug
>>>>>>>>> --debug-daemons --name 0.1 --num_procs 2 --vpid_start 0 --
>>>>>>>>> nodename
>>>>>>>>> odin004 --universe tprins_at_[hidden]:default-
>>>>>>>>> universe-27872
>>>>>>>>> --nsreplica
>>>>>>>>> "0.0;tcp://
>>>>>>>>> 129.79.240.100:40907;tcp6://2001:18e8:2:240:2e0:81ff:fe2d:21a0
>>>>>>>>> :4090 8"
>>>>>>>>> --gprreplica
>>>>>>>>> "0.0;tcp://
>>>>>>>>> 129.79.240.100:40907;tcp6://2001:18e8:2:240:2e0:81ff:fe2d:21a0
>>>>>>>>> :4090 8"
>>>>>>>>> -mca orte_debug 1 -mca pls_rsh_agent ssh -Y -mca
>>>>>>>>> mca_base_param_file_path
>>>>>>>>> /u/tprins/usr/rsl/share/openmpi/amca-param-sets:/san/homedirs/
>>>>>>>>> tprins/rsl/
>>>>>>>>> examp les
>>>>>>>>> -mca mca_base_param_file_path_force /san/homedirs/tprins/rsl/
>>>>>>>>> examples
>>>>>>>>>
>>>>>>>>> Notice that in this command we now have "-mca pls_rsh_agent ssh
>>>>>>>>> -Y". So
>>>>>>>>> the quotes have been lost, as we die a horrible death.
>>>>>>>>>
>>>>>>>>> So we need to add the quotes back in somehow, or pass these
>>>>>>>>> options
>>>>>>>>> differently. I'm not sure what the best way to fix this.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Tim
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> devel mailing list
>>>> devel_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>