Just for the sake of it, here is a funny command line to try:
[bosilca@dancer ~]$ mpirun --mca routed_base_verbose 0 --leave-session-attached -np 1 --mca orte_launch_agent "orted --mca routed_base_verbose 1" uptime
[node03:22355] [[14661,0],1] routed_linear: init routes for daemon job [14661,0]
hnp_uri 960823296.0;tcp://192.168.1.254:58135;tcp://192.168.0.2:58135
18:02:59 up 26 days, 17:41, 0 users, load average: 0.97, 0.50, 0.53
[bosilca@dancer ~]$ [node03:22355] [[14661,0],1] routed_linear_get([[14661,0],0]) --> [[14661,0],0]
[node03:22355] [[14661,0],1] routed_linear: init routes for daemon job [14661,0]
hnp_uri 960823296.0;tcp://192.168.1.254:58135;tcp://192.168.0.2:58135
[node03:22355] [[14661,0],1] routed_linear_get([[14661,0],0]) --> [[14661,0],0]
[node03:22355] [[14661,0],1] routed_linear_get([[14661,0],0]) --> [[14661,0],0]
[node03:22355] [[14661,0],1] routed_linear_get([[14661,0],0]) --> [[14661,0],0]
This sets routed_base_verbose to zero for the HNP and to 1 for everybody else. As you can see from the output, the orted prints routing information, which means it correctly interpreted the multi-word argument.
george.
On Jun 24, 2009, at 17:52 , George Bosilca wrote:

On Jun 24, 2009, at 17:41 , Jeff Squyres wrote:

-----
[14:38] svbu-mpi:~/svn/ompi/orte % mpirun --mca plm_base_verbose 100 --leave-session-attached -np 1 --mca orte_launch_agent "$bogus/bin/orted -s" uptime
...lots of output...
srun --nodes=1 --ntasks=1 --kill-on-bad-exit --nodelist=svbu-mpi062 /home/jsquyres/bogus/bin/orted -s -mca ess slurm -mca orte_ess_jobid 3195142144 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "3195142144.0;tcp://172.29.218.140:34489;tcp://10.10.20.250:34489;tcp://10.10.30.250:34489;tcp://192.168.183.1:34489;tcp://192.168.184.1:34489" -mca orte_nodelist svbu-mpi062 --mca plm_base_verbose 100 --mca orte_launch_agent "/home/jsquyres/bogus/bin/orted -s"
...
-----

and it hangs, because the argv[0]

"/home/jsquyres/bogus/bin/orted -s"

(including the quotes!) cannot be exec'ed.

OK so maybe the -s option was a bad example (it's the one I use regularly). It block the orted, you will have to log on each node, attach with gdb to the orted, and release them by doing a "set orted_spin_flag=0".

george.

On Jun 24, 2009, at 5:15 PM, George Bosilca wrote:

I can't guarantee this for all PLM but I can confirm that rsh and slurm (1.3.12) works well with this.

We try with and without Open MPI, and the outcome is the same.

[bosilca@dancer c]$ srun -n 4 echo "1 2 3 4 5 it works"
1 2 3 4 5 it works
1 2 3 4 5 it works
1 2 3 4 5 it works
1 2 3 4 5 it works

[bosilca@dancer c]$ srun -N 2 -c 2 mpirun --mca plm slurm --mca orte_launch_agent "orted -s" --mca plm_rsh_tree_spawn 1 --bynode --mca pml ob1 --mca orte_daemon_spin 0 ./hello
Hello, world, I am 0 of 2 on node03
Hello, world, I am 1 of 2 on node04
*after releasing the orted from their spin.

In fact what I find strange is the old behavior. Dropping arguments without even letting the user know about it, is certainly not a desirable approach.

george.

On Jun 24, 2009, at 16:15 , Ralph Castain wrote:

> Yo George
>
> This commit is going to break non-rsh launchers. While it is true
> that the rsh launcher may handle multi-word options by putting them
> in quotes, we specifically avoided it here because it breaks SLURM,
> Torque, and others.
>
> This is why we specifically put the inclusion of multi-word options
> in the rsh plm module, and not here. Would you please move it back
> there?
>
> Thanks
> Ralph
>
>
> On Wed, Jun 24, 2009 at 1:51 PM, <bosilca@osl.iu.edu> wrote:
> Author: bosilca
> Date: 2009-06-24 15:51:52 EDT (Wed, 24 Jun 2009)
> New Revision: 21513
> URL: https://svn.open-mpi.org/trac/ompi/changeset/21513
>
> Log:
> When we get a report from an orted about its state, don't use the sender of
> the message to update the structures, but instead use the information from
> the URI. The reason is that even the launch report messages can get routed.
>
> Deal with the orted_cmd_line in a single location.
>
> Text files modified:
>    trunk/orte/mca/plm/base/plm_base_launch_support.c |    69 +++++++++++++++++++++++----------------
>    1 files changed, 41 insertions(+), 28 deletions(-)
>
> Modified: trunk/orte/mca/plm/base/plm_base_launch_support.c
> ==============================================================================
> --- trunk/orte/mca/plm/base/plm_base_launch_support.c (original)
> +++ trunk/orte/mca/plm/base/plm_base_launch_support.c 2009-06-24 15:51:52 EDT (Wed, 24 Jun 2009)
> @@ -433,7 +433,8 @@
>  {
>      orte_message_event_t *mev = (orte_message_event_t*)data;
>      opal_buffer_t *buffer = mev->buffer;
> -    char *rml_uri;
> +    orte_process_name_t peer;
> +    char *rml_uri = NULL;
>      int rc, idx;
>      int32_t arch;
>      orte_node_t **nodes;
> @@ -442,19 +443,11 @@
>      int64_t setupsec, setupusec;
>      int64_t startsec, startusec;
>
> -    OPAL_OUTPUT_VERBOSE((5, orte_plm_globals.output,
> -                         "%s plm:base:orted_report_launch from daemon %s",
> -                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
> -                         ORTE_NAME_PRINT(&mev->sender)));
> -
>      /* see if we need to timestamp this receipt */
>      if (orte_timing) {
>          gettimeofday(&recvtime, NULL);
>      }
>
> -    /* update state */
> -    pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
> -
>      /* unpack its contact info */
>      idx = 1;
>      if (ORTE_SUCCESS != (rc = opal_dss.unpack(buffer, &rml_uri, &idx, OPAL_STRING))) {
> @@ -466,13 +459,26 @@
>      /* set the contact info into the hash table */
>      if (ORTE_SUCCESS != (rc = orte_rml.set_contact_info(rml_uri))) {
>          ORTE_ERROR_LOG(rc);
> -        free(rml_uri);
>          orted_failed_launch = true;
>          goto CLEANUP;
>      }
> -    /* lookup and record this daemon's contact info */
> -    pdatorted[mev->sender.vpid]->rml_uri = strdup(rml_uri);
> -    free(rml_uri);
> +
> +    rc = orte_rml_base_parse_uris(rml_uri, &peer, NULL );
> +    if( ORTE_SUCCESS != rc ) {
> +        ORTE_ERROR_LOG(rc);
> +        orted_failed_launch = true;
> +        goto CLEANUP;
> +    }
> +
> +    OPAL_OUTPUT_VERBOSE((5, orte_plm_globals.output,
> +                         "%s plm:base:orted_report_launch from daemon %s via %s",
> +                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
> +                         ORTE_NAME_PRINT(&peer),
> +                         ORTE_NAME_PRINT(&mev->sender)));
> +
> +    /* update state and record for this daemon contact info */
> +    pdatorted[peer.vpid]->state = ORTE_PROC_STATE_RUNNING;
> +    pdatorted[peer.vpid]->rml_uri = rml_uri;
>
>      /* get the remote arch */
>      idx = 1;
> @@ -555,31 +561,33 @@
>
>      /* lookup the node */
>      nodes = (orte_node_t**)orte_node_pool->addr;
> -    if (NULL == nodes[mev->sender.vpid]) {
> +    if (NULL == nodes[peer.vpid]) {
>          ORTE_ERROR_LOG(ORTE_ERR_NOT_FOUND);
>          orted_failed_launch = true;
>          goto CLEANUP;
>      }
>      /* store the arch */
> -    nodes[mev->sender.vpid]->arch = arch;
> +    nodes[peer.vpid]->arch = arch;
>
>      /* if a tree-launch is underway, send the cmd back */
>      if (NULL != orte_tree_launch_cmd) {
> -        orte_rml.send_buffer(&mev->sender, orte_tree_launch_cmd, ORTE_RML_TAG_DAEMON, 0);
> +        orte_rml.send_buffer(&peer, orte_tree_launch_cmd, ORTE_RML_TAG_DAEMON, 0);
>      }
>
>  CLEANUP:
>
>      OPAL_OUTPUT_VERBOSE((5, orte_plm_globals.output,
> -                         "%s plm:base:orted_report_launch %s for daemon %s at contact %s",
> +                         "%s plm:base:orted_report_launch %s for daemon %s (via %s) at contact %s",
>                           ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>                           orted_failed_launch ? "failed" : "completed",
> -                         ORTE_NAME_PRINT(&mev->sender), pdatorted[mev->sender.vpid]->rml_uri));
> +                         ORTE_NAME_PRINT(&peer),
> +                         ORTE_NAME_PRINT(&mev->sender), pdatorted[peer.vpid]->rml_uri));
>
>      /* release the message */
>      OBJ_RELEASE(mev);
>
>      if (orted_failed_launch) {
> +        if( NULL != rml_uri ) free(rml_uri);
>          orte_errmgr.incomplete_start(ORTE_PROC_MY_NAME->jobid, ORTE_ERROR_DEFAULT_EXIT_CODE);
>      } else {
>          orted_num_callback++;
> @@ -1133,18 +1141,23 @@
>       * being sure to "purge" any that would cause problems
>       * on backend nodes
>       */
> -    if (ORTE_PROC_IS_HNP) {
> +    if (ORTE_PROC_IS_HNP || ORTE_PROC_IS_DAEMON) {
>          cnt = opal_argv_count(orted_cmd_line);
>          for (i=0; i < cnt; i+=3) {
> -            /* if the specified option is more than one word, we don't
> -             * have a generic way of passing it as some environments ignore
> -             * any quotes we add, while others don't - so we ignore any
> -             * such options. In most cases, this won't be a problem as
> -             * they typically only apply to things of interest to the HNP.
> -             * Individual environments can add these back into the cmd line
> -             * as they know if it can be supported
> -             */
> -            if (NULL != strchr(orted_cmd_line[i+2], ' ')) {
> +            /* in the rsh environment, we can append multi-word arguments
> +             * by enclosing them in quotes. Check for any multi-word
> +             * mca params passed to mpirun and include them
> +             */
> +            if (NULL != strchr(orted_cmd_line[i+2], ' ')) {
> +                char* param;
> +
> +                /* must add quotes around it */
> +                asprintf(&param, "\"%s\"", orted_cmd_line[i+2]);
> +                /* now pass it along */
> +                opal_argv_append(argc, argv, orted_cmd_line[i]);
> +                opal_argv_append(argc, argv, orted_cmd_line[i+1]);
> +                opal_argv_append(argc, argv, param);
> +                free(param);
>                  continue;
>              }
>              /* The daemon will attempt to open the PLM on the remote
> _______________________________________________
> svn mailing list
> svn@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/svn

--
Jeff Squyres
Cisco Systems
_______________________________________________
devel mailing list
devel@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel