Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Bug report in plm_lsf_module.c
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-04-26 14:34:53


Appreciate your input! None of the developers have access to an LSF machine any more, so we can't test it :-/

What version of OMPI does this patch apply to? I can go ahead and add it - just want to know if it should just go to the trunk and 1.5 series, or also the 1.4 series.

Thanks again!
Ralph

On Apr 26, 2010, at 12:06 PM, Teng Lin wrote:

> Hi,
>
> We recently identified a bug on our LSF cluster.
> The job always hangs when all of the LSF-related components are present. It runs fine once all of the LSF-related components are removed.
>
> The following messages are from stdout:
> [xxxx:24930] mca: base: components_open: Looking for ess components
> [xxxx:24930] mca: base: components_open: opening ess components
> [xxxx:24930] mca: base: components_open: found loaded component env
> [xxxx:24930] mca: base: components_open: component env has no register function
> [xxxx:24930] mca: base: components_open: component env open function successful
> [xxxx:24930] mca: base: components_open: found loaded component hnp
> [xxxx:24930] mca: base: components_open: component hnp has no register function
> [xxxx:24930] mca: base: components_open: component hnp open function successful
> [xxxx:24930] mca: base: components_open: found loaded component lsf
> [xxxx:24930] mca: base: components_open: component lsf has no register function
> [xxxx:24930] mca: base: components_open: component lsf open function successful
> [xxxx:24930] mca: base: components_open: found loaded component singleton
> [xxxx:24930] mca: base: components_open: component singleton has no register function
> [xxxx:24930] mca: base: components_open: component singleton open function successful
> [xxxx:24930] mca: base: components_open: found loaded component slurm
> [xxxx:24930] mca: base: components_open: component slurm has no register function
> [xxxx:24930] mca: base: components_open: component slurm open function successful
> [xxxx:24930] mca: base: components_open: found loaded component tool
> [xxxx:24930] mca: base: components_open: component tool has no register function
> [xxxx:24930] mca: base: components_open: component tool open function successful
> [xxxx:24930] mca: base: components_open: Looking for plm components
> [xxxx:24930] mca: base: components_open: opening plm components
> [xxxx:24930] mca: base: components_open: found loaded component lsf
> [xxxx:24930] mca: base: components_open: component lsf has no register function
> [xxxx:24930] mca: base: components_open: component lsf open function successful
> [xxxx:24930] mca: base: components_open: found loaded component rsh
> [xxxx:24930] mca: base: components_open: component rsh has no register function
> [xxxx:24930] mca: base: components_open: component rsh open function successful
> [xxxx:24930] mca: base: components_open: found loaded component slurm
> [xxxx:24930] mca: base: components_open: component slurm has no register function
> [xxxx:24930] mca: base: components_open: component slurm open function successful
> [xxxx:24930] mca:base:select: Auto-selecting plm components
> [xxxx:24930] mca:base:select:( plm) Querying component [lsf]
> [xxxx:24930] mca:base:select:( plm) Query of component [lsf] set priority to 75
> [xxxx:24930] mca:base:select:( plm) Querying component [rsh]
> [xxxx:24930] mca:base:select:( plm) Query of component [rsh] set priority to 10
> [xxxx:24930] mca:base:select:( plm) Querying component [slurm]
> [xxxx:24930] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
> [xxxx:24930] mca:base:select:( plm) Selected component [lsf]
> [xxxx:24930] mca: base: close: component rsh closed
> [xxxx:24930] mca: base: close: unloading component rsh
> [xxxx:24930] mca: base: close: component slurm closed
> [xxxx:24930] mca: base: close: unloading component slurm
> [xxxx:24930] mca: base: components_open: Looking for rml components
> [xxxx:24930] mca: base: components_open: opening rml components
> [xxxx:24930] mca: base: components_open: found loaded component oob
> [xxxx:24930] mca: base: components_open: component oob has no register function
> [xxxx:24930] mca: base: components_open: Looking for oob components
> [xxxx:24930] mca: base: components_open: opening oob components
> [xxxx:24930] mca: base: components_open: found loaded component tcp
> [xxxx:24930] mca: base: components_open: component tcp has no register function
> [xxxx:24930] mca: base: components_open: component tcp open function successful
> [xxxx:24930] mca: base: components_open: component oob open function successful
> [xxxx:24930] orte_rml_base_select: initializing rml component oob
> [xxxx:24930] mca: base: components_open: Looking for ras components
> [xxxx:24930] mca: base: components_open: opening ras components
> [xxxx:24930] mca: base: components_open: found loaded component lsf
> [xxxx:24930] mca: base: components_open: component lsf has no register function
> [xxxx:24930] mca: base: components_open: component lsf open function successful
> [xxxx:24930] mca: base: components_open: found loaded component slurm
> [xxxx:24930] mca: base: components_open: component slurm has no register function
> [xxxx:24930] mca: base: components_open: component slurm open function successful
> [xxxx:24930] mca:base:select: Auto-selecting ras components
> [xxxx:24930] mca:base:select:( ras) Querying component [lsf]
> [xxxx:24930] mca:base:select:( ras) Query of component [lsf] set priority to 75
> [xxxx:24930] mca:base:select:( ras) Querying component [slurm]
> [xxxx:24930] mca:base:select:( ras) Skipping component [slurm]. Query failed to return a module
> [xxxx:24930] mca:base:select:( ras) Selected component [lsf]
> [xxxx:24930] mca: base: close: unloading component slurm
> [xxxx:24930] plm:lsf: final top-level argv:
> [xxxx:24930] plm:lsf: orted -mca ess lsf -mca orte_ess_jobid 2605449216 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "2605449216.0;tcp://xxx.xxx.xxx.xxx:57649"
>
>
> The following messages are from the res daemon's log file:
> Apr 22 15:52:01 2010 6540 3 7.06 execAtask_: lsfExecvp() failed.
> Apr 22 15:52:01 2010 6540 3 7.06 rexecChild: execAtask_() failed, No such file or directory.
>
> These messages suggest that orted is not in the path.
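A minimal sketch of the kind of PATH search an execvp-style call performs, to illustrate why a missing directory in the remote environment ends in "No such file or directory". The helper name `find_in_path` is hypothetical and is not part of LSF or Open MPI; the actual lookup on the remote node is done by LSF, whose internals are not shown in this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: search a colon-separated PATH-style list for a
 * regular, executable file called `name`.
 * Returns 1 and writes the full path into `out` on success.
 * Returns 0 and sets `out` to the empty string if nothing is found. */
int find_in_path(const char *name, const char *path_list,
                 char *out, size_t outlen)
{
    assert(name != NULL && out != NULL && outlen > 0);
    const char *p = path_list;
    while (p != NULL) {
        const char *colon = strchr(p, ':');
        size_t len = colon ? (size_t)(colon - p) : strlen(p);
        /* An empty entry in the list means the current directory. */
        int n = (len == 0)
            ? snprintf(out, outlen, "./%s", name)
            : snprintf(out, outlen, "%.*s/%s", (int)len, p, name);
        struct stat st;
        if (n > 0 && (size_t)n < outlen &&
            stat(out, &st) == 0 && S_ISREG(st.st_mode) &&
            access(out, X_OK) == 0) {
            return 1;
        }
        p = colon ? colon + 1 : NULL;
    }
    out[0] = '\0';
    return 0;
}
```

Calling `find_in_path("orted", getenv("PATH"), buf, sizeof buf)` in the environment the remote task actually receives would show whether the Open MPI bin directory is visible there.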
>
> Applying the patch below seems to fix the problem.
>
> --- plm_lsf_module.c.orig 2010-04-26 13:27:59.699974000 -0400
> +++ plm_lsf_module.c 2010-04-26 10:58:24.719737000 -0400
> @@ -304,7 +304,7 @@
> * orterun can do the rest of its stuff. Instead, we'll catch any
> * failures and deal with them elsewhere
> */
> - if (lsb_launch(nodelist_argv, argv, LSF_DJOB_NOWAIT, env) < 0) {
> + if (lsb_launch(nodelist_argv, argv, LSF_DJOB_REPLACE_ENV | LSF_DJOB_NOWAIT, env) < 0) {
> ORTE_ERROR_LOG(ORTE_ERR_FAILED_TO_START);
> opal_output(0, "lsb_launch failed: %d", rc);
> rc = ORTE_ERR_FAILED_TO_START;
>
> If the LSF_DJOB_REPLACE_ENV option is specified, envp entries will overwrite all existing environment values except those needed by LSF.
> If the function fails, lsberrno is set to indicate the error. It would be useful if we can
> One thing we cannot guarantee is that orted is in the path on the remote node. LSF_DJOB_REPLACE_ENV can certainly be used to overcome this, but it may also have some side effects.
>
> A few things are still not quite clear to us. lsb_launch is supposed to return a negative number on failure, and we are not sure why it did not in our case.
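One detail visible in the quoted hunk: the result of lsb_launch is only compared against zero there, so, at least in the lines shown, the `rc` printed by the following opal_output call is not the launcher's return value. A minimal sketch of capturing the return value before reporting it follows. It uses a function pointer with the same four-argument shape as the call in the patch so that it compiles without the LSF headers. `launch_fn`, `launch_and_report`, `stub_fail` and `stub_ok` are all hypothetical names, not LSF or Open MPI APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in type with the same argument shape as the
 * lsb_launch(nodelist_argv, argv, options, env) call in the patch. */
typedef int (*launch_fn)(char **where, char **argv, int options, char **envp);

/* Call the launcher, keep its return value, and format a message from
 * that value, so the number reported is the one actually returned. */
int launch_and_report(launch_fn launch, char **where, char **argv,
                      int options, char **envp, char *msg, size_t msglen)
{
    assert(launch != NULL && msg != NULL && msglen > 0);
    int rc = launch(where, argv, options, envp);
    if (rc < 0) {
        snprintf(msg, msglen, "launch failed: %d", rc);
    } else {
        msg[0] = '\0';
    }
    return rc;
}

/* Stubs used only to exercise the wrapper without an LSF installation. */
int stub_fail(char **where, char **argv, int options, char **envp)
{
    (void)where; (void)argv; (void)options; (void)envp;
    return -1;
}

int stub_ok(char **where, char **argv, int options, char **envp)
{
    (void)where; (void)argv; (void)options; (void)envp;
    return 0;
}
```

In real code the launcher would be lsb_launch itself, and the error text would more usefully come from lsberrno, which the documentation quoted above says is set on failure.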
>
>
> We are not sure whether this is related in some way to changeset 19033 (https://svn.open-mpi.org/trac/ompi/changeset/19033).
>
>
> Teng