Open MPI User's Mailing List Archives

Subject: [OMPI users] Bug report in plm_lsf_module.c
From: Teng Lin (teng.lin_at_[hidden])
Date: 2010-04-26 14:06:56


Hi,

We recently identified a bug on our LSF cluster.
The job always hangs if all LSF-related components are present. One observation is that the job works fine after removing all LSF-related components.

Below are the messages from stdout:
[xxxx:24930] mca: base: components_open: Looking for ess components
[xxxx:24930] mca: base: components_open: opening ess components
[xxxx:24930] mca: base: components_open: found loaded component env
[xxxx:24930] mca: base: components_open: component env has no register function
[xxxx:24930] mca: base: components_open: component env open function successful
[xxxx:24930] mca: base: components_open: found loaded component hnp
[xxxx:24930] mca: base: components_open: component hnp has no register function
[xxxx:24930] mca: base: components_open: component hnp open function successful
[xxxx:24930] mca: base: components_open: found loaded component lsf
[xxxx:24930] mca: base: components_open: component lsf has no register function
[xxxx:24930] mca: base: components_open: component lsf open function successful
[xxxx:24930] mca: base: components_open: found loaded component singleton
[xxxx:24930] mca: base: components_open: component singleton has no register function
[xxxx:24930] mca: base: components_open: component singleton open function successful
[xxxx:24930] mca: base: components_open: found loaded component slurm
[xxxx:24930] mca: base: components_open: component slurm has no register function
[xxxx:24930] mca: base: components_open: component slurm open function successful
[xxxx:24930] mca: base: components_open: found loaded component tool
[xxxx:24930] mca: base: components_open: component tool has no register function
[xxxx:24930] mca: base: components_open: component tool open function successful
[xxxx:24930] mca: base: components_open: Looking for plm components
[xxxx:24930] mca: base: components_open: opening plm components
[xxxx:24930] mca: base: components_open: found loaded component lsf
[xxxx:24930] mca: base: components_open: component lsf has no register function
[xxxx:24930] mca: base: components_open: component lsf open function successful
[xxxx:24930] mca: base: components_open: found loaded component rsh
[xxxx:24930] mca: base: components_open: component rsh has no register function
[xxxx:24930] mca: base: components_open: component rsh open function successful
[xxxx:24930] mca: base: components_open: found loaded component slurm
[xxxx:24930] mca: base: components_open: component slurm has no register function
[xxxx:24930] mca: base: components_open: component slurm open function successful
[xxxx:24930] mca:base:select: Auto-selecting plm components
[xxxx:24930] mca:base:select:( plm) Querying component [lsf]
[xxxx:24930] mca:base:select:( plm) Query of component [lsf] set priority to 75
[xxxx:24930] mca:base:select:( plm) Querying component [rsh]
[xxxx:24930] mca:base:select:( plm) Query of component [rsh] set priority to 10
[xxxx:24930] mca:base:select:( plm) Querying component [slurm]
[xxxx:24930] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[xxxx:24930] mca:base:select:( plm) Selected component [lsf]
[xxxx:24930] mca: base: close: component rsh closed
[xxxx:24930] mca: base: close: unloading component rsh
[xxxx:24930] mca: base: close: component slurm closed
[xxxx:24930] mca: base: close: unloading component slurm
[xxxx:24930] mca: base: components_open: Looking for rml components
[xxxx:24930] mca: base: components_open: opening rml components
[xxxx:24930] mca: base: components_open: found loaded component oob
[xxxx:24930] mca: base: components_open: component oob has no register function
[xxxx:24930] mca: base: components_open: Looking for oob components
[xxxx:24930] mca: base: components_open: opening oob components
[xxxx:24930] mca: base: components_open: found loaded component tcp
[xxxx:24930] mca: base: components_open: component tcp has no register function
[xxxx:24930] mca: base: components_open: component tcp open function successful
[xxxx:24930] mca: base: components_open: component oob open function successful
[xxxx:24930] orte_rml_base_select: initializing rml component oob
[xxxx:24930] mca: base: components_open: Looking for ras components
[xxxx:24930] mca: base: components_open: opening ras components
[xxxx:24930] mca: base: components_open: found loaded component lsf
[xxxx:24930] mca: base: components_open: component lsf has no register function
[xxxx:24930] mca: base: components_open: component lsf open function successful
[xxxx:24930] mca: base: components_open: found loaded component slurm
[xxxx:24930] mca: base: components_open: component slurm has no register function
[xxxx:24930] mca: base: components_open: component slurm open function successful
[xxxx:24930] mca:base:select: Auto-selecting ras components
[xxxx:24930] mca:base:select:( ras) Querying component [lsf]
[xxxx:24930] mca:base:select:( ras) Query of component [lsf] set priority to 75
[xxxx:24930] mca:base:select:( ras) Querying component [slurm]
[xxxx:24930] mca:base:select:( ras) Skipping component [slurm]. Query failed to return a module
[xxxx:24930] mca:base:select:( ras) Selected component [lsf]
[xxxx:24930] mca: base: close: unloading component slurm
[xxxx:24930] plm:lsf: final top-level argv:
[xxxx:24930] plm:lsf: orted -mca ess lsf -mca orte_ess_jobid 2605449216 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "2605449216.0;tcp://xxx.xxx.xxx.xxx:57649"

Below are the messages from the res daemon's log file:
Apr 22 15:52:01 2010 6540 3 7.06 execAtask_: lsfExecvp() failed.
Apr 22 15:52:01 2010 6540 3 7.06 rexecChild: execAtask_() failed, No such file or directory.

The above messages suggest that orted is not in the path on the remote node.
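
One way to check this hypothesis on a compute node is to search the PATH seen there for orted. The following is only a minimal sketch: find_in_path is a hypothetical helper written for this message, not part of Open MPI or LSF, and it assumes a POSIX system.

```c
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper (not part of Open MPI or LSF): search the
 * colon-separated directory list `path` for an executable regular file
 * called `name`. Returns 0 and writes the full path to `out` on success,
 * -1 (with `out` set to an empty string) otherwise. */
int find_in_path(const char *name, const char *path, char *out, size_t outlen)
{
    if (name == NULL || path == NULL || out == NULL || outlen == 0) {
        return -1;
    }

    const char *p = path;
    for (;;) {
        const char *end = strchr(p, ':');
        size_t len = end ? (size_t)(end - p) : strlen(p);
        int n;

        if (len == 0) {
            /* An empty PATH element means the current directory. */
            n = snprintf(out, outlen, "./%s", name);
        } else {
            n = snprintf(out, outlen, "%.*s/%s", (int)len, p, name);
        }

        /* Accept only candidates that fit in the buffer, are regular
         * files, and are executable by the calling user. */
        struct stat st;
        if (n > 0 && (size_t)n < outlen &&
            stat(out, &st) == 0 && S_ISREG(st.st_mode) &&
            access(out, X_OK) == 0) {
            return 0;
        }

        if (end == NULL) {
            break;
        }
        p = end + 1;
    }

    out[0] = '\0';
    return -1;
}
```

Calling find_in_path("orted", getenv("PATH"), buf, sizeof(buf)) from a small program started under the same LSF job would show whether the remote environment can resolve orted at all.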

Applying the patch below seems to fix the problem.

--- plm_lsf_module.c.orig 2010-04-26 13:27:59.699974000 -0400
+++ plm_lsf_module.c 2010-04-26 10:58:24.719737000 -0400
@@ -304,7 +304,7 @@
      * orterun can do the rest of its stuff. Instead, we'll catch any
      * failures and deal with them elsewhere
      */
- if (lsb_launch(nodelist_argv, argv, LSF_DJOB_NOWAIT, env) < 0) {
+ if (lsb_launch(nodelist_argv, argv, LSF_DJOB_REPLACE_ENV | LSF_DJOB_NOWAIT, env) < 0) {
         ORTE_ERROR_LOG(ORTE_ERR_FAILED_TO_START);
         opal_output(0, "lsb_launch failed: %d", rc);
         rc = ORTE_ERR_FAILED_TO_START;

If the LSF_DJOB_REPLACE_ENV option is specified, envp entries will overwrite all existing environment values except those needed by LSF.
If the function fails, lsberrno is set to indicate the error. It would be useful if we can
One thing we cannot guarantee is that orted is in the path on the remote node. LSF_DJOB_REPLACE_ENV can certainly be used to overcome this, but it may also have some side effects.
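
With LSF_DJOB_REPLACE_ENV, the env array handed to lsb_launch decides what PATH the remote side sees, so it has to contain a PATH entry that includes the directory holding orted. The sketch below builds such an entry. make_path_entry is a hypothetical helper, and the directory used in the usage note is only an example; this is an assumption about how the entry could be prepared, not what plm_lsf_module.c currently does.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a newly allocated "PATH=<dir>:<old_path>"
 * string, or "PATH=<dir>" when old_path is NULL or empty. Returns NULL
 * if dir is NULL or empty, or if allocation fails. The caller frees the
 * result. */
char *make_path_entry(const char *dir, const char *old_path)
{
    if (dir == NULL || dir[0] == '\0') {
        return NULL;
    }

    int have_old = (old_path != NULL && old_path[0] != '\0');
    size_t len = strlen("PATH=") + strlen(dir) + 1; /* +1 for the NUL */
    if (have_old) {
        len += 1 + strlen(old_path);                /* +1 for the ':' */
    }

    char *entry = malloc(len);
    if (entry == NULL) {
        return NULL;
    }

    if (have_old) {
        snprintf(entry, len, "PATH=%s:%s", dir, old_path);
    } else {
        snprintf(entry, len, "PATH=%s", dir);
    }
    return entry;
}
```

For example, make_path_entry("/opt/openmpi/bin", getenv("PATH")) yields an entry that could replace the existing PATH element in env before the launch call. Putting the Open MPI directory first keeps the rest of the user's PATH intact, which limits the side effects mentioned above.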

There are a few things that are still not quite clear to us. lsb_launch is supposed to return a negative number on failure, and we are not sure why it did not in our case.

We are not sure whether this is related in some way to changeset 19033 (https://svn.open-mpi.org/trac/ompi/changeset/19033).

Teng