Ah, now that's a slightly different failure mode from the one in your original description. If it works without CR enabled, then the launcher is working just fine. The problem is in the checkpoint/restart integration.

There are some things that get initialized differently under CR, but I have no idea what they do or why they would have a problem when launched by LSF. I'm afraid our CR person has moved on to other pastures, so there is little we can do to help at this stage.

If you can run it with rsh, then perhaps that would be adequate? Afraid that's the best I can offer :-(

On Mar 31, 2013, at 1:18 AM, Jorge Naranjo Bouzas <jonarbo@gmail.com> wrote:

Hi Ralph!

Thanks for your quick response ... 

I also tried version 1.4.5, with the same result.

What changes did you make to version 1.7? Would they apply to 1.6 as well?

I have narrowed the failure down to the file 

specifically, the "lsb_launch" call:

 if (lsb_launch(nodelist_argv, argv, LSF_DJOB_REPLACE_ENV | LSF_DJOB_NOWAIT, env) < 0) {

Apparently, when CR is enabled, this call stalls and the code seems to keep waiting in:

   /* wait for daemons to callback */
   if (ORTE_SUCCESS !=
       (rc = orte_plm_base_daemon_callback(map->num_new_daemons))) {
       OPAL_OUTPUT_VERBOSE((1, orte_plm_globals.output,
                            "%s plm:lsf: daemon launch failed for job %s on error %s",
                            ORTE_JOBID_PRINT(active_job), ORTE_ERROR_NAME(rc)));
       goto cleanup;

For both cases (with and without the "-am ft-enable-cr" option) I have dumped the values of "nodelist_argv", "argv" and "env", and the only significant differences (other than the PID or JOBID) are:

< argv = orted -mca ess lsf -mca orte_ess_jobid 2036989952 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "2036989952.0;tcp://;tcp://"
> argv = orted -mca ess lsf -mca orte_ess_jobid 2116943872 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "2116943872.0;tcp://;tcp://" -mca mca_base_param_file_prefix ft-enable-cr -mca mca_base_param_file_path /opt/share/mpi-openmpi/1.4.5-gcc-4.6.0/el6/x86_64/share/openmpi/amca-param-sets:/home/naranjja/Tests/BLCR -mca mca_base_param_file_path_force /home/naranjja/Tests/BLCR


> OMPI_MCA_mca_base_param_file_prefix=ft-enable-cr

When run without "-am ft-enable-cr" it works but when I enable it, the processes are never started ... :S
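
A comparison like the one above can be automated with a small shell helper. This is only a sketch: `extra_args` is a hypothetical name, the two argv strings are shortened versions of the dumps quoted above (job IDs and the --hnp-uri value left out), and the matching is a plain whitespace-token comparison that assumes no token contains shell glob characters.

```shell
#!/bin/sh
# Sketch: list the tokens that appear in one orted command line but not in
# a baseline one. extra_args is a hypothetical helper, not part of Open MPI.

extra_args() {
    # $1 = baseline argv string, $2 = argv string to compare against it
    for tok in $2; do
        case " $1 " in
            *" $tok "*) ;;                 # token also in baseline: skip it
            *) printf '%s\n' "$tok" ;;     # token only in the second argv
        esac
    done
}

# Shortened argv strings based on the dumps in this message.
argv_plain='orted -mca ess lsf -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2'
argv_cr='orted -mca ess lsf -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 -mca mca_base_param_file_prefix ft-enable-cr'

extra_args "$argv_plain" "$argv_cr"
```

Repeated tokens such as "-mca" are filtered out because they also occur in the baseline, so only the parameter names and values that are new in the CR case are printed.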



On Sat, Mar 30, 2013 at 6:29 PM, Ralph Castain <rhc@open-mpi.org> wrote:
It is quite likely that the LSF integration in the 1.6 series is broken. We don't have a way to test it any more (all our LSF access is gone). I was recently given brief access to an LSF machine and fixed it for the 1.7 series, but that series doesn't support checkpoint/restart.

On Mar 30, 2013, at 1:01 AM, Jorge Naranjo Bouzas <jonarbo@gmail.com> wrote:


We are having problems integrating BLCR + OpenMPI + LSF on a Linux cluster with InfiniBand.

We compiled OpenMPI version 1.6 with gcc version 4.6.0 ... The configure line was like:

./configure --prefix=/opt/share/mpi-openmpi/1.6-gcc-4.6.0/el6/x86_64 --with-lsf --with-openib --with-blcr=/opt/share/blcrv0.8.4.app/ --with-ft=cr --enable-ft-thread --enable-opal-multi-threads --with-psm

The problem I am having is that for some reason the ft-enable-cr feature freezes my MPI application when I use more than one node. The job is never started ...

We narrowed the search down and noticed that when mpirun is used outside the batch system, it works... but as soon as mpirun detects the environment variable LSB_JOBID and assumes it is running under LSF, the problem arises... Additionally, if we use "--mca plm rsh", which should deactivate the LSF integration, it works again, as expected...
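
The cases compared in the paragraph above can be written down as a short shell sketch. It only checks for LSB_JOBID and prints the candidate command lines; nothing is launched. The application name ./my_app, the process count "-np 2", and the helper names in_lsf_job and mpirun_variants are placeholders chosen for illustration, while "-am ft-enable-cr" and "--mca plm rsh" are the options mentioned in this thread.

```shell
#!/bin/sh
# Sketch: report whether the shell appears to be inside an LSF job and print
# the mpirun variants discussed in this thread. No process is started.
# ./my_app and -np 2 are placeholders, not values from the original report.

in_lsf_job() {
    # The report says mpirun assumes LSF when LSB_JOBID is set.
    [ -n "${LSB_JOBID:-}" ]
}

mpirun_variants() {
    app="${1:-./my_app}"
    echo "mpirun -np 2 $app"                                  # no CR
    echo "mpirun -np 2 -am ft-enable-cr $app"                 # CR, default launcher
    echo "mpirun -np 2 -am ft-enable-cr --mca plm rsh $app"   # CR, rsh launcher
}

if in_lsf_job; then
    echo "LSB_JOBID=$LSB_JOBID: LSF launcher expected unless overridden"
else
    echo "LSB_JOBID not set: running outside the batch system"
fi
mpirun_variants
```

Running each printed command both inside and outside a batch job gives the matrix of cases described above.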

So our guess is: either there is something misconfigured in our LSF, or there is a problem in the plm module inside OpenMPI ...

Any hint???


Jorge Naranjo

users mailing list