Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI + BLCR + LSF integration
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-03-30 11:29:53


It is quite likely that the LSF integration in the 1.6 series is broken. We no longer have a way to test it (all of our LSF access is gone). I was recently given brief access to an LSF machine and fixed the integration for the 1.7 series, but that series doesn't support checkpoint/restart.
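
For anyone stuck on 1.6, the workaround reported in the quoted message below (forcing the rsh launcher with "--mca plm rsh" whenever the job runs under LSF) could be wrapped roughly as follows. This is only a sketch under stated assumptions: the helper name plm_args, the process count, and the application path ./my_mpi_app are hypothetical placeholders, and the script only prints the command it would run rather than launching anything.

```shell
#!/bin/sh
# Sketch: pick extra mpirun options depending on whether we are inside an LSF job.
# LSB_JOBID is the variable the quoted message says mpirun uses to detect LSF.
# "--mca plm rsh" is the workaround from the thread; everything else is a placeholder.

plm_args() {
    if [ -n "${LSB_JOBID:-}" ]; then
        # Inside an LSF job: bypass the LSF launcher and use rsh/ssh instead.
        echo "--mca plm rsh"
    else
        # Outside the batch system: no extra options needed.
        :
    fi
}

# Hypothetical process count and binary; print the command instead of running it.
echo mpirun $(plm_args) -np 4 ./my_mpi_app
```

Replacing the final echo with a real invocation turns this into a job-script fragment. Whether the rsh launcher is acceptable at a given site (for example with respect to LSF job accounting) is a local policy question that this sketch does not address.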

On Mar 30, 2013, at 1:01 AM, Jorge Naranjo Bouzas <jonarbo_at_[hidden]> wrote:

> Hello!
>
> We are having problems integrating BLCR + OpenMPI + LSF on a Linux cluster with InfiniBand.
>
> We compiled OpenMPI version 1.6 with gcc version 4.6.0 ... The configure line was like:
>
> ./configure --prefix=/opt/share/mpi-openmpi/1.6-gcc-4.6.0/el6/x86_64 --with-lsf --with-openib --with-blcr=/opt/share/blcrv0.8.4.app/ --with-ft=cr --enable-ft-thread --enable-opal-multi-threads --with-psm
>
> The problem I am having is that, for some reason, the ft-enable-cr feature freezes my MPI application when I use more than one node. The job never starts ...
>
> We narrowed the search down and noticed that when mpirun is used outside the batch system, it works... but as soon as mpirun detects the environment variable LSB_JOBID and assumes it is running under LSF, the problem arises... Additionally, if we use "--mca plm rsh", which should deactivate the LSF integration, it works again, as expected...
>
> So our guess is: either something is misconfigured in our LSF, or there is a problem in the plm module inside OpenMPI ...
>
> Any hint???
>
> Thanks!!
>
> Jorge Naranjo