It is quite likely that the LSF integration in the 1.6 series is broken. We no longer have a way to test it (all our LSF access is gone). I was recently given brief access to an LSF machine and fixed the integration for the 1.7 series, but that series doesn't support checkpoint/restart.


On Mar 30, 2013, at 1:01 AM, Jorge Naranjo Bouzas <jonarbo@gmail.com> wrote:

Hello!

We are having problems integrating BLCR + OpenMPI + LSF on a Linux cluster with InfiniBand.

We compiled OpenMPI version 1.6 with gcc version 4.6.0. The configure line was:

./configure --prefix=/opt/share/mpi-openmpi/1.6-gcc-4.6.0/el6/x86_64 --with-lsf --with-openib --with-blcr=/opt/share/blcrv0.8.4.app/ --with-ft=cr --enable-ft-thread --enable-opal-multi-threads --with-psm

The problem is that, for some reason, the ft-enable-cr feature freezes my MPI application when I use more than one node. The job never starts...

We narrowed the search down and noticed that when mpirun is used outside the batch system, it works. But as soon as mpirun detects the environment variable LSB_JOBID and assumes it is running under LSF, the problem arises. Additionally, if we use "--mca plm rsh", which should deactivate the LSF integration, it works again, as expected.
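
A minimal sketch of the workaround described above: pick the launcher option depending on whether LSB_JOBID is set. The function name build_mpirun_cmd, the process count, and ./my_app are placeholders for illustration. The "-am ft-enable-cr" option is assumed to be how checkpoint/restart is enabled at run time in this setup, and the rsh launcher is assumed to have working rsh/ssh access between the nodes.

```shell
#!/bin/sh
# Sketch only: build an mpirun command line that bypasses the LSF
# launcher when running inside an LSF job (LSB_JOBID is set).
# build_mpirun_cmd, the process count and ./my_app are placeholders.

build_mpirun_cmd() {
    np="$1"
    app="$2"
    if [ -n "${LSB_JOBID:-}" ]; then
        # Inside an LSF job: force the rsh launcher instead of the LSF plm
        echo "mpirun --mca plm rsh -am ft-enable-cr -np $np $app"
    else
        # Outside LSF: leave launcher selection to mpirun
        echo "mpirun -am ft-enable-cr -np $np $app"
    fi
}

# Print the command instead of running it, so it can be checked first
build_mpirun_cmd 4 ./my_app
```

The command is only printed here; once it looks right for the site, it can be run with eval or pasted into the job script.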

So our guess is that either something is misconfigured in our LSF setup, or there is a problem in the plm module inside OpenMPI.

Any hints?

Thanks!!

Jorge Naranjo

_______________________________________________
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users