We are having problems integrating BLCR + OpenMPI + LSF on a Linux cluster with InfiniBand.
We compiled OpenMPI version 1.6 with gcc version 4.6.0 ... The configure line was like:
./configure --prefix=/opt/share/mpi-openmpi/1.6-gcc-4.6.0/el6/x86_64 --with-lsf --with-openib --with-blcr=/opt/share/blcrv0.8.4.app/ --with-ft=cr --enable-ft-thread --enable-opal-multi-threads --with-psm
The problem I am having is that, for some reason, the ft-enable-cr feature freezes my MPI application when I use more than one node. The job is never started ...
We narrowed the search down and noticed that when mpirun is used outside the batch system, it works. But as soon as mpirun detects the environment variable LSB_JOBID and assumes it is running under LSF, the problem arises. Additionally, if we use "--mca plm rsh", which should deactivate the LSF integration, it works again, as expected.
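For anyone who wants to reproduce the comparison, here is a minimal dry-run sketch of the two launch variants described above. It only prints the mpirun command instead of executing it. The application name ./my_mpi_app and the process count are hypothetical placeholders, and "-am ft-enable-cr" is an assumption about how the checkpoint/restart feature is being enabled on the command line; only "--mca plm rsh" and the LSB_JOBID check come from the description above.

```shell
#!/bin/sh
# Dry-run helper: prints the mpirun command rather than running it.
# Assumptions (not from the original post): the application path, the
# process count, and the use of "-am ft-enable-cr" to enable C/R.
build_mpirun_cmd() {
    app="$1"
    nprocs="$2"
    if [ -n "${LSB_JOBID:-}" ]; then
        # Inside an LSF job: force the rsh launcher so the LSF plm
        # component is bypassed (the workaround reported above).
        echo "mpirun -np $nprocs -am ft-enable-cr --mca plm rsh $app"
    else
        # Outside the batch system: default launcher selection.
        echo "mpirun -np $nprocs -am ft-enable-cr $app"
    fi
}

build_mpirun_cmd ./my_mpi_app 4
```

Adding "--mca plm_base_verbose 5" to the failing variant may show which launcher component is selected and where startup stalls, although we have not confirmed that it pinpoints this particular hang.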
So our guess is: either there is something misconfigured in our LSF, or there is a problem in the plm module inside OpenMPI ...