Note that exporting LD_LIBRARY_PATH on the mpirun command line (with -x) does not necessarily apply to launching the remote orteds. It applies to launching the remote MPI processes, which are children of the orteds.
Since you're using ssh, you might want to check the shell startup scripts on the target nodes (e.g., .bashrc). It's not sufficient for them to merely not overwrite LD_LIBRARY_PATH; make sure they actually set it to the directory containing the Intel support libraries (e.g., libimf.so).
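To find which node is at fault, one option is to ask each node directly what a non-interactive ssh shell sees. That is roughly the environment the ssh launcher starts orted in. Below is a minimal bash sketch. It assumes password-less ssh to every node and a PBS-style machinefile (one host per line, optionally followed by slots=N). The script name check_orted_libs.sh and the ORTED override variable are hypothetical. For each host, it reports any libraries that ldd cannot resolve for orted, together with the remote LD_LIBRARY_PATH.

```shell
#!/usr/bin/env bash
# Sketch: find which node(s) cannot resolve orted's shared libraries
# (e.g. libimf.so) in a NON-interactive ssh shell, i.e. roughly the
# environment orted is launched in. Assumes password-less ssh.
set -euo pipefail

# orted path taken from the error message; override with ORTED=... if needed.
ORTED="${ORTED:-/release/cfd/openmpi-intel/bin/orted}"

# Print each host in a machinefile once (ignores "slots=N" suffixes and blank lines).
unique_hosts() {
  awk 'NF { print $1 }' "$1" | sort -u
}

# Read ldd output on stdin; print only the libraries ldd could not resolve.
missing_libs() {
  grep 'not found' || true
}

check_nodes() {
  local host remote_path ldd_out missing rc
  local -a hosts
  mapfile -t hosts < <(unique_hosts "$1")
  for host in "${hosts[@]}"; do
    rc=0
    # -n: don't let ssh swallow this script's stdin.
    # BatchMode=yes: fail instead of prompting for a password.
    ldd_out=$(ssh -n -o BatchMode=yes "$host" ldd "$ORTED" 2>&1) || rc=$?
    if (( rc == 255 )); then   # ssh itself failed (unreachable host, auth, ...)
      printf '%s: ssh failed\n' "$host"
      continue
    fi
    # Single quotes: $LD_LIBRARY_PATH is expanded by the REMOTE shell.
    remote_path=$(ssh -n -o BatchMode=yes "$host" 'echo "$LD_LIBRARY_PATH"' 2>/dev/null || true)
    missing=$(printf '%s\n' "$ldd_out" | missing_libs)
    if [[ -n "$missing" ]]; then
      printf '%s: MISSING (remote LD_LIBRARY_PATH=%s)\n%s\n' "$host" "$remote_path" "$missing"
    elif (( rc != 0 )); then
      printf '%s: ldd failed (exit %d): %s\n' "$host" "$rc" "$ldd_out"
    else
      printf '%s: ok\n' "$host"
    fi
  done
}

# Only run when given a machinefile, e.g.: bash check_orted_libs.sh "$PBS_NODEFILE"
if [[ $# -gt 0 ]]; then
  check_nodes "$1"
fi
```

For example, run it from inside the job script with the same node file that is passed to --machinefile (PBS normally exposes it as $PBS_NODEFILE). A host that prints "libimf.so => not found" is the likely culprit.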
You might also want to check that your .bashrc sets LD_LIBRARY_PATH (or PATH, or ...) before it exits for non-interactive shells, not after. This is a common optimization trick in shell startup files: exit early upon detecting a non-interactive shell, and thereby skip a bunch of stuff that presumably is only needed when you log in interactively (e.g., creating shell aliases and the like).
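A minimal sketch of an ordering that avoids this problem follows. It assumes bash and a .bashrc that uses the early-exit trick. The Intel library directory is a hypothetical placeholder, not the actual location on your cluster. Bash normally reads ~/.bashrc even for non-interactive commands run via sshd, so anything exported before the early exit is visible to orted. Anything exported after it is not.

```shell
# ~/.bashrc (sketch)

# 1. Environment that non-interactive shells (and hence orted) need goes FIRST.
#    /opt/intel/lib/intel64 is a HYPOTHETICAL path; use wherever libimf.so
#    actually lives on your nodes.
export LD_LIBRARY_PATH="/opt/intel/lib/intel64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# 2. The common early-exit trick: stop here for non-interactive shells.
case $- in
  *i*) ;;       # interactive: keep going
  *) return ;;  # non-interactive: nothing below this line is reached
esac

# 3. Interactive-only settings (aliases, prompt, ...).
#    An LD_LIBRARY_PATH export placed down here would be invisible to orted.
alias ll='ls -l'
```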
Random question: is there a reason you're not using Torque support? When you use Torque support, Torque will automatically copy your current environment -- including LD_LIBRARY_PATH -- to the target node before launching orted. Hence, it can actually make LD_LIBRARY_PATH issues like this easier to deal with.
On Dec 14, 2012, at 3:17 PM, Blosch, Edwin L wrote:
> I am having a weird problem launching cases with OpenMPI 1.4.3. It is most likely a problem with a particular node of our cluster, as the jobs run fine on some submissions but not on others. It seems to depend on the node list. I'm just having trouble diagnosing which node it is, and what the nature of its problem is.
> One or perhaps more of the orteds are reporting that they cannot find an Intel Math library. The error is:
> /release/cfd/openmpi-intel/bin/orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
> I've checked the environment just before launching mpirun, and LD_LIBRARY_PATH includes the directory where the Intel shared libraries are located. Furthermore, my mpirun command line says to export the LD_LIBRARY_PATH variable:
> Executing ['/release/cfd/openmpi-intel/bin/mpirun', '--machinefile /var/spool/PBS/aux/20761.maruhpc4-mgt', '-np 160', '-x LD_LIBRARY_PATH', '-x MPI_ENVIRONMENT=1', '/tmp/fv420761.maruhpc4-mgt/falconv4_openmpi_jsgl', '-v', '-cycles', '10000', '-ri', 'restart.1', '-ro', '/tmp/fv420761.maruhpc4-mgt/restart.1']
> My shell-initialization script (.bashrc) does not overwrite LD_LIBRARY_PATH. OpenMPI is built explicitly --without-torque and should be using ssh to launch the orted.
> What options can I add to get more debugging of problems launching orted?