
Subject: Re: [OMPI users] EXTERNAL: Re: Problems with shared libraries while launching jobs
From: Blosch, Edwin L (edwin.l.blosch_at_[hidden])
Date: 2012-12-18 11:06:38


libimf.so is present on all nodes, by design. However, sometimes the simulation runs and other times it does not. I suspect that the filesystem (GPFS) where the Intel library is located may become temporarily unavailable in the failure cases. I do not suspect any problem with Open MPI itself, but I am hopeful that it can produce diagnostics that point to the root cause of the problem.
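
As a quick check, something along these lines could be run from the job script at failure time (a sketch: it assumes passwordless ssh to the compute nodes and that $PBS_NODEFILE holds the allocated node list; the paths are the ones from the ldd output quoted below):

  # Verify that libimf.so is readable and that orted resolves all of its
  # shared libraries on every allocated node.
  # Note: ssh runs a non-interactive shell, so the remote LD_LIBRARY_PATH
  # here is whatever .bashrc provides, not the job's exported value.
  for host in $(sort -u "$PBS_NODEFILE"); do
    ssh "$host" "test -r /appserv/intel/Compiler/11.1/072/lib/intel64/libimf.so" \
      || echo "$host: libimf.so not readable"
    if ssh "$host" "ldd /release/cfd/openmpi-intel/bin/orted" | grep -q 'not found'; then
      echo "$host: orted has unresolved libraries"
    fi
  done

If GPFS really does drop out intermittently, running this at the moment a job fails should show which node lost the mount.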

I have followed Ralph's advice to build with --enable-debug and am now waiting for the problem to happen again so I can see the ssh command used to launch the orted.
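
In the meantime, one possible workaround sketch (untested here): -x LD_LIBRARY_PATH only exports the variable to the application processes, not to orted itself, which is started by ssh before any of that takes effect. A small wrapper could pin the Intel runtime path for the daemon:

  #!/bin/sh
  # orted-wrapper.sh (hypothetical): prepend the Intel runtime directory so
  # orted can resolve libimf.so regardless of the remote shell's environment.
  export LD_LIBRARY_PATH=/appserv/intel/Compiler/11.1/072/lib/intel64:$LD_LIBRARY_PATH
  exec /release/cfd/openmpi-intel/bin/orted "$@"

Pointing mpirun at it with --mca orte_launch_agent /path/to/orted-wrapper.sh (assuming that MCA parameter is available in this 1.4.3 build) would at least separate a daemon-environment problem from a GPFS-availability problem.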

-----Original Message-----
From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf Of Reuti
Sent: Tuesday, December 18, 2012 4:14 AM
To: Open MPI Users
Subject: Re: [OMPI users] EXTERNAL: Re: Problems with shared libraries while launching jobs

On 17.12.2012, at 16:42, Blosch, Edwin L wrote:

> Ralph,
>
> Unfortunately I didn't see the ssh output. The output I got was pretty much as before.
>
> You know, the fact that the error message is not prefixed with a host name makes me think it could be happening on the host where PBS places the job. If there were something wrong in the user environment prior to mpirun, that would not be an Open MPI problem. And yet, in one of the jobs that failed, I also printed out the results of 'ldd' on the mpirun executable just prior to executing the command, and all the shared libraries were resolved:

You checked mpirun, but not the orted, which is missing Intel's libimf.so. Is the Intel libimf.so from the redistributable archive present on all nodes?
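
A quick way to see what the daemon will actually find (ssh runs a non-interactive shell, so the environment may differ from your login environment):

  ssh c6n39 'echo "$LD_LIBRARY_PATH"; ldd /release/cfd/openmpi-intel/bin/orted | grep libimf'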

-- Reuti

>
> ldd /release/cfd/openmpi-intel/bin/mpirun
> linux-vdso.so.1 => (0x00007fffbbb39000)
> libopen-rte.so.0 => /release/cfd/openmpi-intel/lib/libopen-rte.so.0 (0x00002abdf75d2000)
> libopen-pal.so.0 => /release/cfd/openmpi-intel/lib/libopen-pal.so.0 (0x00002abdf7887000)
> libdl.so.2 => /lib64/libdl.so.2 (0x00002abdf7b39000)
> libnsl.so.1 => /lib64/libnsl.so.1 (0x00002abdf7d3d000)
> libutil.so.1 => /lib64/libutil.so.1 (0x00002abdf7f56000)
> libm.so.6 => /lib64/libm.so.6 (0x00002abdf8159000)
> libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002abdf83af000)
> libpthread.so.0 => /lib64/libpthread.so.0 (0x00002abdf85c7000)
> libc.so.6 => /lib64/libc.so.6 (0x00002abdf87e4000)
> libimf.so => /appserv/intel/Compiler/11.1/072/lib/intel64/libimf.so (0x00002abdf8b42000)
> libsvml.so => /appserv/intel/Compiler/11.1/072/lib/intel64/libsvml.so (0x00002abdf8ed7000)
> libintlc.so.5 => /appserv/intel/Compiler/11.1/072/lib/intel64/libintlc.so.5 (0x00002abdf90ed000)
> /lib64/ld-linux-x86-64.so.2 (0x00002abdf73b1000)
>
> Hence my initial assumption that the shared-library problem was happening with one of the child processes on a remote node.
>
> So at this point I have more questions than answers. I still don't know if this message comes from the main mpirun process or one of the child processes, although it seems that it should not be the main process because of the output of ldd above.
>
> Any more suggestions are welcomed of course.
>
> Thanks
>
>
> /release/cfd/openmpi-intel/bin/mpirun \
>   --machinefile /var/spool/PBS/aux/20804.maruhpc4-mgt -np 160 \
>   -x LD_LIBRARY_PATH -x MPI_ENVIRONMENT=1 \
>   --mca plm_base_verbose 5 --leave-session-attached \
>   /tmp/fv420804.maruhpc4-mgt/test_jsgl -v -cycles 10000 -ri restart.5000 \
>   -ro /tmp/fv420804.maruhpc4-mgt/restart.5000
>
> [c6n38:16219] mca:base:select:( plm) Querying component [rsh]
> [c6n38:16219] mca:base:select:( plm) Query of component [rsh] set priority to 10
> [c6n38:16219] mca:base:select:( plm) Selected component [rsh]
> Warning: Permanently added 'c6n39' (RSA) to the list of known hosts.
> Warning: Permanently added 'c6n40' (RSA) to the list of known hosts.
> Warning: Permanently added 'c6n41' (RSA) to the list of known hosts.
> Warning: Permanently added 'c6n42' (RSA) to the list of known hosts.
> Warning: Permanently added 'c5n26' (RSA) to the list of known hosts.
> Warning: Permanently added 'c3n20' (RSA) to the list of known hosts.
> Warning: Permanently added 'c4n10' (RSA) to the list of known hosts.
> Warning: Permanently added 'c4n40' (RSA) to the list of known hosts.
> /release/cfd/openmpi-intel/bin/orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
> --------------------------------------------------------------------------
> A daemon (pid 16227) died unexpectedly with status 127 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> Warning: Permanently added 'c3n27' (RSA) to the list of known hosts.
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to the
> "orte-clean" tool for assistance.
> --------------------------------------------------------------------------
> c6n39 - daemon did not report back when launched
> c6n40 - daemon did not report back when launched
> c6n41 - daemon did not report back when launched
> c6n42 - daemon did not report back when launched
>
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf Of Ralph Castain
> Sent: Friday, December 14, 2012 2:25 PM
> To: Open MPI Users
> Subject: EXTERNAL: Re: [OMPI users] Problems with shared libraries while launching jobs
>
> Add -mca plm_base_verbose 5 --leave-session-attached to the cmd line - that will show the ssh command being used to start each orted.
>
> On Dec 14, 2012, at 12:17 PM, "Blosch, Edwin L" <edwin.l.blosch_at_[hidden]> wrote:
>
>
> I am having a weird problem launching cases with Open MPI 1.4.3. It is most likely a problem with a particular node of our cluster, as the jobs run fine on some submissions but not on others; it seems to depend on the node list. I am having trouble diagnosing which node is at fault and what the nature of its problem is.
>
> One or perhaps more of the orteds are indicating they cannot find an Intel math library. The error is:
> /release/cfd/openmpi-intel/bin/orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
>
> I've checked the environment just before launching mpirun, and LD_LIBRARY_PATH includes the directory where the Intel shared libraries are located. Furthermore, my mpirun command line exports the LD_LIBRARY_PATH variable:
> Executing ['/release/cfd/openmpi-intel/bin/mpirun', '--machinefile
> /var/spool/PBS/aux/20761.maruhpc4-mgt', '-np 160', '-x
> LD_LIBRARY_PATH', '-x MPI_ENVIRONMENT=1',
> '/tmp/fv420761.maruhpc4-mgt/falconv4_openmpi_jsgl', '-v', '-cycles',
> '10000', '-ri', 'restart.1', '-ro',
> '/tmp/fv420761.maruhpc4-mgt/restart.1']
>
> My shell-initialization script (.bashrc) does not overwrite LD_LIBRARY_PATH. Open MPI was built explicitly --without-torque and should be using ssh to launch the orteds.
>
> What options can I add to get more debugging of problems launching orted?
>
> Thanks,
>
> Ed

_______________________________________________
users mailing list
users_at_[hidden]
http://www.open-mpi.org/mailman/listinfo.cgi/users