Open MPI User's Mailing List Archives

Subject: [OMPI users] MPI daemon died unexpectedly
From: Grzegorz Maj (maju3_at_[hidden])
Date: 2012-03-27 04:14:30


Hi,
I have an MPI application that uses ScaLAPACK routines. I run it on
OpenMPI 1.4.3 and use mpirun to launch fewer than 100 processes. I have
been using it quite extensively for almost two years, and it almost
always works fine. However, once every 3-4 months I get the following
error during execution:

--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------

The message says that the daemon died while attempting to launch, but my
application (MPI grid) had been running for about 14 minutes before it
failed; I know this from the log messages it writes during execution.
mpirun printed no further information. The only other thing I know is
that the mpirun exit status was 1, which I guess is not very helpful.
There are no core files.
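
A minimal sketch of one way to gather more evidence the next time this
happens, assuming Python 3 is available on the node where mpirun is
started. It wraps an arbitrary command, keeps its combined stdout/stderr,
and records the start time, elapsed time, and final status. The names
run_and_log and describe_status are hypothetical and not part of Open MPI.
On POSIX systems signal number 1 is SIGHUP, which the helper would name if
the wrapped process itself were killed by that signal.

```python
import signal
import subprocess
import sys
import time


def describe_status(returncode):
    """Turn a subprocess return code into a readable string."""
    if returncode < 0:
        # On POSIX, subprocess reports death-by-signal as a negative code.
        return "killed by " + signal.Signals(-returncode).name
    return "exit status %d" % returncode


def run_and_log(cmd, log_path):
    """Run cmd, appending its output and a timing summary to log_path."""
    start = time.time()
    with open(log_path, "a") as log:
        log.write("started %s: %s\n" % (time.ctime(start), " ".join(cmd)))
        log.flush()  # make sure the header precedes the child's output
        proc = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
        elapsed = time.time() - start
        log.write("finished after %.1f s, %s\n"
                  % (elapsed, describe_status(proc.returncode)))
    return proc.returncode

# Hypothetical usage (the command list is whatever is normally typed):
#   run_and_log(["mpirun", ...], "mpirun_wrapper.log")
```

This only records what the launcher itself reports; it does not explain
why a remote daemon received the signal.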

I would appreciate any suggestions on how to debug this issue.

Regards,
Grzegorz Maj