Open MPI User's Mailing List Archives

Subject: [OMPI users] question about checkpoint on cluster, mpirun doesn't work on cluster
From: fengguang tian (fernyabc_at_[hidden])
Date: 2010-03-23 17:42:52


I ran into the same problem described in this thread:
http://www.open-mpi.org/community/lists/users/2009/12/11374.php

The solution given in that thread is to use Open MPI v1.4 instead of v1.3.
However, I am using Open MPI v1.7a1r22794 and I still hit the same problem.
Here is what I have done.
My cluster consists of two machines: nimbus (master) and nimbus1 (slave). When
I run mpirun -np 40 -am ft-enable-cr --hostfile .mpihostfile myapplication
on nimbus, it does not work and prints:

[nimbus1:21387] opal_os_dirpath_create: Error: Unable to create the
sub-directory (/tmp/openmpi-sessions-mpiu_at_nimbus1_0/59759) of
(/tmp/openmpi-sessions-mpiu_at_nimbus1_0/59759/0/1), mkdir failed [1]
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file
util/session_dir.c at line 106
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file
util/session_dir.c at line 399
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file
base/ess_base_std_orted.c at line 301
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be
sent to a process whose contact information is unknown in file
rml_oob_send.c at line 104
[nimbus1:21387] [[59759,0],1] could not get route to [[INVALID],INVALID]
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be
sent to a process whose contact information is unknown in file
util/show_help.c at line 602
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file ess_env_module.c
at line 143
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be
sent to a process whose contact information is unknown in file
rml_oob_send.c at line 104
[nimbus1:21387] [[59759,0],1] could not get route to [[INVALID],INVALID]
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be
sent to a process whose contact information is unknown in file
util/show_help.c at line 602
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file
runtime/orte_init.c at line 129
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be
sent to a process whose contact information is unknown in file
rml_oob_send.c at line 104
[nimbus1:21387] [[59759,0],1] could not get route to [[INVALID],INVALID]
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be
sent to a process whose contact information is unknown in file
util/show_help.c at line 602
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file
orted/orted_main.c at line 355
--------------------------------------------------------------------------
A daemon (pid 10737) died unexpectedly with status 255 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
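
The first error in the log is a failed mkdir for the session directory under /tmp on nimbus1, so one thing that could be checked is whether a nested directory can be created there at all. The following is only a rough sketch of such a check, not something taken from Open MPI: the function name check_session_dir and the probe directory name are made up for illustration, and it assumes the base directory is /tmp (or $TMPDIR if set).

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): try to create a nested
# directory under a base directory, similar in shape to <base>/<job>/0/1,
# and report whether mkdir succeeded.
check_session_dir() {
    base="$1"                          # base directory to test, e.g. /tmp
    probe="$base/ompi-probe-$$"        # made-up probe name, unique per shell PID
    if mkdir -p "$probe/0/1" 2>/dev/null; then
        rm -rf "$probe"                # clean up the probe tree again
        echo "ok: can create directories under $base"
        return 0
    fi
    echo "FAILED: cannot create directories under $base"
    return 1
}

# Test the default temporary directory; the printed line carries the result.
check_session_dir "${TMPDIR:-/tmp}" || true
```

If the script is saved as, say, check.sh (a name chosen here only as an example), it could be run on the slave node as the same user that launches mpirun, for example with ssh nimbus1 sh -s < check.sh. Listing any existing /tmp/openmpi-sessions-* directories on nimbus1 with ls -ld would also show whether a stale directory owned by another user is in the way.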

cheers
fengguang