Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] question about checkpoint on cluster, mpirun doesn't work on cluster
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2010-03-29 11:47:12


Does this happen when you run without '-am ft-enable-cr' (so a no-C/R
run)?

This will help us determine if your problem is with the C/R work or
with the ORTE runtime. I suspect that there is something odd with your
system that is confusing the runtime (so not a C/R problem).

Have you made sure to remove the previous versions of Open MPI from
all machines on your cluster, before installing the new version?
Sometimes problems like this come up because of mismatches in Open MPI
versions on a machine.
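
One rough way to check for a version mismatch is to ask each node which Open MPI it reports and compare the answers. The sketch below assumes passwordless ssh to every host and that `ompi_info` is on the PATH of a non-interactive shell on each node; the function names `collect_versions` and `same_version` are made up for this example and are not part of Open MPI.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Open MPI) for spotting version mismatches.

# Print "<host> <version>" for each host given as an argument.
# Assumes passwordless ssh and that ompi_info is on the remote PATH.
collect_versions() {
    for host in "$@"; do
        # ompi_info prints a line of the form "Open MPI: <version>"
        ver=$(ssh "$host" ompi_info </dev/null 2>/dev/null |
              awk -F': *' '/^ *Open MPI:/ { print $2; exit }')
        printf '%s %s\n' "$host" "${ver:-unknown}"
    done
}

# Read "<host> <version>" lines on stdin; succeed only if every line
# carries the same version string (and there is at least one line).
same_version() {
    awk '{ seen[$2] = 1 }
         END { n = 0; for (v in seen) n++; exit (n == 1 ? 0 : 1) }'
}

# Example use, with the hostnames from the original report:
#   collect_versions nimbus nimbus1 | tee /dev/stderr | same_version \
#       || echo "Open MPI versions differ (or could not be read)"
```

A host that shows up as `unknown` usually means `ompi_info` was not found over ssh, which is itself worth investigating, since a remote daemon would be launched in a similar environment.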

-- Josh

On Mar 23, 2010, at 5:42 PM, fengguang tian wrote:

> I ran into the same problem described at this link:
> http://www.open-mpi.org/community/lists/users/2009/12/11374.php
>
> In that thread, the suggested solution is to use Open MPI v1.4 instead
> of v1.3. However, I am using Open MPI v1.7a1r22794 and still see the
> same problem.
> Here is what I have done:
> My cluster consists of two machines, nimbus (master) and nimbus1
> (slave). When I run
>
>   mpirun -np 40 -am ft-enable-cr --hostfile .mpihostfile myapplication
>
> on nimbus, it doesn't work. It shows:
>
> [nimbus1:21387] opal_os_dirpath_create: Error: Unable to create the sub-directory (/tmp/openmpi-sessions-mpiu_at_nimbus1_0/59759) of (/tmp/openmpi-sessions-mpiu_at_nimbus1_0/59759/0/1), mkdir failed [1]
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 106
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 399
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file base/ess_base_std_orted.c at line 301
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 104
> [nimbus1:21387] [[59759,0],1] could not get route to [[INVALID],INVALID]
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file util/show_help.c at line 602
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file ess_env_module.c at line 143
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 104
> [nimbus1:21387] [[59759,0],1] could not get route to [[INVALID],INVALID]
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file util/show_help.c at line 602
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 129
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 104
> [nimbus1:21387] [[59759,0],1] could not get route to [[INVALID],INVALID]
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file util/show_help.c at line 602
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file orted/orted_main.c at line 355
> --------------------------------------------------------------------------
> A daemon (pid 10737) died unexpectedly with status 255 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
>
>
> cheers
> fengguang