Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] question about checkpoint on cluster, mpirun doesn't work on cluster
From: fengguang tian (fernyabc_at_[hidden])
Date: 2010-03-29 11:56:09


Hi,
I solved this problem. Some directories from previous versions had not
been removed from the cluster; after I removed them, it works fine.
Thank you.

cheers
fengguang
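
For readers who hit the same symptom, here is a minimal sketch of one way to look for leftover Open MPI installations on a node. It is not taken from the thread: the function name `find_installations` is hypothetical, and the sketch only assumes that an installation puts executables such as `mpirun` and `orted` in a directory on `PATH`. It does not inspect `LD_LIBRARY_PATH` or session directories, so treat an empty result as a hint rather than proof.

```python
import os


def find_installations(program, search_path):
    """Return each directory on search_path that holds an executable `program`.

    Hypothetical helper: more than one hit for "mpirun" or "orted" suggests
    that an older installation may still be present on this machine.
    """
    hits = []
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            real = os.path.realpath(directory)  # collapse symlinked duplicates
            if real not in hits:
                hits.append(real)
    return hits
```

Calling `find_installations("mpirun", os.environ.get("PATH", ""))` on every machine in the hostfile and comparing the results is one way to spot a node that still carries an older copy.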

On Mon, Mar 29, 2010 at 11:47 AM, Josh Hursey <jjhursey_at_[hidden]> wrote:

> Does this happen when you run without '-am ft-enable-cr' (so a no-C/R run)?
>
> This will help us determine if your problem is with the C/R work or with
> the ORTE runtime. I suspect that there is something odd with your system
> that is confusing the runtime (so not a C/R problem).
>
> Have you made sure to remove the previous versions of Open MPI from all
> machines on your cluster, before installing the new version? Sometimes
> problems like this come up because of mismatches in Open MPI versions on a
> machine.
>
> -- Josh
>
>
> On Mar 23, 2010, at 5:42 PM, fengguang tian wrote:
>
>> I ran into the same problem as the one described at this link:
>> http://www.open-mpi.org/community/lists/users/2009/12/11374.php
>>
>> The solution given there is to use Open MPI v1.4 instead of v1.3.
>> However, I am using Open MPI v1.7a1r22794 and hit the same problem.
>> Here is what I have done. My cluster is composed of two machines,
>> nimbus (master) and nimbus1 (slave). When I run
>>
>>   mpirun -np 40 -am ft-enable-cr --hostfile .mpihostfile myapplication
>>
>> on nimbus, it doesn't work and shows:
>>
>> [nimbus1:21387] opal_os_dirpath_create: Error: Unable to create the
>> sub-directory (/tmp/openmpi-sessions-mpiu_at_nimbus1_0/59759) of
>> (/tmp/openmpi-sessions-mpiu_at_nimbus1_0/59759/0/1), mkdir failed [1]
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file
>> util/session_dir.c at line 106
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file
>> util/session_dir.c at line 399
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file
>> base/ess_base_std_orted.c at line 301
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to
>> be sent to a process whose contact information is unknown in file
>> rml_oob_send.c at line 104
>> [nimbus1:21387] [[59759,0],1] could not get route to [[INVALID],INVALID]
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to
>> be sent to a process whose contact information is unknown in file
>> util/show_help.c at line 602
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file
>> ess_env_module.c at line 143
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to
>> be sent to a process whose contact information is unknown in file
>> rml_oob_send.c at line 104
>> [nimbus1:21387] [[59759,0],1] could not get route to [[INVALID],INVALID]
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to
>> be sent to a process whose contact information is unknown in file
>> util/show_help.c at line 602
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file
>> runtime/orte_init.c at line 129
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to
>> be sent to a process whose contact information is unknown in file
>> rml_oob_send.c at line 104
>> [nimbus1:21387] [[59759,0],1] could not get route to [[INVALID],INVALID]
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to
>> be sent to a process whose contact information is unknown in file
>> util/show_help.c at line 602
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file
>> orted/orted_main.c at line 355
>> --------------------------------------------------------------------------
>> A daemon (pid 10737) died unexpectedly with status 255 while attempting
>> to launch so we are aborting.
>>
>> There may be more information reported by the environment (see above).
>>
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>>
>>
>> cheers
>> fengguang
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
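
The error output quoted in the original message names the session directory that could not be created. As a rough sketch (the function name `failed_session_dirs` and the regular expression are illustrative assumptions, written against the message format shown in this thread only), the paths can be pulled out of a saved log so they can be inspected on the affected node:

```python
import re

# Pattern written against the opal_os_dirpath_create message quoted in this
# thread; \s+ lets it match even when the log has been line-wrapped.
_MKDIR_ERROR = re.compile(
    r"opal_os_dirpath_create: Error: Unable to create the\s+"
    r"sub-directory\s+\((?P<sub>[^)]+)\)\s+of\s+\((?P<full>[^)]+)\),"
    r"\s+mkdir failed"
)


def failed_session_dirs(log_text):
    """Return (sub_directory, full_path) pairs for each mkdir failure found."""
    return [(m.group("sub"), m.group("full"))
            for m in _MKDIR_ERROR.finditer(log_text)]
```

The input is expected to be the raw log text without the email quote markers. Checking whether the reported sub-directory already exists on the node, and who owns it, is a reasonable next step, though the thread itself does not say which directories were removed.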