Open MPI User's Mailing List Archives

Subject: [OMPI users] (no subject)
From: Wong, Wayne (Wayne.Wong_at_[hidden])
Date: 2008-01-24 12:38:59


I'm having some difficulty getting the Open MPI checkpoint/restart fault
tolerance working. I have compiled Open MPI with the "--with-ft=cr"
flag, but when I attempt to run my test program (ring), the
ompi-checkpoint command fails. I have verified that the test program
works fine without the fault tolerance enabled. Here are the details:
 
     [me_at_dev1 ~]$ mpirun -np 4 -am ft-enable-cr ring
     [me_at_dev1 ~]$ ps -efa | grep mpirun
     me 3052 2820 1 08:25 pts/2 00:00:00 mpirun -np 4 -am ft-enable-cr ring
 

     [me_at_dev1 ~]$ ompi-checkpoint 3052
     [dev1.acme.local:03060] [NO-NAME] ORTE_ERROR_LOG: Unknown error: 5854512 in file sds_singleton_module.c at line 50
     [dev1.acme.local:03060] [NO-NAME] ORTE_ERROR_LOG: Unknown error: 5854512 in file runtime/orte_init.c at line 311
 
     --------------------------------------------------------------------------
     It looks like orte_init failed for some reason; your parallel process is
     likely to abort.  There are many reasons that a parallel process can
     fail during orte_init; some of which are due to configuration or
     environment problems.  This failure appears to be an internal failure;
     here's some additional information (which may only be relevant to an
     Open MPI developer):

       orte_sds_base_set_name failed
       --> Returned value Unknown error: 5854512 (5854512) instead of ORTE_SUCCESS

     --------------------------------------------------------------------------

Any help would be appreciated.  Thanks.