I'm having some difficulty getting the Open MPI checkpoint/restart fault tolerance working.  I compiled Open MPI with the "--with-ft=cr" flag, but when I run my test program (ring) and then try to checkpoint it, the ompi-checkpoint command fails.  I have verified that the test program works fine without fault tolerance enabled.  Here are the details:
 
     [me@dev1 ~]$ mpirun -np 4 -am ft-enable-cr ring
     [me@dev1 ~]$ ps -efa | grep mpirun
     me     3052  2820  1 08:25 pts/2    00:00:00 mpirun -np 4 -am ft-enable-cr ring
 

     [me@dev1 ~]$ ompi-checkpoint 3052
     [dev1.acme.local:03060] [NO-NAME] ORTE_ERROR_LOG: Unknown error: 5854512 in file sds_singleton_module.c at line 50
     [dev1.acme.local:03060] [NO-NAME] ORTE_ERROR_LOG: Unknown error: 5854512 in file runtime/orte_init.c at line 311
     --------------------------------------------------------------------------
     It looks like orte_init failed for some reason; your parallel process is
     likely to abort.  There are many reasons that a parallel process can
     fail during orte_init; some of which are due to configuration or
     environment problems.  This failure appears to be an internal failure;
     here's some additional information (which may only be relevant to an
     Open MPI developer):
 
       orte_sds_base_set_name failed
       --> Returned value Unknown error: 5854512 (5854512) instead of ORTE_SUCCESS
 
     --------------------------------------------------------------------------

Any help would be appreciated.  Thanks.