I'm having some difficulty getting Open MPI's checkpoint/restart fault
tolerance to work. I compiled Open MPI with the "--with-ft=cr" flag,
but when I run my test program (ring) and then try to checkpoint it,
the ompi-checkpoint command fails. I have verified that the test program
works fine without fault tolerance enabled. Here are the details:
[me_at_dev1 ~]$ mpirun -np 4 -am ft-enable-cr ring
[me_at_dev1 ~]$ ps -efa | grep mpirun
me 3052 2820 1 08:25 pts/2 00:00:00 mpirun -np 4 -am ft-enable-cr ring
[me_at_dev1 ~]$ ompi-checkpoint 3052
[dev1.acme.local:03060] [NO-NAME] ORTE_ERROR_LOG: Unknown error: 5854512 in file sds_singleton_module.c at line 50
[dev1.acme.local:03060] [NO-NAME] ORTE_ERROR_LOG: Unknown error: 5854512 in file runtime/orte_init.c at line 311
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_sds_base_set_name failed
--> Returned value Unknown error: 5854512 (5854512) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
Any help would be appreciated. Thanks.