Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Trouble with fault tolerance checkpointing
From: Wong, Wayne (Wayne.Wong_at_[hidden])
Date: 2008-01-29 10:26:46

We have the checkpoint/restart working now. It turns out that the BLCR
kernel modules were installed incorrectly.

Thanks for the help.


-----Original Message-----
From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
Behalf Of Josh Hursey
Sent: Monday, January 28, 2008 6:57 PM
To: Open MPI Users
Subject: Re: [OMPI users] (no subject)

I'm unable to reproduce this problem. :( I tried both the svn head
(r17288) and the tarball that you were using (openmpi-1.3a1r17175) on a
similar system without problem.

The error you are seeing may be caused by old connectivity information
in the session directory. You may want to make sure that /tmp does not
contain any "openmpi-session*" directories before starting mpirun.
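
As a rough sketch of that cleanup: the helper name clean_ompi_sessions is made up for illustration, and this assumes none of your other Open MPI jobs are still running on the node, since their session directories would be removed as well.

```shell
# Hypothetical helper (not part of Open MPI): remove stale session
# directories matching "openmpi-session*" from a directory, /tmp by default.
# Assumption: no other Open MPI jobs of yours are running on this node.
clean_ompi_sessions() {
    dir="${1:-/tmp}"
    for d in "$dir"/openmpi-session*; do
        # Skip the unexpanded pattern when nothing matches, and any non-directories.
        [ -d "$d" ] || continue
        echo "removing $d"
        rm -rf -- "$d"
    done
}
```

Calling clean_ompi_sessions with no argument checks /tmp; run it before starting mpirun again.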

Other than that you may want to try a clean build of Open MPI just to
make sure that you are not seeing anything odd resulting from old Open
MPI install files.

Let me know if that helps.

-- Josh

On Jan 24, 2008, at 12:38 PM, Wong, Wayne wrote:

> I'm having some difficulty getting the Open MPI checkpoint/restart
> fault tolerance working. I have compiled Open MPI with the
> "--with-ft=cr" flag, but when I attempt to run my test program (ring),
> the ompi-checkpoint command fails. I have verified that the test
> program works fine without the fault tolerance enabled. Here are the
> details:
> [me_at_dev1 ~]$ mpirun -np 4 -am ft-enable-cr ring
> [me_at_dev1 ~]$ ps -efa | grep mpirun
> me 3052 2820 1 08:25 pts/2 00:00:00 mpirun -np 4 -am ft-enable-cr ring
> [me_at_dev1 ~]$ ompi-checkpoint 3052
> [dev1.acme.local:03060] [NO-NAME] ORTE_ERROR_LOG: Unknown error: 5854512 in file sds_singleton_module.c at line 50
> [dev1.acme.local:03060] [NO-NAME] ORTE_ERROR_LOG: Unknown error: 5854512 in file runtime/orte_init.c at line 311
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> orte_sds_base_set_name failed
> --> Returned value Unknown error: 5854512 (5854512) instead of
> --------------------------------------------------------------------------
> Any help would be appreciated. Thanks.
> <ompi_info.txt.gz><config.log.gz>