Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI checkpoint/restart on multiple nodes
From: Joshua Hursey (jjhursey_at_[hidden])
Date: 2010-02-08 08:54:01


You can checkpoint and restart without access to a globally shared storage device. The 'checkpoint to local disk' example on the website does not use a globally mounted file system:
  http://www.osl.iu.edu/research/ft/ompi-cr/examples.php#uc-ckpt-local

What version of Open MPI are you using? This functionality is known to be broken on the v1.3/1.4 branches, per the ticket below:
  https://svn.open-mpi.org/trac/ompi/ticket/2139

Try the nightly snapshot of the 1.5 branch or the development trunk, and see if this issue still occurs.

-- Josh
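
Editorial sketch, not part of the original reply: the three MCA parameters quoted in the message below can be kept in a per-user parameter file, which Open MPI conventionally reads from $HOME/.openmpi/mca-params.conf. The Python helper below only renders those three name=value lines and writes them to a path of your choosing. The helper names are hypothetical, the values are copied from the quoted message, and writing this file does not by itself work around the ticket referenced above.

```python
from pathlib import Path

# Values copied verbatim from the quoted message; adjust paths for your cluster.
MCA_PARAMS = {
    "snapc_base_store_in_place": "0",
    "crs_base_snapshot_dir": "/home/andreea/checkpoints/local",
    "snapc_base_global_snapshot_dir": "/home/andreea/checkpoints/global",
}


def render_mca_params(params):
    """Return one 'name=value' line per parameter, in insertion order."""
    return "".join(f"{name}={value}\n" for name, value in params.items())


def write_mca_params(params, path):
    """Write the rendered parameters to path, creating parent directories."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(render_mca_params(params))
    return path


if __name__ == "__main__":
    # Print the fragment instead of touching the real per-user file.
    print(render_mca_params(MCA_PARAMS), end="")
```

To install the fragment for real, one would call write_mca_params(MCA_PARAMS, Path.home() / ".openmpi" / "mca-params.conf") on each node. Note that this overwrites any existing file at that path.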

On Feb 8, 2010, at 8:35 AM, Andreea Costea wrote:

> I asked this question because checkpointing to NFS is successful, but checkpointing without a mounted filesystem or shared storage throws this warning and error:
>
> WARNING: Could not preload specified file: File already exists.
> Fileset: /home/andreea/checkpoints/global/ompi_global_snapshot_7426.ckpt/0
> Host: X
>
> Will continue attempting to launch the process.
>
>
> filem:rsh: wait_all(): Wait failed (-1)
> [[62871,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 1054
>
> even if I set the mca-parameters like this:
> snapc_base_store_in_place=0
> crs_base_snapshot_dir=/home/andreea/checkpoints/local
> snapc_base_global_snapshot_dir=/home/andreea/checkpoints/global
>
> and the nodes can connect through ssh without a password.
>
> Thanks,
> Andreea
>
> On Mon, Feb 8, 2010 at 12:59 PM, Andreea Costea <andre.costea_at_[hidden]> wrote:
> Hi,
>
> Let's say I have an MPI application running on several hosts. Is there any way to checkpoint this application without shared storage between the nodes?
> I already took a look at the examples here http://www.osl.iu.edu/research/ft/ompi-cr/examples.php, but it seems that in both cases there is a globally mounted file system.
>
> Thanks,
> Andreea