
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Open MPI 1.3rc2: Restarting from a checkpoint
From: Gregor Dschung (gregor.dschung_at_[hidden])
Date: 2009-01-07 11:13:17


Yeah!

It's working fine now. I had just forgotten to share the home directories,
where the checkpoint is written, between both hosts.
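
A quick way to confirm this cause of the "filename ... is invalid" errors is to check that the snapshot path exists on every host before calling ompi-restart. The following is only a minimal sketch and not part of Open MPI: check_snapshot_visible and the REMOTE_SH variable are hypothetical names, and it assumes passwordless ssh to each host and that the snapshot is expected at the same path everywhere.

```shell
#!/bin/sh
# Hypothetical helper (not an Open MPI tool): report, per host, whether
# a snapshot path exists there. REMOTE_SH is an assumption of this
# sketch; it defaults to ssh and can be overridden for testing.
check_snapshot_visible() {
    snapshot=$1
    shift
    missing=0
    for host in "$@"; do
        # Run "test -e PATH" on the remote host; exit status 0 means it exists.
        if ${REMOTE_SH:-ssh} "$host" test -e "$snapshot"; then
            echo "$host: ok"
        else
            echo "$host: MISSING $snapshot"
            missing=1
        fi
    done
    # Return 0 only if every host can see the snapshot.
    return "$missing"
}
```

With the names from this thread, it could be called as: check_snapshot_visible /home/demo/ompi_global_snapshot_8637.ckpt dschungsles10-1 dschungsles10-2. A non-zero exit status would point to a host where the home directory is not shared.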

 -Gregor
> Hi,
>
> First, my resources: I have two SLES10 machines with Open MPI 1.3rc2
> installed. It was configured with ./configure --prefix=/usr/local
> --with-ft=cr --enable-ft-thread --enable-mpi-threads. I've also installed
> BLCR 0.7.3. The hosts are called dschungsles10-1 and
> dschungsles10-2. My MPI apps are located in /srv/mpi/ on
> dschungsles10-1, which is also exported via NFS to dschungsles10-2.
>
> I'm able to restart an MPI application a.out from a checkpoint if I use
> only one host (mpirun -np 4 -am ft-enable-cr a.out).
>
> Now I'm trying to restart an application that I started across two
> hosts. Taking the snapshot works fine:
>
> demo_at_dschungsles10-1:~> ps aux | grep mpirun
> demo 8637 27.8 0.0 33364 2308 pts/2 R+ 16:06 0:02 mpirun -np 4 -am ft-enable-cr -host dschungsles10-2 -v a.out
> demo 8658 0.0 0.0 2736 480 pts/3 R+ 16:07 0:00 grep mpirun
> demo_at_dschungsles10-1:~> ompi-checkpoint -v -s 8637
> [dschungsles10-1:08661] orte_checkpoint: Checkpointing...
> [dschungsles10-1:08661] PID 8637
> [dschungsles10-1:08661] Connected to Mpirun [[417,0],0]
> [dschungsles10-1:08661] orte_checkpoint: notify_hnp: Contact Head Node Process PID 8637
> [dschungsles10-1:08661] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
> [dschungsles10-1:08661] orte_checkpoint: hnp_receiver: Receive a command message.
> [dschungsles10-1:08661] orte_checkpoint: hnp_receiver: Status Update.
> [dschungsles10-1:08661] Requested - Global Snapshot Reference: (null)
> [dschungsles10-1:08661] orte_checkpoint: hnp_receiver: Receive a command message.
> [dschungsles10-1:08661] orte_checkpoint: hnp_receiver: Status Update.
> [dschungsles10-1:08661] Pending - Global Snapshot Reference: (null)
> [dschungsles10-1:08661] orte_checkpoint: hnp_receiver: Receive a command message.
> [dschungsles10-1:08661] orte_checkpoint: hnp_receiver: Status Update.
> [dschungsles10-1:08661] Running - Global Snapshot Reference: (null)
> [dschungsles10-1:08661] orte_checkpoint: hnp_receiver: Receive a command message.
> [dschungsles10-1:08661] orte_checkpoint: hnp_receiver: Status Update.
> [dschungsles10-1:08661] File Transfer - Global Snapshot Reference: (null)
> [dschungsles10-1:08661] orte_checkpoint: hnp_receiver: Receive a command message.
> [dschungsles10-1:08661] orte_checkpoint: hnp_receiver: Status Update.
> [dschungsles10-1:08661] Finished - Global Snapshot Reference: ompi_global_snapshot_8637.ckpt
> Snapshot Ref.: 0 ompi_global_snapshot_8637.ckpt
>
> But restarting doesn't work:
>
> demo_at_dschungsles10-1:~> ompi-restart -v ompi_global_snapshot_8637.ckpt
> [dschungsles10-1:08687] Checking for the existence of (/home/demo/ompi_global_snapshot_8637.ckpt)
> [dschungsles10-1:08687] Restarting from file (ompi_global_snapshot_8637.ckpt)
> [dschungsles10-1:08687] Exec in self
> Password:
> --------------------------------------------------------------------------
> Error: The filename (opal_snapshot_0.ckpt) is invalid because either you
> have not provided a filename
> or provided an invalid filename.
> Please see --help for usage.
>
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> Error: The filename (opal_snapshot_1.ckpt) is invalid because either you
> have not provided a filename
> or provided an invalid filename.
> Please see --help for usage.
>
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> Error: The filename (opal_snapshot_2.ckpt) is invalid because either you
> have not provided a filename
> or provided an invalid filename.
> Please see --help for usage.
>
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> Error: The filename (opal_snapshot_3.ckpt) is invalid because either you
> have not provided a filename
> or provided an invalid filename.
> Please see --help for usage.
>
> --------------------------------------------------------------------------
>
> Perhaps somebody has a few ideas...
>
> -Gregor

-- 
Gregor Dschung
System Life Guard, HiWi
Fraunhofer-Institut für Techno-
und Wirtschaftsmathematik ITWM
Fraunhofer-Platz 1
D-67663 Kaiserslautern
E-Mail:   gregor.dschung_at_[hidden]
Internet: www.itwm.fraunhofer.de