Subject: Re: [OMPI users] Problem with checkpointing multihosts, multiprocesses MPI application
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2010-01-11 16:06:00


On Dec 12, 2009, at 10:03 AM, Kritiraj Sajadah wrote:

> Dear All,
> I am trying to checkpoint an MPI application which has two
> processes, each running on a separate host.
>
> I run the application as follows:
>
> raj@sun32:~$ mpirun -am ft-enable-cr -np 2 --hostfile sunhost -mca
> btl ^openib -mca snapc_base_global_snapshot_dir /tmp m.

Try setting the 'snapc_base_global_snapshot_dir' in your
$HOME/.openmpi/mca-params.conf file instead of on the command line.
This way it will be properly picked up by the ompi-restart commands.

See the link below for how to do this:
   http://www.osl.iu.edu/research/ft/ompi-cr/examples.php#uc-ckpt-global
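
For example, a minimal mca-params.conf entry might look like this
(reusing /tmp from your command line; whatever path you choose must be
accessible on every node):

   # $HOME/.openmpi/mca-params.conf
   snapc_base_global_snapshot_dir = /tmp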

>
> and I trigger the checkpoint as follows:
>
> raj@sun32:~$ ompi-checkpoint -v 30010
>
>
> The following happens, displaying two errors while checkpointing the
> application:
>
>
> ##############################################
> I am processor no 0 of a total of 2 procs on host sun32
> I am processor no 1 of a total of 2 procs on host sun06
> I am processorrrrrrrr no 0 of a total of 2 procs on host sun32
> I am processorrrrrrrr no 1 of a total of 2 procs on host sun06
>
> [sun32:30010] Error: expected_component: PID information unavailable!
> [sun32:30010] Error: expected_component: Component Name information
> unavailable!

The only way this error could be generated when checkpointing (versus
restarting) is if the Snapshot Coordinator failed to propagate which
CRS component was used, so it could not be recorded in the snapshot
metadata. If this continues to happen, try enabling debugging in the
snapshot coordinator:
  mpirun -mca snapc_full_verbose 20 ...
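
For example, combined with the rest of your original command line
(assuming the snapshot directory has been moved into mca-params.conf
as suggested above, and that 'm' is your application binary):

   mpirun -am ft-enable-cr -mca snapc_full_verbose 20 -np 2 \
       --hostfile sunhost -mca btl ^openib m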

>
> I am processssssssssssor no 1 of a total of 2 procs on host sun06
> I am processssssssssssor no 0 of a total of 2 procs on host sun32
> bye
> bye
> ############################################
>
>
>
>
> when I try to restart the application from the checkpointed file, I
> get the following:
>
> raj@sun32:~$ ompi-restart ompi_global_snapshot_30010.ckpt
> --------------------------------------------------------------------------
> Error: The filename (opal_snapshot_1.ckpt) is invalid because either
> you have not provided a filename
> or provided an invalid filename.
> Please see --help for usage.
>
> --------------------------------------------------------------------------
> I am processssssssssssor no 0 of a total of 2 procs on host sun32
> bye

This usually indicates one of two things:
  1) The local checkpoint directory (opal_snapshot_1.ckpt) is missing,
meaning the global checkpoint is either corrupted, or the node where
rank 1 resided was not able to access the storage location (/tmp in
your example). A quick check is sketched after this list.
  2) You moved the ompi_global_snapshot_30010.ckpt directory from /tmp
to somewhere else. Currently, manually moving the global checkpoint
directory is not supported.
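
As a quick sanity check, you can list the contents of the global
snapshot before restarting and confirm that a local snapshot directory
exists for every rank. The layout below (with a per-checkpoint sequence
number subdirectory such as 0/) is an assumption based on the 1.3/1.4
series defaults and may differ in your version:

   # on the node that holds the global snapshot in /tmp
   ls /tmp/ompi_global_snapshot_30010.ckpt/0/
   # both opal_snapshot_0.ckpt and opal_snapshot_1.ckpt should be listed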

-- Josh

>
>
> I would very much appreciate it if you could give me some ideas on how
> to checkpoint and restart an MPI application running on multiple hosts.
>
> Thank you
>
> Regards,
>
> Raj
>