Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Problem with checkpointing multihosts, multiprocesses MPI application
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2010-01-11 16:06:00


On Dec 12, 2009, at 10:03 AM, Kritiraj Sajadah wrote:

> Dear All,
> I am trying to checkpoint an MPI application which has two
> processes, each running on a separate host.
>
> I run the application as follows:
>
> raj_at_sun32:~$ mpirun -am ft-enable-cr -np 2 --hostfile sunhost -mca
> btl ^openib -mca snapc_base_global_snapshot_dir /tmp m.

Try setting the 'snapc_base_global_snapshot_dir' in your
$HOME/.openmpi/mca-params.conf file instead of on the command line.
This way it will be properly picked up by the ompi-restart commands.

See the link below for how to do this:
   http://www.osl.iu.edu/research/ft/ompi-cr/examples.php#uc-ckpt-global
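
For reference, the file just needs a key = value line; a minimal sketch,
using the /tmp location from your command (adjust the path if your
snapshots live elsewhere):

   # $HOME/.openmpi/mca-params.conf
   snapc_base_global_snapshot_dir = /tmp

With that in place, mpirun and ompi-restart both read the same location,
so the -mca flag can be dropped from the command line.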

>
> and I trigger the checkpoint as follows:
>
> raj_at_sun32:~$ ompi-checkpoint -v 30010
>
>
> The following happens, displaying two errors while checkpointing the
> application:
>
>
> ##############################################
> I am processor no 0 of a total of 2 procs on host sun32
> I am processor no 1 of a total of 2 procs on host sun06
> I am processorrrrrrrr no 0 of a total of 2 procs on host sun32
> I am processorrrrrrrr no 1 of a total of 2 procs on host sun06
>
> [sun32:30010] Error: expected_component: PID information unavailable!
> [sun32:30010] Error: expected_component: Component Name information
> unavailable!

The only way this error could be generated when checkpointing (versus
restarting) is if the Snapshot Coordinator failed to propagate the CRS
component used so that it could be stored in the metadata. If this
continues to happen, try enabling debugging in the snapshot coordinator:
  mpirun -mca snapc_full_verbose 20 ...
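
For example, folded into your original command line (binary name and
hostfile taken from your example), that would look roughly like:

   mpirun -am ft-enable-cr -mca snapc_full_verbose 20 -np 2 \
          --hostfile sunhost -mca btl ^openib m

The verbose output should indicate which CRS component the coordinator
selects while the checkpoint is being taken.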

>
> I am processssssssssssor no 1 of a total of 2 procs on host sun06
> I am processssssssssssor no 0 of a total of 2 procs on host sun32
> bye
> bye
> ############################################
>
>
>
>
> When I try to restart the application from the checkpoint file, I
> get the following:
>
> raj_at_sun32:~$ ompi-restart ompi_global_snapshot_30010.ckpt
> --------------------------------------------------------------------------
> Error: The filename (opal_snapshot_1.ckpt) is invalid because either
> you have not provided a filename
> or provided an invalid filename.
> Please see --help for usage.
>
> --------------------------------------------------------------------------
> I am processssssssssssor no 0 of a total of 2 procs on host sun32
> bye

This usually indicates one of two things:
  1) The local checkpoint directory (opal_snapshot_1.ckpt) is missing,
meaning the global checkpoint is either corrupted, or the node where
rank 1 resided was not able to access the storage location (/tmp in
your example).
  2) You moved the ompi_global_snapshot_30010.ckpt directory from /tmp
to somewhere else. Currently, manually moving the global checkpoint
directory is not supported.
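
A quick way to tell these two cases apart (paths taken from your run)
is to list the global snapshot directory on the node where you invoke
ompi-restart, and confirm that an opal_snapshot_*.ckpt directory is
present for every rank:

   ls -R /tmp/ompi_global_snapshot_30010.ckpt

If opal_snapshot_1.ckpt only exists under /tmp on sun06, the ranks
wrote to node-local storage rather than a shared location, which
matches case 1 above.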

-- Josh

>
>
> I would very much appreciate it if you could give me some ideas on
> how to checkpoint and restart an MPI application running on multiple
> hosts.
>
> Thank you
>
> Regards,
>
> Raj