Open MPI User's Mailing List Archives

Subject: [OMPI users] Problem with checkpointing multihosts, multiprocesses MPI application
From: Kritiraj Sajadah (ksajadah_at_[hidden])
Date: 2009-12-12 10:03:10


Dear All,
I am trying to checkpoint an MPI application that has two processes, each running on a separate host.

I run the application as follows:

raj@sun32:~$ mpirun -am ft-enable-cr -np 2 --hostfile sunhost -mca btl ^openib -mca snapc_base_global_snapshot_dir /tmp m.

and I trigger the checkpoint as follows:

raj@sun32:~$ ompi-checkpoint -v 30010

The following happens, with two errors displayed while checkpointing the application:

##############################################
I am processor no 0 of a total of 2 procs on host sun32
I am processor no 1 of a total of 2 procs on host sun06
I am processorrrrrrrr no 0 of a total of 2 procs on host sun32
I am processorrrrrrrr no 1 of a total of 2 procs on host sun06

[sun32:30010] Error: expected_component: PID information unavailable!
[sun32:30010] Error: expected_component: Component Name information unavailable!

I am processssssssssssor no 1 of a total of 2 procs on host sun06
I am processssssssssssor no 0 of a total of 2 procs on host sun32
bye
bye
############################################

When I try to restart the application from the checkpoint file, I get the following:

raj@sun32:~$ ompi-restart ompi_global_snapshot_30010.ckpt
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_1.ckpt) is invalid because either you have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.

--------------------------------------------------------------------------
I am processssssssssssor no 0 of a total of 2 procs on host sun32
bye

I would very much appreciate it if you could give me some ideas on how to checkpoint and restart an MPI application running on multiple hosts.

Thank you

Regards,

Raj