Open MPI User's Mailing List Archives

Subject: [OMPI users] checkpointing 2 or more processes running in parallel
From: Jean Potsam (jeanpotsam_at_[hidden])
Date: 2009-08-27 22:24:33


Dear all,

I am trying to checkpoint an MPI application at specific points in my program, so I created a small function as follows:

void mychkpt()
{
    /* Ask the running mpirun (looked up by PID) to checkpoint the job. */
    system("ompi-checkpoint -v `pidof mpirun`");
}

I call it at specific points in my MPI application, e.g.:

##############
printf("I am processor no %d of a total of %d procs \n", rank, size);
system("sleep 6");
mychkpt();
printf("I am processor no %d of a total of %d procs \n", rank, size);
system("sleep 4");
mychkpt();
#############

If I run:

mpirun -am ft-enable-cr -np 1 mpisleepts0

it works fine, but if I use more than one process there is a problem, e.g.:

mpirun -am ft-enable-cr -np 2 mpisleepts0

I get

################
I am processor no 0 of a total of 2 procs
I am processor no 1 of a total of 2 procs
[jean:13673] orte_checkpoint: Checkpointing...
[jean:13673]      PID 13647
[jean:13673]      Connected to Mpirun [[28355,0],0]
[jean:13673] orte_checkpoint: notify_hnp: Contact Head Node Process PID 13647
[jean:13673] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[jean:13673] orte_checkpoint: hnp_receiver: Receive a command message.
[jean:13673] orte_checkpoint: hnp_receiver: Status Update.
[jean:13673]                 Requested - Global Snapshot Reference: (null)
[jean:13673] orte_checkpoint: hnp_receiver: Receive a command message.
[jean:13673] orte_checkpoint: hnp_receiver: Status Update.
[jean:13673]                   Pending - Global Snapshot Reference: (null)
[jean:13673] orte_checkpoint: hnp_receiver: Receive a command message.
[jean:13673] orte_checkpoint: hnp_receiver: Status Update.
[jean:13673]                   Running - Global Snapshot Reference: (null)
[jean:13672] orte_checkpoint: Checkpointing...
[jean:13672]      PID 13647
[jean:13672]      Connected to Mpirun [[28355,0],0]
[jean:13672] orte_checkpoint: notify_hnp: Contact Head Node Process PID 13647
[jean:13672] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[jean:13673] orte_checkpoint: hnp_receiver: Receive a command message.
[jean:13673] orte_checkpoint: hnp_receiver: Status Update.
[jean:13673]             File Transfer - Global Snapshot Reference: (null)
[jean:13673] orte_checkpoint: hnp_receiver: Receive a command message.
[jean:13673] orte_checkpoint: hnp_receiver: Status Update.
[jean:13673]                  Finished - Global Snapshot Reference: ompi_global_snapshot_13647.ckpt
Snapshot Ref.:   0 ompi_global_snapshot_13647.ckpt
^Xmpirun: killing job...
#################

Each process runs the function, so with two processes ompi-checkpoint is invoked twice at the same time, which causes problems.

How can I ensure that the checkpoint is requested only once when more than one process is running?
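One possible direction, given only as an untested sketch, is to let a single rank issue the request and have the others skip it. The helper name mychkpt_once and its parameters are hypothetical. The sketch assumes that rank 0 always reaches the call and that the ranks are synchronized beforehand (for instance with MPI_Barrier). Whether requesting a checkpoint from inside the application is safe with more than one process is not established here.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: only rank 0 runs the checkpoint command.
 * Returns 1 if this caller issued the request, 0 if it skipped it.
 *
 * Assumed call site inside the MPI program (not compiled here):
 *   MPI_Barrier(MPI_COMM_WORLD);   // all ranks reach the same point first
 *   mychkpt_once(rank, "ompi-checkpoint -v `pidof mpirun`");
 */
int mychkpt_once(int rank, const char *cmd)
{
    assert(cmd != NULL);

    if (rank != 0) {
        return 0;               /* non-root ranks do not issue a request */
    }

    /* system() returns -1 if the child process could not be created. */
    if (system(cmd) == -1) {
        perror("system");
    }
    return 1;
}
```

With this shape, every rank calls mychkpt_once() at the same place in the program, but only one ompi-checkpoint process would be started per checkpoint point.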

Please give me some ideas on this.

Thank you

Jean