Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] checkpointing 2 or more processes running in parallel
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2009-09-08 10:06:49


Though I would not recommend your technique for initiating a
checkpoint from an application, it may work. Since ompi-checkpoint
needs to contact and interact with every MPI process, it could
cause problems if the application is blocked in system() while
ompi-checkpoint is trying to interact with that process. Additionally,
if you are using any fork()-sensitive software/hardware (some
high-speed interconnects fall into this category), then calling
system() (which uses fork() on the back end) may cause a variety of
problems, including memory corruption.

That being said, if you have configured Open MPI to use the C/R Fault
Tolerance thread then this may work. You will want to make sure that
only one MPI process in the entire job calls ompi-checkpoint (having
more than one process call it is probably the cause of the problem you
mention below). The rest of the processes can sit in an MPI_Barrier on
the other side of the mychkpt() operation if you want your processes
to wait for the checkpoint to finish before proceeding (though this is
not required). Additionally, the MPI process that calls
ompi-checkpoint will always need to be on the same node as the mpirun
process in order for the ompi-checkpoint command to work.

Give that a try and let me know if it helps.

As a side note, I have an API for initiating a checkpoint operation
through Open MPI's Extensions interface. It is nearly ready, and will
probably be available on the Open MPI trunk in the next couple of
months. I'll post to the list when it is available if you want to give
that a try.

-- Josh

On Aug 27, 2009, at 10:24 PM, Jean Potsam wrote:

> Dear all,
> I am trying to checkpoint an MPI application at
> specific points in my program. So, I created a small function as
> follows:
>
> void mychkpt()
> {
> system ("ompi-checkpoint -v `pidof mpirun`");
> }
>
> and I am calling it in my MPI application at specific points, e.g.
>
> ##############
> printf("I am processor no %d of a total of %d procs \n", rank, size);
> system("sleep 6");
> mychkpt();
> printf("I am processor no %d of a total of %d procs \n", rank, size);
> system("sleep 4");
> mychkpt();
> #############
>
> If I do:
> mpirun -am ft-enable-cr -np 1 mpisleepts0
>
> it works fine, but if I use more than 1 process there is a problem, e.g.
>
> mpirun -am ft-enable-cr -np 2 mpisleepts0
>
> I get
>
> ################
> I am processor no 0 of a total of 2 procs
> I am processor no 1 of a total of 2 procs
> [jean:13673] orte_checkpoint: Checkpointing...
> [jean:13673] PID 13647
> [jean:13673] Connected to Mpirun [[28355,0],0]
> [jean:13673] orte_checkpoint: notify_hnp: Contact Head Node Process
> PID 13647
> [jean:13673] orte_checkpoint: notify_hnp: Requested a checkpoint of
> jobid [INVALID]
> [jean:13673] orte_checkpoint: hnp_receiver: Receive a command message.
> [jean:13673] orte_checkpoint: hnp_receiver: Status Update.
> [jean:13673] Requested - Global Snapshot Reference:
> (null)
> [jean:13673] orte_checkpoint: hnp_receiver: Receive a command message.
> [jean:13673] orte_checkpoint: hnp_receiver: Status Update.
> [jean:13673] Pending - Global Snapshot Reference:
> (null)
> [jean:13673] orte_checkpoint: hnp_receiver: Receive a command message.
> [jean:13673] orte_checkpoint: hnp_receiver: Status Update.
> [jean:13673] Running - Global Snapshot Reference:
> (null)
> [jean:13672] orte_checkpoint: Checkpointing...
> [jean:13672] PID 13647
> [jean:13672] Connected to Mpirun [[28355,0],0]
> [jean:13672] orte_checkpoint: notify_hnp: Contact Head Node Process
> PID 13647
> [jean:13672] orte_checkpoint: notify_hnp: Requested a checkpoint of
> jobid [INVALID]
> [jean:13673] orte_checkpoint: hnp_receiver: Receive a command message.
> [jean:13673] orte_checkpoint: hnp_receiver: Status Update.
> [jean:13673] File Transfer - Global Snapshot Reference:
> (null)
> [jean:13673] orte_checkpoint: hnp_receiver: Receive a command message.
> [jean:13673] orte_checkpoint: hnp_receiver: Status Update.
> [jean:13673] Finished - Global Snapshot Reference:
> ompi_global_snapshot_13647.ckpt
> Snapshot Ref.: 0 ompi_global_snapshot_13647.ckpt
> ^Xmpirun: killing job...
> #################
>
> It runs the function twice simultaneously, which tries to start the
> checkpointing process twice, thus causing problems.
>
> How can I ensure that the checkpointing process is called only once
> when there is more than one process running?
>
> Please give me some ideas on it.
>
> Thank you
>
> Jean