
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] How to checkpoint atomic function in OpenMPI
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2010-07-16 17:34:59

On Jun 14, 2010, at 5:26 AM, Nguyen Toan wrote:

> Hi all,
> I have an MPI program as follows:
> -------------------
> int main(int argc, char **argv){
> MPI_Init(&argc, &argv);
> ......
> for (int i=0; i<10000; i++) {
> my_atomic_func();
> }
> ...
> MPI_Finalize();
> return 0;
> }
> --------------------
> Most of this program's runtime is spent in the loop, and each call to my_atomic_func() takes a fairly long time.
> I want my_atomic_func() to execute atomically, but a checkpoint (taken by running the ompi-checkpoint command) may occur in the middle of a my_atomic_func() call, in which case ompi-restart may fail to restart correctly.
> So my question is:
> + When a checkpoint is requested (by executing ompi-checkpoint), is there a way to make Open MPI wait until my_atomic_func() has finished?

We do not currently have an external function to declare a critical section during which a checkpoint should not be taken. I filed a ticket to make one available. The link is below if you would like to follow its progress:

I have an MPI Extension interface for C/R that I will be bringing into the trunk in the next few weeks. I should be able to extend it to include this feature, but I can't promise a deadline; I will update the ticket when it is available.

In the meantime, you might try using the BLCR interface to define critical sections. If you are using the C/R thread, this may work (though I have not tried it):

> + How does ompi-checkpoint checkpoint the threads of an MPI process?

We depend on the Checkpoint/Restart Service (e.g., BLCR) to capture the whole process image including all threads. So BLCR will capture the state of all threads when we take the process checkpoint.

-- Josh

> Regards,
> Nguyen Toan