On Apr 20, 2009, at 9:29 PM, ESTEBAN MENESES ROJAS wrote:
> Is there any way to automatically checkpoint/restart an
> application in OpenMPI? This is, checkpointing the application
> without using the command ompi-checkpoint, perhaps via a function
> call in the application's code itself. The same with the restart
> after a failure.
Currently, Open MPI only supports checkpointing an application with
the ompi-checkpoint command and restarting it with the ompi-restart
command. We do not expose a function call that lets the application
start the checkpoint operation internally.
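
Until such a function call exists, one way to approximate automatic
checkpointing is to drive the command-line tools from an external
script. The following is only a rough sketch, not something shipped
with Open MPI. It assumes the job is launched with the
"-am ft-enable-cr" option and that ompi-checkpoint is given the PID of
mpirun, as described in the 1.3 series checkpoint/restart
documentation. The helper names (checkpoint_command, restart_command,
run_with_periodic_checkpoints) and the launch/run parameters are
hypothetical and exist only for illustration and testing.

```python
import subprocess


def checkpoint_command(mpirun_pid):
    """Build the ompi-checkpoint invocation for a running mpirun (by PID)."""
    return ["ompi-checkpoint", str(mpirun_pid)]


def restart_command(snapshot_handle):
    """Build the ompi-restart invocation for a global snapshot reference."""
    return ["ompi-restart", snapshot_handle]


def run_with_periodic_checkpoints(app_argv, interval_s,
                                  launch=subprocess.Popen,
                                  run=subprocess.call):
    """Launch an MPI job and request a checkpoint every interval_s seconds.

    app_argv holds the arguments that follow mpirun, for example
    ["-np", "4", "./my-app"]. Returns (exit code of mpirun, number of
    checkpoint requests that reported success).
    """
    # Assumption: "-am ft-enable-cr" enables checkpoint/restart support.
    job = launch(["mpirun", "-am", "ft-enable-cr"] + list(app_argv))
    taken = 0
    while True:
        try:
            # Wait up to one interval for the job to finish on its own.
            return job.wait(timeout=interval_s), taken
        except subprocess.TimeoutExpired:
            # Job is still running: ask for a checkpoint, then keep waiting.
            if run(checkpoint_command(job.pid)) == 0:
                taken += 1
```

For example, run_with_periodic_checkpoints(["-np", "4", "./my-app"], 600)
would request a checkpoint roughly every ten minutes. Restarting after
a failure is still a manual step in this sketch: you would run the
command built by restart_command() with the global snapshot reference
that ompi-checkpoint reported.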
On a temporary branch, I developed an interface as part of a proposal
to the MPI Forum. It works for a coordinated checkpoint (all processes
must call the function, similar to a barrier). In its current state, it
is not ready to come to the trunk, since some of the support structure
is missing and I am still working on it.
This branch does not expose an interface to restart a process. What
that interface should look like quickly becomes a much more difficult
question. If you have ideas on the interface signature and semantics,
I would be interested in hearing about them.
> On a related note, what is the default behavior of an OpenMPI
> application after one process fails? Does the runtime shut down the
> whole application?
If a process fails, Open MPI will, by default, terminate the whole
application. A couple of the core development teams are working on
alternative failure modes, but I do not think any of this work has
made it to the development trunk yet.
> Thanks.