I know that in the past checkpoint/restart has been supported via
toolkits like BLCR, but to be honest I don't know the current level of
support. I think I heard somewhere that the checkpoint/restart support
in Open MPI was going away in some fashion.
In any case, if you have the ability to set up application-aware,
application-specific checkpointing, it will be a much better solution
than something that's application-agnostic. The checkpoint files will
be smaller (the application knows what in memory is important and what
isn't), coordination between processes will be better, and you have
some level of assurance that you won't hit PID conflicts or other
problems when a process comes back with a different PID after restart.
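
To make that a bit more concrete, here is a minimal sketch of what
application-internal checkpointing can look like. It's plain Python with
no MPI calls, and every name in it (save_checkpoint, load_checkpoint,
run, the "step"/"total" state, max_steps_this_job) is made up for
illustration. The assumption is that each MPI rank would call something
like this with its own file name, and that the per-job step budget
stands in for your 24-hour wall-clock limit.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to path via a temp file and rename, so a job killed
    mid-write never leaves a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # replaces the old checkpoint in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, max_steps_this_job, checkpoint_every=10):
    """Resume from the checkpoint at path and advance until the work is
    done or this job's step budget is used up."""
    # Only the data the application actually needs goes into the state.
    state = load_checkpoint(path, {"step": 0, "total": 0.0})
    steps_done = 0
    while state["step"] < total_steps and steps_done < max_steps_this_job:
        state["total"] += state["step"] * 0.5  # stand-in for real work
        state["step"] += 1
        steps_done += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # final save before the job exits
    return state
```

Submitting the same job again just calls run() with the same path, and
it picks up where the previous job stopped. No recompiling, and the
checkpoint holds only the two values the computation depends on.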
I suspect someone on the list can answer your question about the
built-in checkpoint/restart code better than I can. But in general, if
you have a choice between checkpointing external and internal to your
application, choose the application-internal checkpointing.
Fulton Supercomputing Lab
Brigham Young University
On 07/19/2013 01:34 PM, Erik Nelson wrote:
> I run mpi on an NSF computer. One of the conditions of use is that jobs
> are limited to 24 hr duration to provide democratic allotment to its
> users.
> A long program can require many restarts, so it becomes necessary to
> store the state of the program in memory, print it, recompile, and read
> the state to start
> I seem to remember a simpler approach (check point restart?) in which
> the state of the .exe code is saved and then simply restarted from its
> current position.
> Is there something like this for restarting an mpi program?
> Thanks, Erik
> Erik Nelson
> Howard Hughes Medical Institute
> 6001 Forest Park Blvd., Room ND10.124
> Dallas, Texas 75235-9050
> p : 214 645 5981
> f : 214 645 5948