The amount of checkpoint overhead is specific to the application and the system configuration, so it is impossible to give a good general answer for how much checkpoint overhead to expect for your application and system setup.
BLCR is used only to capture the image of a single process. The coordination of the distributed checkpoint includes:
- the time to initiate the checkpoint,
- the time to marshal the network (we currently use an all-to-all bookmark exchange, similar to what LAM/MPI used),
- the time to store the local checkpoints to stable storage,
- the time to verify that all of the local checkpoints have been stored successfully, and
- the time to return the handle to the end user.
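
To make the bookmark exchange step a bit more concrete, here is a minimal single-process simulation in Python. It is only a sketch of the general idea, not Open MPI code: it assumes each process tells every peer how many messages it has sent to that peer, and each receiver then drains its incoming channels until its received count matches. The function name and the `sent`, `received`, and `in_flight` structures are hypothetical, made up for the illustration.

```python
from collections import deque

def drain_channels(sent, received, in_flight):
    """Simulate quiescing the network with an all-to-all bookmark exchange.

    sent[i][j]        : number of messages process i has sent to process j
    received[j][i]    : number of messages process j has received from i so far
    in_flight[(i, j)] : deque of messages still in the network from i to j

    Returns a dict mapping each receiver to the messages it had to drain.
    """
    n = len(sent)
    # Step 1: all-to-all exchange. Receiver j learns from every sender i
    # how many messages i has sent to it (the "bookmark").
    bookmarks = [[sent[i][j] for i in range(n)] for j in range(n)]

    # Step 2: each receiver drains its channels until its received count
    # matches the bookmark from every sender.
    drained = {}
    for j in range(n):
        for i in range(n):
            while received[j][i] < bookmarks[j][i]:
                msg = in_flight[(i, j)].popleft()
                drained.setdefault(j, []).append(msg)
                received[j][i] += 1
    return drained


if __name__ == "__main__":
    # Two processes: 0 sent two messages to 1 (one still in flight),
    # and 1 sent one message to 0 (still in flight).
    sent = [[0, 2], [1, 0]]
    received = [[0, 0], [1, 0]]
    in_flight = {(0, 1): deque(["m2"]), (1, 0): deque(["r1"])}
    print(drain_channels(sent, received, in_flight))
```

Once every receiver's count matches its bookmarks, no messages are left in the network and the local checkpoints can be taken safely.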
The bulk of the time is spent saving the local checkpoints (a.k.a. snapshots) to stable storage. By default, Open MPI saves directly to a globally mounted storage device, so the application is stalled until the checkpoint is complete (checkpoint overhead = checkpoint latency). You can also enable checkpoint staging, in which the application saves the checkpoint to a local disk, after which the local daemon stages the file back to stable storage while the application continues execution (checkpoint overhead << checkpoint latency).
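
A back-of-the-envelope model of the difference between the two modes, written in Python. This is a rough sketch under simplifying assumptions: it ignores the coordination time and any contention between the staging traffic and the running application, and the function name, parameters, and bandwidth figures are hypothetical.

```python
def checkpoint_cost(size_gb, global_bw_gbps, local_bw_gbps=None):
    """Return (overhead_s, latency_s) for a single checkpoint.

    overhead: time the application is stalled.
    latency : time until the checkpoint is safely on stable storage.

    size_gb        : size of the local checkpoint in GB
    global_bw_gbps : write bandwidth to the global storage, in GB/s
    local_bw_gbps  : write bandwidth to the local disk, in GB/s;
                     None means no staging (direct write to global storage)
    """
    if local_bw_gbps is None:
        # Direct mode: the application waits for the write to stable storage.
        t = size_gb / global_bw_gbps
        return t, t
    # Staged mode: the application only waits for the local write; the
    # transfer to stable storage happens in the background afterwards.
    stall = size_gb / local_bw_gbps
    stage = size_gb / global_bw_gbps
    return stall, stall + stage


if __name__ == "__main__":
    # Hypothetical numbers: 10 GB checkpoint, 1 GB/s global, 5 GB/s local.
    print(checkpoint_cost(10, 1))     # direct: overhead equals latency
    print(checkpoint_cost(10, 1, 5))  # staged: overhead well below latency
```

In this model staging slightly increases the latency, since the data is written twice, but it cuts the time the application is stalled to the cost of the local write.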
If you are concerned with scaling, definitely look at the staging technique.
Does that help?
On Jul 7, 2010, at 12:25 PM, Nguyen Toan wrote:
> Hello everyone,
> I have a question concerning the checkpoint overhead in Open MPI, which is the difference taken from the runtime of application execution with and without checkpoint.
> I observe that when the data size and the number of processes increases, the runtime of BLCR is very small compared to the overall checkpoint overhead in Open MPI. Is it because of the increase of coordination time for checkpoint? And what is included in the overall checkpoint overhead besides the BLCR's checkpoint overhead and coordination time?
> Thank you.
> Best Regards,
> Nguyen Toan
> users mailing list