Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Checkpoint an MPI process
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2012-01-20 09:13:08


Open MPI has the ability to migrate a subset of processes (in the trunk,
though currently broken due to recent code movement; I'm slowly developing
the fix in my spare time). The current implementation only checkpoints the
migrating processes, but suspends all other processes during the migration
activity. There has been some work on providing more of a live migration
mechanism in Open MPI (where non-migrating processes are not suspended),
but I do not know the state of that work. The original work was integrated
into LAM/MPI by Chao Wang and Frank Mueller at North Carolina State
University and depended on some as-yet-unreleased features of BLCR.

Open MPI also has the ability to suspend a job via SIGSTOP/SIGCONT without
the need for a checkpoint, but it applies to the whole job. A while back, I
enhanced that feature such that a checkpoint is established before the
SIGSTOP is processed, so that a user can terminate and restart the job if
they wish instead of just being able to SIGCONT.
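
[Editor's illustration] The stop/continue behaviour described above can be tried on a single local process with plain POSIX signals. This is only a minimal sketch and not how Open MPI implements job suspension: the helper names `suspend`, `resume`, and `wait_until_stopped` are hypothetical, and the sleeping child is a stand-in for a real worker. It also takes no checkpoint, so a stopped process can only be continued, not restarted elsewhere.

```python
import os
import signal
import subprocess
import sys


def suspend(pid: int) -> None:
    """Stop a process. SIGSTOP cannot be caught, blocked, or ignored."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid: int) -> None:
    """Let a stopped process continue where it left off."""
    os.kill(pid, signal.SIGCONT)


def wait_until_stopped(pid: int) -> bool:
    """Block until child `pid` changes state; return True if it stopped.

    Works only for direct children of the caller (POSIX waitpid semantics).
    """
    _, status = os.waitpid(pid, os.WUNTRACED)
    return os.WIFSTOPPED(status)


if __name__ == "__main__":
    # Hypothetical stand-in for a long-running worker process.
    worker = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    suspend(worker.pid)
    print("stopped:", wait_until_stopped(worker.pid))
    resume(worker.pid)
    worker.terminate()  # clean up the demo child
    worker.wait()
```

Signalling one PID like this affects only that process; peers that are blocked waiting on messages from it will simply wait until it is continued.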

So these features are not quite what you are looking for, but could be used
as a starting point for future development if someone was so motivated. A
short term alternative is to use a virtual machine that provides the
migration functionality you are looking for, though at the additional cost
of a virtual machine interposition layer.
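
[Editor's illustration] For the application-internal route recommended in the quoted messages below, a minimal sketch of what "the application serializes and stores its own data" can look like follows. Everything here is an assumption made for illustration: the state layout (`next_index`, `running_sum`), the JSON format, and the function names are hypothetical, and in an MPI program each rank would presumably write its own file and would still have to deal with in-flight messages, which this sketch ignores.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write `state` atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the saved state, or a copy of `default` if none exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return dict(default)


def process_stream(items, checkpoint_path, max_items=None):
    """Sum `items`, checkpointing after each one; resume from the last save.

    `max_items` limits the work done in this call, to simulate being
    suspended or moved to another machine partway through.
    """
    state = load_checkpoint(
        checkpoint_path, {"next_index": 0, "running_sum": 0}
    )
    processed = 0
    for i in range(state["next_index"], len(items)):
        if max_items is not None and processed >= max_items:
            break
        state["running_sum"] += items[i]
        state["next_index"] = i + 1
        save_checkpoint(checkpoint_path, state)
        processed += 1
    return state["running_sum"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        ckpt = os.path.join(d, "worker.ckpt")
        print(process_stream([1, 2, 3, 4], ckpt, max_items=2))  # partial run
        print(process_stream([1, 2, 3, 4], ckpt))  # resumes from checkpoint
```

Because only the two fields the application actually needs are stored, the checkpoint stays tiny, which is the advantage over a generic tool that must save the whole process image.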

-- Josh

On Fri, Jan 20, 2012 at 8:31 AM, Rodrigo Oliveira <rsilva.oliveira_at_[hidden]> wrote:

> I appreciate your help.
> Indeed, it's better to create my own mechanism, as Lloyd mentioned.
> Actually, my application is a framework for stream processing
> (something like IBM System-S), in which I use Open MPI as the communication
> layer and for part of the process management. One of this framework's
> features is to provide a dynamic load-balancing mechanism. In some
> situations I need to move processes between machines or temporarily
> suspend their execution. To achieve this, I need a checkpoint/restart
> mechanism. That is the reason for my question.
> Thanks again.
> Rodrigo Silva Oliveira
> M.Sc. Student - Computer Science
> Universidade Federal de Minas Gerais
> On Thu, Jan 19, 2012 at 1:18 PM, Lloyd Brown <lloyd_brown_at_[hidden]> wrote:
>> Since you're looking for a function call, I'm going to assume that you
>> are writing this application, and it's not a pre-compiled, commercial
>> application. Given that, it's going to be significantly better to have
>> an internal application checkpointing mechanism, where it serializes and
>> stores the data, etc., than to use an external, application-agnostic
>> checkpointing mechanism like BLCR or similar. The application should be
>> aware of what data is important, how to most efficiently store it, etc.
>> A generic library has to assume that everything is important, and store
>> it all.
>> Don't get me wrong. Libraries like BLCR are great for applications that
>> don't have that visibility, and even as a tool for the
>> application-internal checkpointing mechanism (where the application
>> deliberately interacts with the library to annotate what's important to
>> store, and how to do so, etc.). But if you're writing the application,
>> you're better off handling it internally than externally.
>> Lloyd Brown
>> Systems Administrator
>> Fulton Supercomputing Lab
>> Brigham Young University
>> On 01/19/2012 08:05 AM, Josh Hursey wrote:
>> > Currently Open MPI only supports the checkpointing of the whole
>> > application. There has been some work on uncoordinated checkpointing
>> > with message logging, though I do not know the state of that work with
>> > regard to availability. That work has been undertaken by the University
>> > of Tennessee Knoxville, so maybe they can provide more information.
>> >
>> > -- Josh
>> >
>> > On Wed, Jan 18, 2012 at 3:24 PM, Rodrigo Oliveira
>> > <rsilva.oliveira_at_[hidden]> wrote:
>> >
>> > Hi,
>> >
>> > I'd like to know if there is a way to checkpoint a specific process
>> > running under an mpirun call. In other words, is there a function
>> > CHECKPOINT(rank) in which I can pass the rank of the process I want
>> > to checkpoint? I do not want to checkpoint the entire application,
>> > but just one of its processes.
>> >
>> > Thanks
>> >
>> > _______________________________________________
>> > users mailing list
>> > users_at_[hidden] <mailto:users_at_[hidden]>
>> >
>> > --
>> > Joshua Hursey
>> > Postdoctoral Research Associate
>> > Oak Ridge National Laboratory

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory