Rodrigo,

Open MPI has the ability to migrate a subset of processes (in the trunk, though it is currently broken due to recent code movement; I'm slowly developing the fix in my spare time). The current implementation checkpoints only the migrating processes, but suspends all other processes during the migration. There has been some work on providing more of a live migration mechanism in Open MPI (where non-migrating processes are not suspended), but I do not know the state of that work. The original work was integrated into LAM/MPI by Chao Wang and Frank Mueller at North Carolina State University and depended on some as-yet unreleased features of BLCR.

Open MPI also has the ability to suspend a job via SIGSTOP/SIGCONT without the need for a checkpoint, but it applies to the whole job. A while back, I enhanced that feature so that a checkpoint is established before the SIGSTOP is processed. That way a user can terminate and restart the job if they wish, instead of only being able to SIGCONT.
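
For readers unfamiliar with the underlying primitive, the following is a minimal sketch of suspending and resuming a single child process with SIGSTOP/SIGCONT on a POSIX system. It is not Open MPI's implementation and does not involve mpirun; the helper names (suspend, resume, wait_until_stopped, wait_until_continued) are hypothetical and exist only for this illustration.

```python
import os
import signal
import subprocess
import sys


def suspend(pid):
    """Ask the kernel to stop the process; SIGSTOP cannot be caught or ignored."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid):
    """Let a stopped process continue from where it was stopped."""
    os.kill(pid, signal.SIGCONT)


def wait_until_stopped(pid):
    """Block until a direct child reports that it has stopped (POSIX only)."""
    _, status = os.waitpid(pid, os.WUNTRACED)
    return os.WIFSTOPPED(status)


def wait_until_continued(pid):
    """Block until a direct child reports that it has been continued (POSIX only)."""
    _, status = os.waitpid(pid, os.WCONTINUED)
    return os.WIFCONTINUED(status)


if __name__ == "__main__":
    # A stand-in worker; in a real job this would be one of the application processes.
    worker = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    suspend(worker.pid)
    print("stopped:", wait_until_stopped(worker.pid))
    resume(worker.pid)
    print("continued:", wait_until_continued(worker.pid))
    worker.kill()
    worker.wait()
```

Note that a stopped process keeps its memory and open connections but makes no progress, so any peer waiting on a message from it will also wait. That is one reason suspending a single rank of a tightly coupled job is harder than suspending the whole job.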

So these features are not quite what you are looking for, but they could be used as a starting point for future development if someone were so motivated. A short-term alternative is to use a virtual machine that provides the migration functionality you are looking for, though at the additional cost of a virtual machine interposition layer.

-- Josh

On Fri, Jan 20, 2012 at 8:31 AM, Rodrigo Oliveira <rsilva.oliveira@gmail.com> wrote:
I appreciate your help.

Indeed, it's better to create my own mechanism, as Lloyd mentioned. My application is actually a framework for stream processing (something like IBM System-S), in which I use Open MPI as the communication layer and for part of the process management. One of the framework's features is a dynamic load-balancing mechanism. In some situations I need to move processes between machines or temporarily suspend their execution. To achieve this, I need a checkpoint/restart mechanism. That is the reason for my question.

Thanks again.


Rodrigo Silva Oliveira
M.Sc. Student - Computer Science
Universidade Federal de Minas Gerais
www.dcc.ufmg.br/~rsilva





On Thu, Jan 19, 2012 at 1:18 PM, Lloyd Brown <lloyd_brown@byu.edu> wrote:
Since you're looking for a function call, I'm going to assume that you
are writing this application, and it's not a pre-compiled, commercial
application.  Given that, it's going to be significantly better to have
an internal application checkpointing mechanism, where it serializes and
stores the data, etc., than to use an external, application-agnostic
checkpointing mechanism like BLCR or similar.  The application should be
aware of what data is important, how to most efficiently store it, etc.
 A generic library has to assume that everything is important, and store
it all.

Don't get me wrong.  Libraries like BLCR are great for applications that
don't have that visibility, and even as a tool for the
application-internal checkpointing mechanism (where the application
deliberately interacts with the library to annotate what's important to
store, and how to do so, etc.).  But if you're writing the application,
you're better off handling it internally than externally.
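
To make the application-internal approach concrete, here is a minimal sketch in Python. It is not taken from any code in this thread: the state layout (next_index, total), the function names, and the JSON file format are all assumptions chosen for illustration. The idea is simply that the application saves only what it needs to resume, and writes it atomically so a crash mid-write never leaves a corrupt checkpoint.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temp file, then atomically rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes are on disk before the rename
        os.replace(tmp_path, path)  # readers see either the old or the new file
    except BaseException:
        os.unlink(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the saved state, or `default` when no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run(items, path, stop_after=None):
    """Hypothetical worker: sums `items`, checkpointing after every item.

    `stop_after` simulates the process being suspended or migrated early.
    """
    state = load_checkpoint(path, {"next_index": 0, "total": 0})
    processed = 0
    while state["next_index"] < len(items):
        if stop_after is not None and processed == stop_after:
            return state
        state["total"] += items[state["next_index"]]
        state["next_index"] += 1
        save_checkpoint(path, state)
        processed += 1
    return state


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "worker.ckpt")
    print(run([1, 2, 3, 4, 5], ckpt, stop_after=2))  # interrupted part-way
    print(run([1, 2, 3, 4, 5], ckpt))                # resumes from the checkpoint
```

If the checkpoint file lives on storage visible to both machines, the second call could just as well run on a different host, which is the "move a process" case. Checkpointing after every item is only for clarity; a real application would pick an interval that balances overhead against lost work.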

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 01/19/2012 08:05 AM, Josh Hursey wrote:
> Currently Open MPI only supports the checkpointing of the whole
> application. There has been some work on uncoordinated checkpointing
> with message logging, though I do not know the state of that work with
> regards to availability. That work has been undertaken by the University
> of Tennessee Knoxville, so maybe they can provide more information.
>
> -- Josh
>
> On Wed, Jan 18, 2012 at 3:24 PM, Rodrigo Oliveira
> <rsilva.oliveira@gmail.com> wrote:
>
>     Hi,
>
>     I'd like to know if there is a way to checkpoint a specific process
>     running under an mpirun call. In other words, is there a function
>     CHECKPOINT(rank) in which I can pass the rank of the process I want
>     to checkpoint? I do not want to checkpoint the entire application,
>     but just one of its processes.
>
>     Thanks
>
>     _______________________________________________
>     users mailing list
>     users@open-mpi.org
>     http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
>
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
>
>
>


