Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Checkpoint an MPI process
From: Rodrigo Oliveira (rsilva.oliveira_at_[hidden])
Date: 2012-01-20 08:31:12


I appreciate your help.

Indeed, it's better to create my own mechanism, as Lloyd mentioned. Actually,
my application is a framework for stream processing (something like IBM
System-S), in which I use Open MPI as the communication layer and for part of
the process management. One of the framework's features is a dynamic
load-balancing mechanism. In some situations I need to move processes
between machines or temporarily suspend their execution. To achieve this, I
need a checkpoint/restart mechanism. That is the reason for my question.
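
For illustration, here is a minimal sketch of the application-level approach discussed in this thread: each process serializes only the state it needs to resume, keyed by its rank. It is not an Open MPI API. The function names, the file-naming scheme, and the example state fields are all hypothetical, and the sketch assumes the checkpoint directory sits on storage that both the old and the new machine can reach. In a real MPI program the rank would come from the MPI library (for example `MPI_Comm_rank`); here it is passed in as a plain integer so the example runs without MPI.

```python
import json
import os
import tempfile


def checkpoint_path(directory, rank):
    """File holding the saved state of one rank (hypothetical naming scheme)."""
    return os.path.join(directory, f"rank-{rank}.ckpt.json")


def save_checkpoint(directory, rank, state):
    """Serialize only the state this rank needs in order to resume."""
    os.makedirs(directory, exist_ok=True)
    # Write to a temporary file and rename it afterwards, so a crash in the
    # middle of a write never leaves a truncated checkpoint behind.
    fd, tmp_name = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_name, checkpoint_path(directory, rank))


def load_checkpoint(directory, rank, default=None):
    """Return the saved state for this rank, or `default` if none exists."""
    try:
        with open(checkpoint_path(directory, rank)) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


if __name__ == "__main__":
    ckpt_dir = tempfile.mkdtemp()  # stand-in for a shared directory
    rank = 3                       # would come from the MPI library
    # Hypothetical state of one stream-processing operator.
    state = {"tuples_processed": 1200, "window": [4, 8, 15]}

    save_checkpoint(ckpt_dir, rank, state)      # before suspending or moving
    restored = load_checkpoint(ckpt_dir, rank)  # after restarting elsewhere
    print(restored == state)
```

Because the application chooses what goes into `state`, the checkpoint stays small compared with an image of the whole process, which is the trade-off Lloyd describes below. Messages still in flight between processes are not covered by this sketch and would need to be drained or logged separately.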

Thanks again.

Rodrigo Silva Oliveira
M.Sc. Student - Computer Science
Universidade Federal de Minas Gerais
www.dcc.ufmg.br/~rsilva

On Thu, Jan 19, 2012 at 1:18 PM, Lloyd Brown <lloyd_brown_at_[hidden]> wrote:

> Since you're looking for a function call, I'm going to assume that you
> are writing this application, and it's not a pre-compiled, commercial
> application. Given that, it's going to be significantly better to have
> an internal application checkpointing mechanism, where it serializes and
> stores the data, etc., than to use an external, application-agnostic
> checkpointing mechanism like BLCR or similar. The application should be
> aware of what data is important, how to most efficiently store it, etc.
> A generic library has to assume that everything is important, and store
> it all.
>
> Don't get me wrong. Libraries like BLCR are great for applications that
> don't have that visibility, and even as a tool for the
> application-internal checkpointing mechanism (where the application
> deliberately interacts with the library to annotate what's important to
> store, and how to do so, etc.). But if you're writing the application,
> you're better off handling it internally than externally.
>
> Lloyd Brown
> Systems Administrator
> Fulton Supercomputing Lab
> Brigham Young University
> http://marylou.byu.edu
>
> On 01/19/2012 08:05 AM, Josh Hursey wrote:
> > Currently Open MPI only supports the checkpointing of the whole
> > application. There has been some work on uncoordinated checkpointing
> > with message logging, though I do not know the state of that work with
> > regard to availability. That work has been undertaken by the University
> > of Tennessee Knoxville, so maybe they can provide more information.
> >
> > -- Josh
> >
> > On Wed, Jan 18, 2012 at 3:24 PM, Rodrigo Oliveira
> > <rsilva.oliveira_at_[hidden]> wrote:
> >
> > Hi,
> >
> > I'd like to know if there is a way to checkpoint a specific process
> > running under an mpirun call. In other words, is there a function
> > CHECKPOINT(rank) in which I can pass the rank of the process I want
> > to checkpoint? I do not want to checkpoint the entire application,
> > but just one of its processes.
> >
> > Thanks
> >
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> > --
> > Joshua Hursey
> > Postdoctoral Research Associate
> > Oak Ridge National Laboratory
> > http://users.nccs.gov/~jjhursey