Open MPI User's Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI users] Checkpoint an MPI process
From: Rodrigo Oliveira (rsilva.oliveira_at_[hidden])
Date: 2012-01-20 08:31:12


I appreciate your help.

Indeed, it's better to create my own mechanism, as Lloyd mentioned. Actually,
my application is a framework for stream processing (something like IBM
System-S), in which I use Open MPI as the communication layer and for part of
the process management. One of the framework's features is a dynamic load
balancing mechanism. In some situations I need to move processes
between machines or temporarily suspend their execution. To achieve this, I
need a checkpoint/restart mechanism. That is the reason for my question.
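
As a rough illustration of the application-level approach discussed in this thread, the sketch below shows one way a single process could save and restore only the state it cares about. It is a minimal sketch, not part of Open MPI: the function names, the one-file-per-rank naming scheme, and the toy `process_stream` operator are all hypothetical, and the rank is passed in as a plain integer (in a real MPI program it would come from the communicator, e.g. `MPI_Comm_rank`).

```python
import os
import pickle
import tempfile


def checkpoint_path(directory, rank):
    """Hypothetical naming scheme: one checkpoint file per rank."""
    return os.path.join(directory, f"rank{rank:05d}.ckpt")


def save_checkpoint(directory, rank, state):
    """Serialize only the state the application considers important."""
    final_path = checkpoint_path(directory, rank)
    # Write to a temporary file in the same directory, then rename it, so a
    # crash in the middle of a write never leaves a truncated checkpoint.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path


def load_checkpoint(directory, rank):
    """Return the saved state for this rank, or None if none exists."""
    try:
        with open(checkpoint_path(directory, rank), "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return None


def process_stream(items, directory, rank, checkpoint_every=2):
    """Toy stream operator (assumption): sums items, resuming if possible."""
    state = load_checkpoint(directory, rank) or {"next_index": 0, "total": 0}
    for i in range(state["next_index"], len(items)):
        state["total"] += items[i]
        state["next_index"] = i + 1
        if state["next_index"] % checkpoint_every == 0:
            save_checkpoint(directory, rank, state)
    return state
```

If the checkpoint directory is on storage that both machines can see, a process restarted elsewhere with the same rank would pick up from the last saved index. The sketch deliberately ignores messages that are still in transit; a real migration would have to drain or log those before the process is suspended.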

Thanks again.

Rodrigo Silva Oliveira
M.Sc. Student - Computer Science
Universidade Federal de Minas Gerais
www.dcc.ufmg.br/~rsilva

On Thu, Jan 19, 2012 at 1:18 PM, Lloyd Brown <lloyd_brown_at_[hidden]> wrote:

> Since you're looking for a function call, I'm going to assume that you
> are writing this application, and it's not a pre-compiled, commercial
> application. Given that, it's going to be significantly better to have
> an internal application checkpointing mechanism, where it serializes and
> stores the data, etc., than to use an external, application-agnostic
> checkpointing mechanism like BLCR or similar. The application should be
> aware of what data is important, how to most efficiently store it, etc.
> A generic library has to assume that everything is important, and store
> it all.
>
> Don't get me wrong. Libraries like BLCR are great for applications that
> don't have that visibility, and even as a tool for the
> application-internal checkpointing mechanism (where the application
> deliberately interacts with the library to annotate what's important to
> store, and how to do so, etc.). But if you're writing the application,
> you're better off handling it internally than externally.
>
> Lloyd Brown
> Systems Administrator
> Fulton Supercomputing Lab
> Brigham Young University
> http://marylou.byu.edu
>
> On 01/19/2012 08:05 AM, Josh Hursey wrote:
> > Currently Open MPI only supports the checkpointing of the whole
> > application. There has been some work on uncoordinated checkpointing
> > with message logging, though I do not know the state of that work with
> > regard to availability. That work has been undertaken by the University
> > of Tennessee Knoxville, so maybe they can provide more information.
> >
> > -- Josh
> >
> > On Wed, Jan 18, 2012 at 3:24 PM, Rodrigo Oliveira
> > <rsilva.oliveira_at_[hidden]> wrote:
> >
> > Hi,
> >
> > I'd like to know if there is a way to checkpoint a specific process
> > running under an mpirun call. In other words, is there a function
> > CHECKPOINT(rank) in which I can pass the rank of the process I want
> > to checkpoint? I do not want to checkpoint the entire application,
> > but just one of its processes.
> >
> > Thanks
> >
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> >
> >
> > --
> > Joshua Hursey
> > Postdoctoral Research Associate
> > Oak Ridge National Laboratory
> > http://users.nccs.gov/~jjhursey
> >
> >
> >