Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Using BLCR tools to checkpoint Open MPI applications
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2011-08-02 09:13:03


Eric,

Thanks for the great work on this integration. I filed a ticket for
the problem areas that you highlighted with the Open MPI side of the
integration so we do not lose track of them.
   https://svn.open-mpi.org/trac/ompi/ticket/2842

Hopefully we will get some cycles to address these issues in the near term.

Thanks,
Josh

On Wed, Jul 27, 2011 at 3:52 PM, Eric Roman <ERoman_at_[hidden]> wrote:
>
> Dear Open MPI Developers,
>
> We've been working on using Torque's checkpoint/restart support, along with BLCR
> and Open MPI's C/R support, to perform C/R on parallel jobs running under
> Torque.  The main issue here is that Open MPI requires the use of
> ompi-checkpoint and ompi-restart commands to checkpoint the application, but
> Torque uses cr_checkpoint and cr_restart to checkpoint job scripts, so an
> adapter is required between the two interfaces.  I've written a small program,
> called cr_mpirun, that serves this purpose.
>
> This code, now available on the BLCR web site, should enable you to use the
> BLCR cr_checkpoint and cr_restart commands to checkpoint Open MPI applications.
> You can download it at the following URL:
>
> https://upc-bugs.lbl.gov/blcr-dist/cr_mpirun/cr_mpirun-210.tar.gz
>
> This code can be used fairly reliably to invoke cr_checkpoint and cr_restart on
> Open MPI applications.  In turn, this enables you to use Torque's
> checkpoint/restart support on parallel jobs.  I've tested mainly with qhold and
> qrls, but have also experimented with using Maui's preemptee and preemptor
> classes.
>
> This is a development release: it is intended for testing and is not
> suitable for general production use.  There
> are a number of issues that need to be worked out, and we need feedback from
> Torque and Open MPI developers, and from users interested in testing or filing
> bug reports.
>
> There is a list of known issues documented in the BUGS file in the release.
> There are HOWTO files in the release that describe the implementation,
> workarounds for current problems, and usage of cr_mpirun.
>
> Thanks for your interest.
>
> Sincerely,
> Eric Roman

-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey