Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2008-04-29 09:54:10


On Apr 29, 2008, at 12:55 AM, Sharon Brunett wrote:

> I'm finding that using ompi-checkpoint on an application which is
> very cpu bound takes a very very long time. For example, trying to
> checkpoint a 4 or 8 way Pallas MPI Benchmark application can take
> more than an hour. The problem is not where I'm dumping checkpoints
> (I've tried local and an nfs mount with plenty of space, and cpu
> intensive apps checkpoint quickly).
>
> I'm using BLCR_VERSION=0.6.5 and openmpi-1.3a1r18241.
>
> Is this condition common and if so, are there possibly MCA parameters
> which could help?

It depends on how you configured Open MPI with checkpoint/restart.
There are two modes of operation: No threads, and with a checkpoint
thread. They are described a bit more in the Checkpoint/Restart Fault
Tolerance User's Guide on the wiki:
   https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR

By default we compile without the checkpoint thread. The restriction
here is that all processes must be in the MPI library in order to make
progress on the global checkpoint. For CPU intensive applications this
may cause quite a delay in the time to start, and subsequently finish,
a checkpoint. I'm guessing that this is what you are seeing.

If you configure with the checkpoint thread (add '--enable-mpi-threads
--enable-ft-thread' to ./configure) then Open MPI will create a thread
that runs with each application process. This thread is fairly
lightweight and will make sure that a checkpoint progresses even when the
process is not in the Open MPI library.
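
For reference, a rebuild along those lines might look like the sketch
below. Only the two thread flags come from the advice above; the
--with-ft=cr and --with-blcr options, the install prefix, the
'-am ft-enable-cr' run-time setting, and the application name are
assumptions about a typical checkpoint/restart build, so check them
against './configure --help' and the wiki page for your snapshot.

```shell
# Sketch only: the prefix and BLCR path are placeholders, adjust for your system.
./configure --prefix="$HOME/ompi-ft" \
    --with-ft=cr \
    --with-blcr=/path/to/blcr \
    --enable-mpi-threads \
    --enable-ft-thread
make all install

# Launch with checkpoint/restart enabled (assumed run-time setting).
# ./my_app is a hypothetical application name.
mpirun -np 4 -am ft-enable-cr ./my_app &
MPIRUN_PID=$!

# Time a checkpoint request against the running job.
time ompi-checkpoint "$MPIRUN_PID"
```

Comparing that time with what the current build gives should show
whether the checkpoint thread makes the difference.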

Try enabling the checkpoint thread and see if that helps improve the
checkpoint time.

-- Josh

>
>
> Thanks,
> Sharon