Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue
From: Sharon Brunett (sharon_at_[hidden])
Date: 2008-04-29 15:52:21

Thanks for the quick response. I'll test against some key applications
that we would like to checkpoint and restart with BLCR. Perhaps, if
we're lucky and careful, we'll be able to get some near-term use out of
what we have installed.


Josh Hursey wrote:
> Sharon,
> This is, unfortunately, to be expected at the moment for this type of
> application. Extremely communication intensive applications will most
> likely cause the implementation of the current coordination algorithm
> to slow down significantly. This is because on a checkpoint Open MPI
> does a peerwise check on the description of (possibly) each message to
> make sure there are no messages in flight. So for a huge number of
> messages this could take a long time.
> This is a performance problem with the current implementation of the
> algorithm that we use in Open MPI. I've been meaning to go back and
> improve this, but it has not been critical to do so since applications
> that perform in this manner are outliers in HPC. The coordination
> algorithm I'm using is based on the algorithm used by LAM/MPI, but
> implemented at a higher level. There are a number of improvements that
> I can explore in the checkpoint/restart framework in Open MPI.
> If this is critical for you I might be able to take a look at it, but
> I can't say when. :(
> -- Josh
> On Apr 29, 2008, at 1:07 PM, Sharon Brunett wrote:
>> Josh Hursey wrote:
>>> On Apr 29, 2008, at 12:55 AM, Sharon Brunett wrote:
>>>> I'm finding that using ompi-checkpoint on an application which is
>>>> very cpu bound takes a very very long time. For example, trying to
>>>> checkpoint a 4 or 8 way Pallas MPI Benchmark application can take
>>>> more than an hour. The problem is not where I'm dumping checkpoints
>>>> (I've tried local and an nfs mount with plenty of space, and cpu
>>>> intensive apps checkpoint quickly).
>>>> I'm using BLCR_VERSION=0.6.5 and openmpi-1.3a1r18241.
>>>> Is this condition common and, if so, are there MCA parameters
>>>> which could help?
>>> It depends on how you configured Open MPI with checkpoint/restart.
>>> There are two modes of operation: No threads, and with a checkpoint
>>> thread. They are described a bit more in the Checkpoint/Restart Fault
>>> Tolerance User's Guide on the wiki:
>>> By default we compile without the checkpoint thread. The restriction
>>> here is that all processes must be in the MPI library in order to make
>>> progress on the global checkpoint. For CPU intensive applications this
>>> may cause quite a delay in the time to start, and subsequently finish,
>>> a checkpoint. I'm guessing that this is what you are seeing.
>>> If you configure with the checkpoint thread (add
>>> '--enable-mpi-threads --enable-ft-thread' to ./configure) then Open MPI
>>> will create a thread that runs with each application process. This
>>> thread is fairly lightweight and will make sure that a checkpoint
>>> progresses even when the process is not in the Open MPI library.
>>> Try enabling the checkpoint thread and see if that helps improve the
>>> checkpoint time.
>> Josh,
>> First...please pardon the blunder in my earlier mail. Comms bound apps
>> are the ones taking a while to checkpoint, not cpu bound. In any case, I
>> tried configuring with the above two configure options but still no luck
>> on improving checkpointing times or gaining completion on larger mpi
>> task runs being checkpointed.
>> It looks like the checkpointing is just hanging. For example, I can
>> checkpoint a 2 way comms bound code (1 task on two nodes) ok. When I ask
>> for a 4 way run on 2 nodes, 30 minutes after running ompi-checkpoint PID
>> I only see 1 ckpt directory with data in it!
>> /home/sharon/ompi_global_snapshot_25400.ckpt/0
>> -bash-2.05b$ ls -l *
>> opal_snapshot_0.ckpt:
>> total 0
>> opal_snapshot_1.ckpt:
>> total 0
>> opal_snapshot_2.ckpt:
>> total 0
>> opal_snapshot_3.ckpt:
>> total 1868
>> -rw------- 1 sharon shc-support 1907476 2008-04-29 10:49 ompi_blcr_context.1850
>> -rw-r--r-- 1 sharon shc-support 33 2008-04-29 10:49
>> -bash-2.05b$ pwd
>> The file system getting the checkpoints is local. I've tried /scratch
>> and others as well.
>> I can checkpoint some codes (like xhpl) just fine across 8 mpi tasks
>> ( t nodes), dumping 254M total. Thus, the very long/stuck checkpointing
>> seems rather application dependent.
>> Here's how I configured Open MPI:
>> ./configure --prefix=/nfs/ds01/support/sharon/openmpi-1.3a1r18241
>> --enable-mpi-threads --enable-ft-thread --with-ft=cr --enable-shared
>> --enable-mpi-threads=posix --enable-libgcj-multifile
>> --enable-languages=c,c++,objc,java,f95,ada --enable-java-awt=gtk
>> --with-mvapi=/usr/mellanox --with-blcr=/opt/blcr
>> Thanks for any further insights you may have.
>> Sharon
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
> _______________________________________________
> users mailing list
> users_at_[hidden]
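
As an illustration of the peerwise check Josh describes above (making sure
no messages are in flight between any pair of processes before a checkpoint
is taken), here is a minimal Python sketch. It is a toy model under stated
assumptions, not Open MPI's implementation: the Process class, the per-peer
counters, and the in_flight and coordinate_checkpoint helpers are all
hypothetical names, and it compares simple message counts where Josh says
Open MPI may check a description of each message.

```python
from collections import defaultdict


class Process:
    """Toy stand-in for one MPI rank (hypothetical, not an Open MPI structure)."""

    def __init__(self, rank):
        self.rank = rank
        self.sent = defaultdict(int)      # peer rank -> messages sent to that peer
        self.received = defaultdict(int)  # peer rank -> messages received from that peer
        self.inbox = []                   # messages sent to us but not yet received

    def send(self, peer, payload):
        # Record the send locally, then place the message "on the wire".
        self.sent[peer.rank] += 1
        peer.inbox.append((self.rank, payload))

    def receive_one(self):
        # Deliver the oldest pending message and record its source.
        src, payload = self.inbox.pop(0)
        self.received[src] += 1
        return payload


def in_flight(procs):
    """Return {(src, dst): count} for every channel with undelivered messages."""
    pending = {}
    for src in procs:
        for dst in procs:
            if src is dst:
                continue
            # Peerwise comparison: what src says it sent vs. what dst says it got.
            gap = src.sent[dst.rank] - dst.received[src.rank]
            if gap:
                pending[(src.rank, dst.rank)] = gap
    return pending


def coordinate_checkpoint(procs):
    """Drain all channels; return True once no message is in flight."""
    while in_flight(procs):
        for p in procs:
            while p.inbox:
                p.receive_one()
    return True


if __name__ == "__main__":
    a, b = Process(0), Process(1)
    a.send(b, "halo data")
    print(in_flight([a, b]))              # one message pending from rank 0 to rank 1
    print(coordinate_checkpoint([a, b]))  # drains the channel before checkpointing
```

In this sketch every ordered pair of processes is compared and every
outstanding message has to be drained, so a run that exchanges a very large
number of messages gives the coordination step more work. That is in line
with the slowdown for communication-intensive applications discussed in this
thread, though the real costs in Open MPI depend on details not shown here.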