Your previous email indicated that you were using r18241. I committed
in r18276 a patch that should fix this problem. Let me know if you
still see it after that update.
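If you are building from an SVN checkout of the trunk, updating and rebuilding should go roughly like the following (re-use whatever configure options you normally pass; the install prefix below is just a placeholder):

  svn update -r 18276
  ./autogen.sh
  ./configure --prefix=/path/to/install [your usual options]
  make all install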
On Apr 29, 2008, at 3:18 PM, Sharon Brunett wrote:
> I'm also having trouble using ompi-restart on a snapshot made from a run
> which was previously checkpointed. In other words, restarting a
> previously restarted run!
> (a) start the run
> mpirun -np 16 -am ft-enable-cr ./a.out
> <---do an ompi-checkpoint on the mpirun pid from (a) from another window
> (b) restart the checkpointed run
> ompi-restart ompi_global_snapshot_30086.ckpt
> <--do an ompi-checkpoint on mpirun pid from (b) from another window
> (c) restart the checkpointed run
> ompi-restart ompi_global_snapshot_30120.ckpt
> mpirun noticed that process rank 12 with PID 30480 on node shc005 exited
> on signal 13 (Broken pipe).
> I can restart the previous (30086) ckpt but not the latest one made from
> a restarted run.
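> To be explicit, the whole sequence is roughly the following (the mpirun
> PIDs are placeholders; the snapshot names are the ones from my runs above):
> mpirun -np 16 -am ft-enable-cr ./a.out
> ompi-checkpoint <PID of mpirun from (a)>
> ompi-restart ompi_global_snapshot_30086.ckpt
> ompi-checkpoint <PID of mpirun from (b)>
> ompi-restart ompi_global_snapshot_30120.ckpt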
> Any insights would be appreciated.
> Josh Hursey wrote:
>> This is, unfortunately, to be expected at the moment for this type of
>> application. Extremely communication intensive applications will most
>> likely cause the implementation of the current coordination algorithm
>> to slow down significantly. This is because on a checkpoint Open MPI
>> does a peer-wise check on the description of (possibly) each message to
>> make sure there are no messages in flight. So for a huge number of
>> messages this could take a long time.
>> This is a performance problem with the current implementation of the
>> algorithm that we use in Open MPI. I've been meaning to go back and
>> improve this, but it has not been critical to do so since applications
>> that perform in this manner are outliers in HPC. The coordination
>> algorithm I'm using is based on the algorithm used by LAM/MPI, but
>> implemented at a higher level. There are a number of improvements
>> I can explore in the checkpoint/restart framework in Open MPI.
>> If this is critical for you I might be able to take a look at it, but
>> I can't say when. :(
>> -- Josh
>> On Apr 29, 2008, at 1:07 PM, Sharon Brunett wrote:
>>> Josh Hursey wrote:
>>>> On Apr 29, 2008, at 12:55 AM, Sharon Brunett wrote:
>>>>> I'm finding that using ompi-checkpoint on an application which is
>>>>> very cpu bound takes a very very long time. For example, trying to
>>>>> checkpoint a 4 or 8 way Pallas MPI Benchmark application can take
>>>>> more than an hour. The problem is not where I'm dumping the checkpoints
>>>>> (I've tried local and an nfs mount with plenty of space, and cpu
>>>>> intensive apps checkpoint quickly).
>>>>> I'm using BLCR_VERSION=0.6.5 and openmpi-1.3a1r18241.
>>>>> Is this condition common and if so, are there possibly mca parameters
>>>>> which could help?
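>>>>> For example, would something along these lines be worth trying? (I'm
>>>>> not sure the parameter name below is right.)
>>>>> mpirun -np 8 -am ft-enable-cr \
>>>>>   -mca snapc_base_global_snapshot_dir /some/local/path ./a.out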
>>>> It depends on how you configured Open MPI with checkpoint/restart.
>>>> There are two modes of operation: No threads, and with a checkpoint
>>>> thread. They are described a bit more in the Checkpoint/Restart Fault
>>>> Tolerance User's Guide on the wiki.
>>>> By default we compile without the checkpoint thread. The caveat here
>>>> is that all processes must be in the MPI library in order to make
>>>> progress on the global checkpoint. For CPU intensive applications this
>>>> may cause quite a delay in the time to start, and subsequently finish,
>>>> a checkpoint. I'm guessing that this is what you are seeing.
>>>> If you configure with the checkpoint thread (add '--enable-mpi-threads
>>>> --enable-ft-thread' to ./configure) then Open MPI will create a thread
>>>> that runs with each application process. This thread is fairly
>>>> lightweight and will make sure that a checkpoint progresses even when
>>>> the process is not in the Open MPI library.
>>>> Try enabling the checkpoint thread and see if that helps improve
>>>> checkpoint time.
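>>>> For reference, a minimal C/R-enabled configure line would look
>>>> something like this (the paths are just placeholders):
>>>> ./configure --prefix=/path/to/install --with-ft=cr \
>>>>   --with-blcr=/opt/blcr --enable-mpi-threads --enable-ft-thread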
>>> First...please pardon the blunder in my earlier mail. Comms bound codes
>>> are the ones taking a while to checkpoint, not cpu bound. In any case, I
>>> tried configuring with the above two configure options but still no luck
>>> on improving checkpointing times or gaining completion on larger mpi
>>> task runs being checkpointed.
>>> It looks like the checkpointing is just hanging. For example, I can
>>> checkpoint a 2 way comms bound code (1 task on two nodes) ok. When I try
>>> for a 4 way run on 2 nodes, 30 minutes after the ompi-checkpoint PID, I
>>> only see 1 ckpt directory with data in it!
>>> -bash-2.05b$ ls -l *
>>> total 0
>>> total 0
>>> total 0
>>> total 1868
>>> -rw------- 1 sharon shc-support 1907476 2008-04-29 10:49
>>> -rw-r--r-- 1 sharon shc-support 33 2008-04-29 10:49
>>> -bash-2.05b$ pwd
>>> The file system getting the checkpoints is local. I've tried /
>>> and others as well.
>>> I can checkpoint some codes (like xhpl) just fine across 8 mpi tasks (two
>>> nodes), dumping 254M total. Thus, the very long/stuck checkpointing
>>> seems rather application dependent.
>>> Here's how I configured openmpi
>>> ./configure --prefix=/nfs/ds01/support/sharon/openmpi-1.3a1r18241
>>> --enable-mpi-threads --enable-ft-thread --with-ft=cr --enable-shared
>>> --enable-mpi-threads=posix --enable-libgcj-multifile
>>> --enable-languages=c,c++,objc,java,f95,ada --enable-java-awt=gtk
>>> --with-mvapi=/usr/mellanox --with-blcr=/opt/blcr
>>> Thanks for any further insights you may have.