Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue
From: Sharon Brunett (sharon_at_[hidden])
Date: 2008-04-29 16:18:37


Josh,
I'm also having trouble using ompi-restart on a snapshot made from a run
which was itself a restart of an earlier checkpoint. In other words,
restarting a previously restarted run!

(a) start the run
mpirun -np 16 -am ft-enable-cr ./a.out

   <--- do an ompi-checkpoint on the mpirun pid from (a) from another
terminal --->

(b) restart the checkpointed run

ompi-restart ompi_global_snapshot_30086.ckpt

    <--- do an ompi-checkpoint on the mpirun pid from (b) from another
terminal --->

(c) restart the checkpointed run
    ompi-restart ompi_global_snapshot_30120.ckpt

--------------------------------------------------------------------------
mpirun noticed that process rank 12 with PID 30480 on node shc005 exited
on signal 13 (Broken pipe).
--------------------------------------------------------------------------
-bash-2.05b$

I can restart the previous (30086) ckpt but not the latest one made from
a restarted run.
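
For reference, here is the whole sequence in one place. (The PIDs and
snapshot names are just the ones from this particular run; as far as I can
tell, 30086 and 30120 were the mpirun PIDs of runs (a) and (b), which is
where the snapshot names come from.)

  # terminal 1: start the job with checkpoint/restart enabled
  mpirun -np 16 -am ft-enable-cr ./a.out          # mpirun PID 30086

  # terminal 2: checkpoint the original run
  ompi-checkpoint 30086

  # restart from the first snapshot (this works)
  ompi-restart ompi_global_snapshot_30086.ckpt    # new mpirun PID 30120

  # terminal 2: checkpoint the restarted run
  ompi-checkpoint 30120

  # restart from the second snapshot; this is the step that fails
  ompi-restart ompi_global_snapshot_30120.ckpt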

Any insights would be appreciated.

thanks,
Sharon

Josh Hursey wrote:
> Sharon,
>
> This is, unfortunately, to be expected at the moment for this type of
> application. Extremely communication intensive applications will most
> likely cause the implementation of the current coordination algorithm
> to slow down significantly. This is because on a checkpoint Open MPI
> does a peerwise check on the description of (possibly) each message to
> make sure there are no messages in flight. So for a huge number of
> messages this could take a long time.
>
> This is a performance problem with the current implementation of the
> algorithm that we use in Open MPI. I've been meaning to go back and
> improve this, but it has not been critical to do so since applications
> that perform in this manner are outliers in HPC. The coordination
> algorithm I'm using is based on the algorithm used by LAM/MPI, but
> implemented at a higher level. There are a number of improvements that
> I can explore in the checkpoint/restart framework in Open MPI.
>
> If this is critical for you I might be able to take a look at it, but
> I can't say when. :(
>
> -- Josh
>
> On Apr 29, 2008, at 1:07 PM, Sharon Brunett wrote:
>
>>
>> Josh Hursey wrote:
>>> On Apr 29, 2008, at 12:55 AM, Sharon Brunett wrote:
>>>
>>>> I'm finding that using ompi-checkpoint on an application which is
>>>> very cpu bound takes a very very long time. For example, trying to
>>>> checkpoint a 4 or 8 way Pallas MPI Benchmark application can take
>>>> more than an hour. The problem is not where I'm dumping checkpoints
>>>> (I've tried local and an nfs mount with plenty of space, and cpu
>>>> intensive apps checkpoint quickly).
>>>>
>>>> I'm using BLCR_VERSION=0.6.5 and openmpi-1.3a1r18241.
>>>>
>>>> Is this condition common, and if so, are there possibly mca parameters
>>>> which could help?
>>> It depends on how you configured Open MPI with checkpoint/restart.
>>> There are two modes of operation: No threads, and with a checkpoint
>>> thread. They are described a bit more in the Checkpoint/Restart Fault
>>> Tolerance User's Guide on the wiki:
>>> https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR
>>>
>>> By default we compile without the checkpoint thread. The restriction
>>> here is that all processes must be in the MPI library in order to make
>>> progress on the global checkpoint. For CPU intensive applications
>>> this
>>> may cause quite a delay in the time to start, and subsequently
>>> finish,
>>> a checkpoint. I'm guessing that this is what you are seeing.
>>>
>>> If you configure with the checkpoint thread (add '--enable-mpi-threads
>>> --enable-ft-thread' to ./configure) then Open MPI will create a thread
>>> that runs with each application process. This thread is fairly light
>>> weight and will make sure that a checkpoint progresses even when the
>>> process is not in the Open MPI library.
>>>
>>> Try enabling the checkpoint thread and see if that helps improve the
>>> checkpoint time.
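
(Just to make the relevant flags easy to see: a minimal configure line with
only the fault-tolerance pieces would look roughly like the one below. The
BLCR path is the one from my install and will differ on other systems; my
full configure line, with all its other options, is quoted further down.)

  ./configure --with-ft=cr --with-blcr=/opt/blcr \
              --enable-mpi-threads --enable-ft-thread
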
>> Josh,
>> First...please pardon the blunder in my earlier mail: communication-bound
>> apps are the ones taking a while to checkpoint, not cpu bound ones. In any
>> case, I tried configuring with the above two configure options, but still
>> no luck improving checkpoint times or getting checkpoints of the larger
>> MPI task runs to complete.
>>
>> It looks like the checkpointing is just hanging. For example, I can
>> checkpoint a 2-way comms bound code (1 task on each of two nodes) ok. When
>> I ask for a 4-way run on 2 nodes, 30 minutes after running ompi-checkpoint
>> on the mpirun PID I still only see 1 ckpt directory with data in it!
>>
>>
>> -bash-2.05b$ pwd
>> /home/sharon/ompi_global_snapshot_25400.ckpt/0
>> -bash-2.05b$ ls -l *
>> opal_snapshot_0.ckpt:
>> total 0
>>
>> opal_snapshot_1.ckpt:
>> total 0
>>
>> opal_snapshot_2.ckpt:
>> total 0
>>
>> opal_snapshot_3.ckpt:
>> total 1868
>> -rw------- 1 sharon shc-support 1907476 2008-04-29 10:49 ompi_blcr_context.1850
>> -rw-r--r-- 1 sharon shc-support      33 2008-04-29 10:49 snapshot_meta.data
>>
>> The file system getting the checkpoints is local. I've tried /scratch
>> and others as well.
>>
>> I can checkpoint some codes (like xhpl) just fine across 8 mpi tasks
>> ( t
>> nodes), dumping 254M total. Thus, the very long/stuck checkpointing
>> seems rather application dependent.
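
(Side note: during one of these stuck checkpoints, a simple way to see which
ranks have written anything yet, assuming the snapshot layout shown above, is
just to poll the per-rank directories, e.g.:

  # list each rank's opal_snapshot_N.ckpt directory every 30 seconds;
  # a rank has written (or at least started writing) its local checkpoint
  # once its ompi_blcr_context.<pid> file shows up
  while true; do
      ls -l /home/sharon/ompi_global_snapshot_25400.ckpt/0/opal_snapshot_*.ckpt
      sleep 30
  done

In the 4-way run above, only opal_snapshot_3.ckpt ever gets its context file.)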
>>
>> Here's how I configured openmpi
>>
>> ./configure --prefix=/nfs/ds01/support/sharon/openmpi-1.3a1r18241
>> --enable-mpi-threads --enable-ft-thread --with-ft=cr --enable-shared
>> --enable-mpi-threads=posix --enable-libgcj-multifile
>> --enable-languages=c,c++,objc,java,f95,ada --enable-java-awt=gtk
>> --with-mvapi=/usr/mellanox --with-blcr=/opt/blcr
>>
>>
>>
>> Thanks for any further insights you may have.
>> Sharon