
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2008-04-29 16:31:56


Your previous email indicated that you were using r18241. I committed
in r18276 a patch that should fix this problem. Let me know if you
still see it after that update.

Cheers,
Josh

On Apr 29, 2008, at 3:18 PM, Sharon Brunett wrote:

> Josh,
> I'm also having trouble using ompi-restart on a snapshot made from a
> run which was previously checkpointed. In other words, restarting a
> previously restarted run!
>
> (a) start the run
> mpirun -np 16 -am ft-enable-cr ./a.out
>
> <---do an ompi-checkpoint on the mpirun pid from (a) from another
> terminal--->
>
> (b) restart the checkpointed run
>
> ompi-restart ompi_global_snapshot_30086.ckpt
>
> <--do an ompi-checkpoint on mpirun pid from (b) from another
> terminal-->
>
> (c) restart the checkpointed run
> ompi-restart ompi_global_snapshot_30120.ckpt
>
> --------------------------------------------------------------------------
> mpirun noticed that process rank 12 with PID 30480 on node shc005 exited
> on signal 13 (Broken pipe).
> --------------------------------------------------------------------------
> -bash-2.05b$
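>
> (The checkpoint command in each case is of the form "ompi-checkpoint
> <mpirun PID>"; the global snapshot directories above are named after
> those PIDs, so (a) used mpirun PID 30086 and (b) used 30120.)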
>
> I can restart the previous (30086) ckpt but not the latest one, made
> from a restarted run.
>
> Any insights would be appreciated.
>
> thanks,
> Sharon
>
>
>
> Josh Hursey wrote:
>> Sharon,
>>
>> This is, unfortunately, to be expected at the moment for this type of
>> application. Extremely communication-intensive applications will most
>> likely cause the current implementation of the coordination algorithm
>> to slow down significantly. This is because, on a checkpoint, Open MPI
>> does a pair-wise check on the description of (possibly) every message
>> to make sure there are no messages in flight. So for a huge number of
>> messages this can take a long time.
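>>
>> As a very rough sketch (illustrative only -- the counters and the
>> drain_one_message() helper below are made up, and this is not the
>> actual Open MPI source), the per-process coordination behaves
>> something like this:
>>
>>   #include <mpi.h>
>>
>>   /* Hypothetical sketch of a LAM/MPI-style bookmark exchange.
>>    * sent_to[]/recvd_from[] stand in for per-peer message counters
>>    * the library maintains; drain_one_message() stands in for the
>>    * real progress engine. */
>>   void drain_one_message(int peer, int *recvd_from);
>>
>>   void quiesce_channels(int nprocs, int rank,
>>                         int *sent_to, int *recvd_from)
>>   {
>>       for (int peer = 0; peer < nprocs; peer++) {
>>           if (peer == rank)
>>               continue;
>>           int peer_sent = 0;
>>           /* Swap bookmarks: how many messages each side has sent. */
>>           MPI_Sendrecv(&sent_to[peer], 1, MPI_INT, peer, 0,
>>                        &peer_sent, 1, MPI_INT, peer, 0,
>>                        MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>           /* Drain until everything the peer put on the wire has
>>            * arrived; with many messages in flight, this is the
>>            * expensive part. */
>>           while (recvd_from[peer] < peer_sent)
>>               drain_one_message(peer, recvd_from);
>>       }
>>       /* All channels quiet: safe to take the local checkpoint. */
>>   }
>>
>> With every pair of peers doing this for every outstanding message, a
>> communication-heavy run has a lot to drain before the checkpoint can
>> proceed.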
>>
>> This is a performance problem with the current implementation of the
>> algorithm that we use in Open MPI. I've been meaning to go back and
>> improve this, but it has not been critical to do so, since applications
>> that perform in this manner are outliers in HPC. The coordination
>> algorithm I'm using is based on the one used by LAM/MPI, but
>> implemented at a higher level. There are a number of improvements that
>> I can explore in the checkpoint/restart framework in Open MPI.
>>
>> If this is critical for you I might be able to take a look at it, but
>> I can't say when. :(
>>
>> -- Josh
>>
>> On Apr 29, 2008, at 1:07 PM, Sharon Brunett wrote:
>>
>>>
>>> Josh Hursey wrote:
>>>> On Apr 29, 2008, at 12:55 AM, Sharon Brunett wrote:
>>>>
>>>>> I'm finding that using ompi-checkpoint on an application which is
>>>>> very cpu bound takes a very very long time. For example, trying to
>>>>> checkpoint a 4 or 8 way Pallas MPI Benchmark application can take
>>>>> more than an hour. The problem is not where I'm dumping checkpoints
>>>>> (I've tried local and an nfs mount with plenty of space, and cpu
>>>>> intensive apps checkpoint quickly).
>>>>>
>>>>> I'm using BLCR_VERSION=0.6.5 and openmpi-1.3a1r18241.
>>>>>
>>>>> Is this condition common, and if so, are there possibly MCA
>>>>> parameters which could help?
>>>> It depends on how you configured Open MPI with checkpoint/restart.
>>>> There are two modes of operation: without threads, and with a
>>>> checkpoint thread. They are described in a bit more detail in the
>>>> Checkpoint/Restart Fault Tolerance User's Guide on the wiki:
>>>> https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR
>>>>
>>>> By default we compile without the checkpoint thread. The restriction
>>>> here is that all processes must be in the MPI library in order to
>>>> make progress on the global checkpoint. For CPU-intensive
>>>> applications this may cause quite a delay in the time to start, and
>>>> subsequently finish, a checkpoint. I'm guessing that this is what you
>>>> are seeing.
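>>>>
>>>> As a made-up illustration (not from the docs), a rank stuck in a
>>>> long compute phase like the one below cannot help a pending
>>>> checkpoint along until it re-enters the library:
>>>>
>>>>   #include <mpi.h>
>>>>
>>>>   int main(int argc, char **argv)
>>>>   {
>>>>       MPI_Init(&argc, &argv);
>>>>       double x = 0.0;
>>>>       /* Pure computation, no MPI calls: in no-thread mode, no
>>>>        * progress is made on a pending global checkpoint here. */
>>>>       for (long i = 1; i <= 2000000000L; i++)
>>>>           x += 1.0 / (double)i;
>>>>       /* Only once the rank is back inside the MPI library can
>>>>        * the checkpoint start/finish for this process. */
>>>>       MPI_Allreduce(MPI_IN_PLACE, &x, 1, MPI_DOUBLE, MPI_SUM,
>>>>                     MPI_COMM_WORLD);
>>>>       MPI_Finalize();
>>>>       return 0;
>>>>   }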
>>>>
>>>> If you configure with the checkpoint thread (add '--enable-mpi-threads
>>>> --enable-ft-thread' to ./configure) then Open MPI will create a
>>>> thread that runs with each application process. This thread is fairly
>>>> lightweight and will make sure that a checkpoint progresses even when
>>>> the process is not in the Open MPI library.
>>>>
>>>> Try enabling the checkpoint thread and see if that helps improve the
>>>> checkpoint time.
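>>>>
>>>> (To sanity-check the build, assuming it was configured with
>>>> --with-ft=cr and --with-blcr: something like 'ompi_info | grep crs'
>>>> should list a blcr component among the MCA components.)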
>>> Josh,
>>> First...please pardon the blunder in my earlier mail. Comms-bound apps
>>> are the ones taking a while to checkpoint, not cpu-bound ones. In any
>>> case, I tried configuring with the above two configure options, but
>>> still no luck on improving checkpointing times or getting checkpoints
>>> of larger MPI task runs to complete.
>>>
>>> It looks like the checkpointing is just hanging. For example, I can
>>> checkpoint a 2-way comms-bound code (1 task on each of two nodes) OK.
>>> When I ask for a 4-way run on 2 nodes, 30 minutes after issuing
>>> ompi-checkpoint on the mpirun PID I only see 1 ckpt directory with
>>> data in it!
>>>
>>>
>>> -bash-2.05b$ pwd
>>> /home/sharon/ompi_global_snapshot_25400.ckpt/0
>>> -bash-2.05b$ ls -l *
>>> opal_snapshot_0.ckpt:
>>> total 0
>>>
>>> opal_snapshot_1.ckpt:
>>> total 0
>>>
>>> opal_snapshot_2.ckpt:
>>> total 0
>>>
>>> opal_snapshot_3.ckpt:
>>> total 1868
>>> -rw------- 1 sharon shc-support 1907476 2008-04-29 10:49 ompi_blcr_context.1850
>>> -rw-r--r-- 1 sharon shc-support      33 2008-04-29 10:49 snapshot_meta.data
>>>
>>> The file system receiving the checkpoints is local. I've tried
>>> /scratch and others as well.
>>>
>>> I can checkpoint some codes (like xhpl) just fine across 8 MPI tasks
>>> ( t nodes), dumping 254M total. Thus, the very long/stuck
>>> checkpointing seems rather application-dependent.
>>>
>>> Here's how I configured Open MPI:
>>>
>>> ./configure --prefix=/nfs/ds01/support/sharon/openmpi-1.3a1r18241
>>> --enable-mpi-threads --enable-ft-thread --with-ft=cr --enable-shared
>>> --enable-mpi-threads=posix --enable-libgcj-multifile
>>> --enable-languages=c,c++,objc,java,f95,ada --enable-java-awt=gtk
>>> --with-mvapi=/usr/mellanox --with-blcr=/opt/blcr
>>>
>>>
>>>
>>> Thanks for any further insights you may have.
>>> Sharon
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users