
Subject: Re: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue
From: Sharon Brunett (sharon_at_[hidden])
Date: 2008-04-29 16:37:39


Thanks, I'll try the version you recommend below!

Josh Hursey wrote:
> Your previous email indicated that you were using r18241. I committed
> in r18276 a patch that should fix this problem. Let me know if you
> still see it after that update.
>
> Cheers,
> Josh
>
> On Apr 29, 2008, at 3:18 PM, Sharon Brunett wrote:
>
>> Josh,
>> I'm also having trouble using ompi-restart on a snapshot taken from a
>> run which was itself restarted from an earlier checkpoint. In other
>> words, restarting a previously restarted run!
>>
>> (a) start the run
>> mpirun -np 16 -am ft-enable-cr ./a.out
>>
>> <---do an ompi-checkpoint on the mpirun pid from (a) from another
>> terminal--->>
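>>
>> For reference, the checkpoint in that other terminal is roughly the
>> following, where 30086 is the PID of the mpirun started in (a); I
>> believe that PID is also what ends up in the snapshot name used in (b):
>>
>> ompi-checkpoint 30086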
>>
>> (b) restart the checkpointed run
>>
>> ompi-restart ompi_global_snapshot_30086.ckpt
>>
>> <--do an ompi-checkpoint on mpirun pid from (b) from another
>> terminal---->>
>>
>> (c) restart the checkpointed run
>> ompi-restart ompi_global_snapshot_30120.ckpt
>>
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 12 with PID 30480 on node shc005
>> exited
>> on signal 13 (Broken pipe).
>> --------------------------------------------------------------------------
>> -bash-2.05b$
>>
>> I can restart the previous (30086) ckpt but not the latest one made
>> from a restarted run.
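>>
>> One thing I can check, assuming the failing snapshot has the same
>> layout as the listing in my earlier mail below, is whether every
>> per-rank directory in it actually got a BLCR context file written:
>>
>> ls -l /home/sharon/ompi_global_snapshot_30120.ckpt/0/opal_snapshot_*.ckpt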
>>
>> Any insights would be appreciated.
>>
>> thanks,
>> Sharon
>>
>>
>>
>> Josh Hursey wrote:
>>> Sharon,
>>>
>>> This is, unfortunately, to be expected at the moment for this type of
>>> application. Extremely communication intensive applications will most
>>> likely cause the implementation of the current coordination algorithm
>>> to slow down significantly. This is because, on a checkpoint, Open MPI
>>> does a peer-wise check of the description of (possibly) every message
>>> to make sure there are no messages still in flight. For a huge number
>>> of messages this can take a long time.
>>>
>>> This is a performance problem with the current implementation of the
>>> algorithm that we use in Open MPI. I've been meaning to go back and
>>> improve this, but it has not been critical to do so since applications
>>> that perform in this manner are outliers in HPC. The coordination
>>> algorithm I'm using is based on the algorithm used by LAM/MPI, but
>>> implemented at a higher level. There are a number of improvements that
>>> I can explore in the checkpoint/restart framework in Open MPI.
>>>
>>> If this is critical for you I might be able to take a look at it, but
>>> I can't say when. :(
>>>
>>> -- Josh
>>>
>>> On Apr 29, 2008, at 1:07 PM, Sharon Brunett wrote:
>>>
>>>> Josh Hursey wrote:
>>>>> On Apr 29, 2008, at 12:55 AM, Sharon Brunett wrote:
>>>>>
>>>>>> I'm finding that using ompi-checkpoint on an application which is
>>>>>> very cpu bound takes a very, very long time. For example, trying to
>>>>>> checkpoint a 4 or 8 way Pallas MPI Benchmark application can take
>>>>>> more than an hour. The problem is not where I'm dumping checkpoints
>>>>>> (I've tried local and an nfs mount with plenty of space, and cpu
>>>>>> intensive apps checkpoint quickly).
>>>>>>
>>>>>> I'm using BLCR_VERSION=0.6.5 and openmpi-1.3a1r18241.
>>>>>>
>>>>>> Is this condition common, and if so, are there possibly MCA
>>>>>> parameters which could help?
>>>>> It depends on how you configured Open MPI with checkpoint/restart.
>>>>> There are two modes of operation: no threads, and with a checkpoint
>>>>> thread. They are described a bit more in the Checkpoint/Restart Fault
>>>>> Tolerance User's Guide on the wiki:
>>>>> https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR
>>>>>
>>>>> By default we compile without the checkpoint thread. The restriction
>>>>> here is that all processes must be in the MPI library in order to make
>>>>> progress on the global checkpoint. For CPU intensive applications this
>>>>> may cause quite a delay in the time to start, and subsequently finish,
>>>>> a checkpoint. I'm guessing that this is what you are seeing.
>>>>>
>>>>> If you configure with the checkpoint thread (add '--enable-mpi-threads
>>>>> --enable-ft-thread' to ./configure) then Open MPI will create a thread
>>>>> that runs with each application process. This thread is fairly
>>>>> lightweight and will make sure that a checkpoint progresses even when
>>>>> the process is not in the Open MPI library.
>>>>>
>>>>> Try enabling the checkpoint thread and see if that helps improve the
>>>>> checkpoint time.
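>>>>>
>>>>> As a quick sanity check after rebuilding, something along these lines
>>>>> should show whether the thread support and the BLCR checkpointer
>>>>> component actually made it into the new install (the exact output
>>>>> format may vary):
>>>>>
>>>>> ompi_info | grep -i thread
>>>>> ompi_info | grep -i crs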
>>>> Josh,
>>>> First...please pardon the blunder in my earlier mail. Comms-bound apps
>>>> are the ones taking a while to checkpoint, not cpu-bound ones. In any
>>>> case, I tried configuring with the above two configure options, but
>>>> still no luck improving checkpointing times or getting checkpoints of
>>>> larger MPI task runs to complete.
>>>>
>>>> It looks like the checkpointing is just hanging. For example, I can
>>>> checkpoint a 2 way comms-bound code (1 task on each of two nodes) ok.
>>>> When I ask for a 4 way run on 2 nodes, 30 minutes after running
>>>> ompi-checkpoint on the mpirun PID I only see 1 ckpt directory with
>>>> data in it!
>>>>
>>>>
>>>> /home/sharon/ompi_global_snapshot_25400.ckpt/0
>>>> -bash-2.05b$ ls -l *
>>>> opal_snapshot_0.ckpt:
>>>> total 0
>>>>
>>>> opal_snapshot_1.ckpt:
>>>> total 0
>>>>
>>>> opal_snapshot_2.ckpt:
>>>> total 0
>>>>
>>>> opal_snapshot_3.ckpt:
>>>> total 1868
>>>> -rw------- 1 sharon shc-support 1907476 2008-04-29 10:49
>>>> ompi_blcr_context.1850
>>>> -rw-r--r-- 1 sharon shc-support 33 2008-04-29 10:49
>>>> snapshot_meta.data
>>>> -bash-2.05b$ pwd
>>>>
>>>>
>>>> The file system getting the checkpoints is local. I've tried /scratch
>>>> and others as well.
>>>>
>>>> I can checkpoint some codes (like xhpl) just fine across 8 mpi tasks
>>>> ( t nodes), dumping 254M total. Thus, the very long/stuck checkpointing
>>>> seems rather application dependent.
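>>>>
>>>> In case it helps narrow down where the checkpoint stalls, I can rerun
>>>> with some extra verbosity; I'm assuming the usual per-framework
>>>> verbose MCA parameters apply to the checkpoint/restart frameworks too:
>>>>
>>>> mpirun -np 4 -am ft-enable-cr -mca crs_base_verbose 10 \
>>>> -mca snapc_base_verbose 10 ./a.out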
>>>>
>>>> Here's how I configured openmpi
>>>>
>>>> ./configure --prefix=/nfs/ds01/support/sharon/openmpi-1.3a1r18241
>>>> --enable-mpi-threads --enable-ft-thread --with-ft=cr --enable-shared
>>>> --enable-mpi-threads=posix --enable-libgcj-multifile
>>>> --enable-languages=c,c++,objc,java,f95,ada --enable-java-awt=gtk
>>>> --with-mvapi=/usr/mellanox --with-blcr=/opt/blcr
>>>>
>>>>
>>>>
>>>> Thanks for any further insights you may have.
>>>> Sharon