
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2008-05-14 15:55:13


I just pushed in some new timing code for the CRCP Coord component in
r18439.
  https://svn.open-mpi.org/trac/ompi/changeset/18439

This should allow you to see the checkpoint progress through the
coordination protocol, and provide some rough timing on the different
parts of the algorithm.

To activate it, add the MCA parameter "-mca crcp_coord_timing 2" to
your mpirun command line.
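
For example (a sketch; the process count and executable are placeholders,
and "-am ft-enable-cr" matches the runs shown further down the thread):

  mpirun -np 4 -am ft-enable-cr -mca crcp_coord_timing 2 ./a.out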

No algorithmic changes were made, so this will not fix the problem;
it will just give you some perspective on the protocol's activity.

-- Josh

On May 14, 2008, at 1:11 PM, Josh Hursey wrote:

> Tamer,
>
> How much communication does your application tend to do? As reported
> below, if there is a lot of communication between checkpoints, then it
> may take a while to checkpoint the application, since the current
> implementation of the coordination algorithm checks every message at
> checkpoint time. So what you are seeing might be that the checkpoint
> is taking an extremely long time to clear the channel.
>
> I have a few things in the works that attempt to fix this problem.
> They are not ready just yet, but I'll make it known when they are. You
> can get some diagnostics by setting "-mca crcp_coord_verbose 10" on
> the command line, but it is fairly coarse-grained at the moment (I have
> some improvements in the pipeline here as well).
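>
> For example (a sketch with a placeholder job size and executable):
>
>   mpirun -np 12 -am ft-enable-cr -mca crcp_coord_verbose 10 ./a.out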
>
> Cheers,
> Josh
>
> On May 13, 2008, at 3:42 PM, Tamer wrote:
>
>> Hi Josh: I am currently using openmpi r18291. When I run a 12-task
>> job on 3 quad-core nodes, I am able to checkpoint and restart
>> several times at the beginning of the run; however, after a few
>> hours, when I try to checkpoint, the code just hangs: it
>> won't checkpoint and won't give me an error message. Has this
>> problem been reported before? All the required executables and
>> libraries are in my path.
>>
>> Thanks,
>> Tamer
>>
>>
>> On Apr 29, 2008, at 1:37 PM, Sharon Brunett wrote:
>>
>>> Thanks, I'll try the version you recommend below!
>>>
>>> Josh Hursey wrote:
>>>> Your previous email indicated that you were using r18241. I
>>>> committed
>>>> in r18276 a patch that should fix this problem. Let me know if you
>>>> still see it after that update.
>>>>
>>>> Cheers,
>>>> Josh
>>>>
>>>> On Apr 29, 2008, at 3:18 PM, Sharon Brunett wrote:
>>>>
>>>>> Josh,
>>>>> I'm also having trouble using ompi-restart on a snapshot made
>>>>> from a
>>>>> run
>>>>> which was previously checkpointed. In other words, restarting a
>>>>> previously restarted run!
>>>>>
>>>>> (a) start the run
>>>>> mpirun -np 16 -am ft-enable-cr ./a.out
>>>>>
>>>>> <---do an ompi-checkpoint on the mpirun pid from (a) from another
>>>>> terminal--->>
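>>>>>
>>>>> e.g., in the other terminal (a sketch; substitute the real mpirun PID):
>>>>>
>>>>>   ompi-checkpoint <PID of the mpirun started in (a)>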
>>>>>
>>>>> (b) restart the checkpointed run
>>>>>
>>>>> ompi-restart ompi_global_snapshot_30086.ckpt
>>>>>
>>>>> <--do an ompi-checkpoint on mpirun pid from (b) from another
>>>>> terminal---->>
>>>>>
>>>>> (c) restart the checkpointed run
>>>>> ompi-restart ompi_global_snapshot_30120.ckpt
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that process rank 12 with PID 30480 on node shc005
>>>>> exited
>>>>> on signal 13 (Broken pipe).
>>>>> --------------------------------------------------------------------------
>>>>> -bash-2.05b$
>>>>>
>>>>> I can restart the previous (30086) ckpt but not the latest one
>>>>> made
>>>>> from
>>>>> a restarted run.
>>>>>
>>>>> Any insights would be appreciated.
>>>>>
>>>>> thanks,
>>>>> Sharon
>>>>>
>>>>>
>>>>>
>>>>> Josh Hursey wrote:
>>>>>> Sharon,
>>>>>>
>>>>>> This is, unfortunately, to be expected at the moment for this
>>>>>> type of
>>>>>> application. Extremely communication-intensive applications will
>>>>>> most likely cause the current implementation of the coordination
>>>>>> algorithm to slow down significantly. This is because on a
>>>>>> checkpoint Open
>>>>>> MPI
>>>>>> does a peerwise check on the description of (possibly) each
>>>>>> message
>>>>>> to
>>>>>> make sure there are no messages in flight. So for a huge number
>>>>>> of
>>>>>> messages this could take a long time.
>>>>>>
>>>>>> This is a performance problem with the current implementation of
>>>>>> the
>>>>>> algorithm that we use in Open MPI. I've been meaning to go back
>>>>>> and
>>>>>> improve this, but it has not been critical to do so since
>>>>>> applications
>>>>>> that perform in this manner are outliers in HPC. The coordination
>>>>>> algorithm I'm using is based on the algorithm used by LAM/MPI,
>>>>>> but
>>>>>> implemented at a higher level. There are a number of improvements
>>>>>> that
>>>>>> I can explore in the checkpoint/restart framework in Open MPI.
>>>>>>
>>>>>> If this is critical for you I might be able to take a look at
>>>>>> it, but
>>>>>> I can't say when. :(
>>>>>>
>>>>>> -- Josh
>>>>>>
>>>>>> On Apr 29, 2008, at 1:07 PM, Sharon Brunett wrote:
>>>>>>
>>>>>>> Josh Hursey wrote:
>>>>>>>> On Apr 29, 2008, at 12:55 AM, Sharon Brunett wrote:
>>>>>>>>
>>>>>>>>> I'm finding that using ompi-checkpoint on an application
>>>>>>>>> which is
>>>>>>>>> very cpu bound takes a very very long time. For example,
>>>>>>>>> trying to
>>>>>>>>> checkpoint a 4 or 8 way Pallas MPI Benchmark application can
>>>>>>>>> take
>>>>>>>>> more than an hour. The problem is not where I'm dumping
>>>>>>>>> checkpoints
>>>>>>>>> (I've tried local and an nfs mount with plenty of space, and
>>>>>>>>> cpu
>>>>>>>>> intensive apps checkpoint quickly).
>>>>>>>>>
>>>>>>>>> I'm using BLCR_VERSION=0.6.5 and openmpi-1.3a1r18241.
>>>>>>>>>
>>>>>>>>> Is this condition common and if so, are there possibly MCA
>>>>>>>>> parameters
>>>>>>>>> which could help?
>>>>>>>> It depends on how you configured Open MPI with checkpoint/
>>>>>>>> restart.
>>>>>>>> There are two modes of operation: No threads, and with a
>>>>>>>> checkpoint
>>>>>>>> thread. They are described a bit more in the Checkpoint/Restart
>>>>>>>> Fault
>>>>>>>> Tolerance User's Guide on the wiki:
>>>>>>>> https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR
>>>>>>>>
>>>>>>>> By default we compile without the checkpoint thread. The
>>>>>>>> restriction
>>>>>>>> here is that all processes must be in the MPI library in order
>>>>>>>> to make progress on the global checkpoint. For CPU-intensive
>>>>>>>> applications
>>>>>>>> this
>>>>>>>> may cause quite a delay in the time to start, and subsequently
>>>>>>>> finish,
>>>>>>>> a checkpoint. I'm guessing that this is what you are seeing.
>>>>>>>>
>>>>>>>> If you configure with the checkpoint thread (add
>>>>>>>> '--enable-mpi-threads --enable-ft-thread' to ./configure) then
>>>>>>>> Open MPI will create a
>>>>>>>> thread
>>>>>>>> that runs with each application process. This thread is fairly
>>>>>>>> lightweight and will make sure that a checkpoint progresses even
>>>>>>>> when
>>>>>>>> the
>>>>>>>> process is not in the Open MPI library.
>>>>>>>>
>>>>>>>> Try enabling the checkpoint thread and see if that helps
>>>>>>>> improve
>>>>>>>> the
>>>>>>>> checkpoint time.
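>>>>>>>>
>>>>>>>> For example, a minimal configure sketch (install prefix, BLCR path,
>>>>>>>> and any other options are site-specific):
>>>>>>>>
>>>>>>>>   ./configure --with-ft=cr --with-blcr=/opt/blcr \
>>>>>>>>     --enable-mpi-threads --enable-ft-thread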
>>>>>>> Josh,
>>>>>>> First...please pardon the blunder in my earlier mail.
>>>>>>> Comms-bound apps are the ones taking a while to checkpoint, not
>>>>>>> CPU-bound ones. In any case, I tried configuring with the above
>>>>>>> two configure options, but still no luck on improving
>>>>>>> checkpointing times or getting checkpoints of larger MPI task
>>>>>>> runs to complete.
>>>>>>>
>>>>>>> It looks like the checkpointing is just hanging. For example, I
>>>>>>> can checkpoint a 2-way comms-bound code (1 task on each of two
>>>>>>> nodes) OK. When I ask for a 4-way run on 2 nodes, 30 minutes
>>>>>>> after running ompi-checkpoint on the mpirun PID I only see 1
>>>>>>> ckpt directory with data in it!
>>>>>>>
>>>>>>>
>>>>>>> /home/sharon/ompi_global_snapshot_25400.ckpt/0
>>>>>>> -bash-2.05b$ ls -l *
>>>>>>> opal_snapshot_0.ckpt:
>>>>>>> total 0
>>>>>>>
>>>>>>> opal_snapshot_1.ckpt:
>>>>>>> total 0
>>>>>>>
>>>>>>> opal_snapshot_2.ckpt:
>>>>>>> total 0
>>>>>>>
>>>>>>> opal_snapshot_3.ckpt:
>>>>>>> total 1868
>>>>>>> -rw------- 1 sharon shc-support 1907476 2008-04-29 10:49
>>>>>>> ompi_blcr_context.1850
>>>>>>> -rw-r--r-- 1 sharon shc-support 33 2008-04-29 10:49
>>>>>>> snapshot_meta.data
>>>>>>> -bash-2.05b$ pwd
>>>>>>>
>>>>>>>
>>>>>>> The file system getting the checkpoints is local. I've tried
>>>>>>> /scratch and others as well.
>>>>>>>
>>>>>>> I can checkpoint some codes (like xhpl) just fine across 8 MPI
>>>>>>> tasks (t nodes), dumping 254M total. Thus, the very long/stuck
>>>>>>> checkpointing seems rather application-dependent.
>>>>>>>
>>>>>>> Here's how I configured Open MPI:
>>>>>>>
>>>>>>> ./configure --prefix=/nfs/ds01/support/sharon/openmpi-1.3a1r18241
>>>>>>> --enable-mpi-threads --enable-ft-thread --with-ft=cr
>>>>>>> --enable-shared --enable-mpi-threads=posix --enable-libgcj-multifile
>>>>>>> --enable-languages=c,c++,objc,java,f95,ada --enable-java-awt=gtk
>>>>>>> --with-mvapi=/usr/mellanox --with-blcr=/opt/blcr
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks for any further insights you may have.
>>>>>>> Sharon