
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2008-04-29 15:41:09


Sharon,

This is, unfortunately, to be expected at the moment for this type of
application. Extremely communication-intensive applications will most
likely cause the current implementation of the coordination algorithm
to slow down significantly. This is because, at checkpoint time, Open
MPI does a pairwise check on the description of (possibly) every
message to make sure there are no messages left in flight, so for a
huge number of messages this can take a long time.

This is a performance problem with the current implementation of the
algorithm that we use in Open MPI. I've been meaning to go back and
improve this, but it has not been critical to do so since applications
that perform in this manner are outliers in HPC. The coordination
algorithm I'm using is based on the algorithm used by LAM/MPI, but
implemented at a higher level. There are a number of improvements that
I can explore in the checkpoint/restart framework in Open MPI.
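
Very roughly, that coordination looks something like the sketch below.
To be clear, this is only a conceptual illustration written against
plain MPI, not the code Open MPI actually uses; the sent_to/recvd_from
bookkeeping arrays and the toy main() are invented for the example.

/*
 * Conceptual sketch only: NOT the actual coordination code in Open
 * MPI's checkpoint/restart framework.  It illustrates why a
 * coordinated checkpoint of a message-heavy run is expensive: before
 * the local checkpoint can be taken, every process has to agree with
 * every peer on what is still in flight and drain it.  The
 * sent_to[]/recvd_from[] bookkeeping arrays are invented for the
 * example, and the sketch assumes application messages are plain
 * byte buffers.
 *
 * Build/run (any MPI): mpicc drain_sketch.c -o drain_sketch
 *                      mpirun -np 4 ./drain_sketch
 */
#include <mpi.h>
#include <stdlib.h>

static void drain_before_checkpoint(int *sent_to, int *recvd_from,
                                    MPI_Comm comm)
{
    int nprocs, me;
    MPI_Comm_size(comm, &nprocs);
    MPI_Comm_rank(comm, &me);

    /* Step 1: every process tells every peer how many messages it has
     * sent to that peer (a "bookmark" exchange, in LAM/MPI terms). */
    int *peer_sent_to_me = malloc(nprocs * sizeof(int));
    MPI_Alltoall(sent_to, 1, MPI_INT, peer_sent_to_me, 1, MPI_INT, comm);

    /* Step 2: drain the network until everything a peer claims to have
     * sent has actually arrived here.  This loop runs once per
     * outstanding message, which is what makes the checkpoint slow for
     * communication-intensive applications. */
    for (int peer = 0; peer < nprocs; peer++) {
        while (recvd_from[peer] < peer_sent_to_me[peer]) {
            MPI_Status st;
            int nbytes;
            MPI_Probe(peer, MPI_ANY_TAG, comm, &st);
            MPI_Get_count(&st, MPI_BYTE, &nbytes);
            /* A real implementation would stash the payload so it can
             * be replayed to the application after a restart; the
             * sketch just discards it. */
            char *buf = malloc(nbytes > 0 ? nbytes : 1);
            MPI_Recv(buf, nbytes, MPI_BYTE, peer, st.MPI_TAG, comm, &st);
            free(buf);
            recvd_from[peer]++;
        }
    }
    free(peer_sent_to_me);

    /* Step 3: this process's incoming channels are quiescent.  Once
     * everyone gets here it would be safe to take the local checkpoint
     * (e.g. via BLCR) and then resume normal communication. */
    MPI_Barrier(comm);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int nprocs, me;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);

    int *sent_to    = calloc(nprocs, sizeof(int));
    int *recvd_from = calloc(nprocs, sizeof(int));

    /* Pretend to be a communication-heavy application: fire a few
     * small messages at the next rank without the receiver picking
     * them up, so something is "in flight" when the checkpoint starts. */
    char payload[64] = {0};
    int next = (me + 1) % nprocs;
    for (int i = 0; i < 5; i++) {
        MPI_Send(payload, (int)sizeof(payload), MPI_BYTE, next, 0,
                 MPI_COMM_WORLD);
        sent_to[next]++;
    }

    drain_before_checkpoint(sent_to, recvd_from, MPI_COMM_WORLD);
    /* ... the local BLCR checkpoint would be requested here ... */

    free(sent_to);
    free(recvd_from);
    MPI_Finalize();
    return 0;
}

The inner drain loop is the part that scales with the number of
messages still in flight, which is why a communication-heavy run like
the Pallas benchmarks pays such a large coordination cost.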

If this is critical for you I might be able to take a look at it, but
I can't say when. :(

-- Josh

On Apr 29, 2008, at 1:07 PM, Sharon Brunett wrote:

>
>
> Josh Hursey wrote:
>> On Apr 29, 2008, at 12:55 AM, Sharon Brunett wrote:
>>
>>> I'm finding that using ompi-checkpoint on an application which is
>>> very cpu bound takes a very very long time. For example, trying to
>>> checkpoint a 4 or 8 way Pallas MPI Benchmark application can take
>>> more than an hour. The problem is not where I'm dumping checkpoints
>>> (I've tried local and an nfs mount with plenty of space, and cpu
>>> intensive apps checkpoint quickly).
>>>
>>> I'm using BLCR_VERSION=0.6.5 and openmpi-1.3a1r18241.
>>>
>>> Is this condition common and if so, are there possibly MCA parameters
>>> which could help?
>>
>> It depends on how you configured Open MPI with checkpoint/restart.
>> There are two modes of operation: No threads, and with a checkpoint
>> thread. They are described a bit more in the Checkpoint/Restart Fault
>> Tolerance User's Guide on the wiki:
>> https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR
>>
>> By default we compile without the checkpoint thread. The restriction
>> here is that all processes must be in the MPI library in order to make
>> progress on the global checkpoint. For CPU intensive applications
>> this
>> may cause quite a delay in the time to start, and subsequently
>> finish,
>> a checkpoint. I'm guessing that this is what you are seeing.
>>
>> If you configure with the checkpoint thread (add '--enable-mpi-threads
>> --enable-ft-thread' to ./configure) then Open MPI will create a
>> thread
>> that runs with each application process. This thread is fairly
>> lightweight and will make sure that a checkpoint progresses even when the
>> process is not in the Open MPI library.
>>
>> Try enabling the checkpoint thread and see if that helps improve the
>> checkpoint time.
>
> Josh,
> First...please pardon the blunder in my earlier mail. Communication-
> bound apps are the ones taking a while to checkpoint, not CPU-bound
> ones. In any case, I tried configuring with the above two configure
> options, but still no luck improving checkpoint times or getting
> checkpoints of runs with more MPI tasks to complete.
>
> It looks like the checkpointing is just hanging. For example, I can
> checkpoint a 2-way comms-bound code (1 task on each of 2 nodes) ok.
> When I ask for a 4-way run on 2 nodes, 30 minutes after running
> ompi-checkpoint on the mpirun PID I still see only 1 ckpt directory
> with data in it!
>
>
> /home/sharon/ompi_global_snapshot_25400.ckpt/0
> -bash-2.05b$ ls -l *
> opal_snapshot_0.ckpt:
> total 0
>
> opal_snapshot_1.ckpt:
> total 0
>
> opal_snapshot_2.ckpt:
> total 0
>
> opal_snapshot_3.ckpt:
> total 1868
> -rw------- 1 sharon shc-support 1907476 2008-04-29 10:49 ompi_blcr_context.1850
> -rw-r--r-- 1 sharon shc-support      33 2008-04-29 10:49 snapshot_meta.data
> -bash-2.05b$ pwd
>
>
> The file system getting the checkpoints is local. I've tried /scratch
> and others as well.
>
> I can checkpoint some codes (like xhpl) just fine across 8 mpi tasks
> ( t
> nodes), dumping 254M total. Thus, the very long/stuck checkpointing
> seems rather application dependent.
>
> Here's how I configured openmpi
>
> ./configure --prefix=/nfs/ds01/support/sharon/openmpi-1.3a1r18241
> --enable-mpi-threads --enable-ft-thread --with-ft=cr --enable-shared
> --enable-mpi-threads=posix --enable-libgcj-multifile
> --enable-languages=c,c++,objc,java,f95,ada --enable-java-awt=gtk
> --with-mvapi=/usr/mellanox --with-blcr=/opt/blcr
>
>
>
> Thanks for any further insights you may have.
> Sharon
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users