Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] checkpointing multi node and multi process applications
From: Joshua Hursey (jjhursey_at_[hidden])
Date: 2010-03-04 10:57:04


On Mar 4, 2010, at 8:17 AM, Fernando Lemos wrote:

> On Wed, Mar 3, 2010 at 10:24 PM, Fernando Lemos <fernandotcl_at_[hidden]> wrote:
> <snip>
>> Is there anything I can do to provide more information about this bug?
>> E.g. try to compile the code in the SVN trunk? I also have kept the
>> snapshots intact, I can tar them up and upload them somewhere in case
>> you guys need it. I can also provide the source code to the ring
>> program, but it's really the canonical ring MPI example.
>>
>
> I tried 1.5 (1.5a1r22754 nightly snapshot, same compilation flags).
> This time taking the checkpoint didn't generate any error message:
>
> root_at_debian1:~# mpirun -am ft-enable-cr -mca btl_tcp_if_include eth1
> -np 2 --host debian1,debian2 ring
> <snip>
>>>> Process 1 sending 2761 to 0
>>>> Process 1 received 2760
>>>> Process 1 sending 2760 to 0
> root_at_debian1:~#
>
> But restoring it did:
>
> root_at_debian1:~# ompi-restart ompi_global_snapshot_23071.ckpt
> [debian1:23129] Error: Unable to access the path
> [/root/ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt]!
> --------------------------------------------------------------------------
> Error: The filename (opal_snapshot_1.ckpt) is invalid because either
> you have not provided a filename
> or provided an invalid filename.
> Please see --help for usage.
>
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 1 with PID 23129 on
> node debian1 exiting improperly. There are two reasons this could occur:
>
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> root_at_debian1:~#
>
> Indeed, opal_snapshot_1.ckpt does not exist:
>
> root_at_debian1:~# find ompi_global_snapshot_23071.ckpt/
> ompi_global_snapshot_23071.ckpt/
> ompi_global_snapshot_23071.ckpt/global_snapshot_meta.data
> ompi_global_snapshot_23071.ckpt/restart-appfile
> ompi_global_snapshot_23071.ckpt/0
> ompi_global_snapshot_23071.ckpt/0/opal_snapshot_0.ckpt
> ompi_global_snapshot_23071.ckpt/0/opal_snapshot_0.ckpt/ompi_blcr_context.23073
> ompi_global_snapshot_23071.ckpt/0/opal_snapshot_0.ckpt/snapshot_meta.data
> root_at_debian1:~#
>
> It can be found on debian2:
>
> root_at_debian2:~# find ompi_global_snapshot_23071.ckpt/
> ompi_global_snapshot_23071.ckpt/
> ompi_global_snapshot_23071.ckpt/0
> ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt
> ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt/snapshot_meta.data
> ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt/ompi_blcr_context.6501
> root_at_debian2:~#

By default, Open MPI requires a shared file system to save checkpoint files. So by default the local snapshot is moved, since the system assumes that it is writing to the same directory on a shared file system. If you want to use the local disk staging functionality (which is known to be broken in the 1.4 series), check out the example on the webpage below:
  http://osl.iu.edu/research/ft/ompi-cr/examples.php#uc-ckpt-local
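
If you want to confirm before restarting that every local snapshot is visible in the global snapshot directory on the node you restart from, a check along these lines may help. This is only a sketch: check_snapshot is a hypothetical helper, not part of Open MPI, and it assumes the directory layout shown in your find output (a sequence directory such as 0, with one opal_snapshot_<rank>.ckpt directory per rank).

```shell
# Hypothetical helper, not an Open MPI tool.
# Usage: check_snapshot <global_snapshot_dir> <num_procs> [seq]
# Assumes the layout <dir>/<seq>/opal_snapshot_<rank>.ckpt seen in this thread.
# The body runs in a subshell so its variables do not leak into the caller.
check_snapshot() (
    dir=$1
    nprocs=$2
    seq=${3:-0}        # checkpoint sequence number, defaults to 0
    missing=0
    rank=0
    while [ "$rank" -lt "$nprocs" ]; do
        local_snap="$dir/$seq/opal_snapshot_$rank.ckpt"
        if [ ! -d "$local_snap" ]; then
            echo "missing: $local_snap"
            missing=$((missing + 1))
        fi
        rank=$((rank + 1))
    done
    # Succeed only if every rank's local snapshot directory was found.
    [ "$missing" -eq 0 ]
)
```

For the two-process run in this thread, you would call it as `check_snapshot ompi_global_snapshot_23071.ckpt 2`. A non-zero exit status means at least one local snapshot is not reachable from that node, which is what happens when the directory is not on a shared file system.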

>
> Then I tried supplying a hostfile for ompi-run and it worked just
> fine! I thought the checkpoint included the hosts information?

We intentionally do not save the hostfile as part of the checkpoint. Typically folks will want to restart on different nodes than those they checkpointed on (such as in a batch scheduling environment). If we saved the hostfile, it could lead to unexpected behavior on restart when the machines they wish to restart on have changed.

If you need to pass a hostfile, then you can pass one to ompi-restart just as you would to mpirun.
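
As a rough sketch of what that could look like with the two nodes from this thread: the file name restart_hosts is made up for illustration, and the --hostfile spelling is assumed to match mpirun's, so check ompi-restart --help on your build before relying on it.

```shell
# Write a hostfile listing the nodes to restart on (node names from this thread;
# the file name restart_hosts is arbitrary).
printf '%s\n' debian1 debian2 > restart_hosts

# Assumption: ompi-restart accepts --hostfile the same way mpirun does.
# The guard only avoids a confusing error when Open MPI is not in PATH.
if command -v ompi-restart >/dev/null 2>&1; then
    ompi-restart --hostfile restart_hosts ompi_global_snapshot_23071.ckpt
else
    echo "ompi-restart not found in PATH"
fi
```

The hosts listed here do not have to be the ones the job was checkpointed on, which is the reason the hostfile is left out of the snapshot.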

>
> So I think it's fixed in 1.5. Should I try the 1.4 branch in SVN?

The file staging functionality is known to be broken in the 1.4 series at this time, per the ticket below:
  https://svn.open-mpi.org/trac/ompi/ticket/2139

Unfortunately, the fix is likely to be both custom for the branch (since we redesigned the functionality for the trunk and v1.5) and fairly involved. I don't have the time at the moment to work on a fix, but hopefully in the coming months I will be able to look into this issue. In the meantime, patches are always welcome :)

Hope that helps,
Josh

>
>
> Thanks a bunch,