Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Restarting from a checkpoint (OMPI 1.3)
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2009-01-21 09:10:35


Gregor,

Thanks for the bug report. I saw a problem similar to this a few
months ago (documented in the ticket below).
   https://svn.open-mpi.org/trac/ompi/ticket/1527
Though we fixed the accounting information, the patch I had for
orte-restart to make it use --default-hostfile instead of --hostfile
was never applied to the trunk (my fault here). The patch is attached
if you want to apply it to make sure it fixes the problem for you.
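
To check whether the patch helps, the two-host case can be reproduced with a sequence along the following lines. This is only a sketch: the process count, application path, hostfile path, and mpirun PID are made-up placeholders, the snapshot name assumes the default ompi_global_snapshot_<PID>.ckpt naming, and the script prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch only: builds the command lines as strings instead of running them.
# Hypothetical values (not from this thread): process count, application
# path, hostfile path, and the PID of mpirun.

# Launch with checkpoint/restart support enabled.
launch_cmd() {   # $1 = hostfile, $2 = process count, $3 = application
    echo "mpirun -np $2 --hostfile $1 -am ft-enable-cr $3"
}

# Checkpoint a running job, addressed by the PID of its mpirun.
checkpoint_cmd() {   # $1 = PID of mpirun
    echo "ompi-checkpoint $1"
}

# Restart from the global snapshot, asking for the original hosts.
# Assumes the default snapshot name ompi_global_snapshot_<PID>.ckpt.
restart_cmd() {   # $1 = hostfile, $2 = PID of mpirun at checkpoint time
    echo "ompi-restart --hostfile $1 ompi_global_snapshot_$2.ckpt"
}

# Print the three steps for the placeholder values.
launch_cmd ./hosts 4 ./my_app
checkpoint_cmd 12345
restart_cmd ./hosts 12345
```

After the restart, having each process report its hostname is a simple way to confirm that both hosts are in use again.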

I have committed the patch to the development trunk (r20305), and
asked that it be brought over to the v1.3 branch so it will be
included in the v1.3.1 release. If you want to track its progress, you
can do so using the ticket below.
   https://svn.open-mpi.org/trac/ompi/ticket/1761

Thanks again,
Josh


On Jan 20, 2009, at 5:07 AM, Gregor Dschung wrote:

> Hey,
>
> I'm trying the newly released Open MPI 1.3 in conjunction with BLCR to
> provide the checkpoint/restart feature.
>
> Configured with ./configure --prefix=/usr/local --with-ft=cr
> --enable-ft-thread --enable-mpi-threads --with-blcr=/
>
> An MPI job on a single machine (several threads) is checkpointed and
> restarted without problems.
>
> The checkpoint of an MPI job across two hosts (Ethernet, TCP) also
> completes without warnings or errors (the home directory and the
> directory containing the MPI application are shared over NFS). The
> restart works too, but all threads are started only on the host where
> I enter the ompi-restart command. Even if I add the -hostfile argument
> to ompi-restart, only that one host is used.
>
> Does anybody have a hint?
>
> Thanks,
> Gregor