Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI
From: fengguang tian (fernyabc_at_[hidden])
Date: 2010-03-23 11:27:31


I have created the shared file system, but I created it as /mirror in the
root directory, not at the $HOME directory. Is that the problem? Thank you.
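
To make the question concrete, here is a rough sketch of how the mounts could be compared on each node. It only uses `df -P` and `awk`. The helper name `show_mount` is made up for illustration and is not part of Open MPI or BLCR, and `/mirror` is the path from my setup.

```shell
#!/bin/sh
# show_mount is a hypothetical helper name, not an Open MPI or BLCR tool.
# It prints the source (device or NFS export) and the mount point that
# back a given directory.
show_mount() {
    # df -P prints POSIX-format output: on line 2, field 1 is the
    # source and field 6 is the mount point.
    df -P "$1" | awk 'NR==2 {print $1, $6}'
}

# Assumption: /mirror is the shared directory and $HOME is where the
# checkpoint file was written. Run this on every node and compare.
for dir in /mirror "$HOME"; do
    if [ -d "$dir" ]; then
        echo "$dir -> $(show_mount "$dir")"
    else
        echo "$dir: not present on $(uname -n)"
    fi
done
```

If the source printed for $HOME is a local disk while /mirror shows the shared export, then $HOME is not on the shared file system.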

cheers
fengguang

On Tue, Mar 23, 2010 at 10:23 AM, Fernando Lemos <fernandotcl_at_[hidden]> wrote:

> On Mon, Mar 22, 2010 at 8:20 PM, fengguang tian <fernyabc_at_[hidden]> wrote:
> > I set up a cluster of 18 nodes using Open MPI and the BLCR library, and
> > the MPI program runs well on the cluster.
> > But how do I checkpoint the MPI program on this cluster?
> > For example, here is what I do for a test:
> > mpiu_at_nimbus: /mirror$ mpirun -np 50 --hostfile .mpihostfile -am ft-enable-cr hellompi
> > The program runs on the cluster. Then I enter:
> > mpiu_at_nimbus: /mirror$ ompi-checkpoint -term $(pidof mpirun)
> >
> > But the MPI program is not terminated as it was on a single machine,
> > although it created a checkpoint file "ompi_global_snapshot_14030.ckpt"
> > in the home directory on the master node.
>
> Are you using OpenMPI 1.4 without a shared file system mounted at
> $HOME? If yes, then take a look here:
>
> http://www.open-mpi.org/community/lists/users/2010/03/12246.php
>
> Regards,