Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] ompi-checkpoint hangs when using in multiple clusters
From: fengguang tian (fernyabc_at_[hidden])
Date: 2010-03-23 12:25:20


Now I have set $HOME as a shared directory, but when I run ompi-checkpoint,
it shows the following (nimbus1 is the remote machine in my cluster):

[nimbus1:12630] opal_os_dirpath_create: Error: Unable to create the
sub-directory (/home/mpiu/ompi_global_snapshot_1662.ckpt/0) of
(/home/mpiu/ompi_global_snapshot_1662.ckpt/0/opal_snapshot_4.ckpt), mkdir
failed [1]
[nimbus1:12630] Error: No metadata filename specified!

why is that?
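
One way to narrow this down is to try creating the same nested directory by hand, as the same user, on the node that reports the failure (nimbus1 here). The following is only a minimal sketch in Python: `try_create_dir` is a hypothetical helper, not part of Open MPI, and the path in the usage section is a placeholder under a temporary directory rather than the real snapshot location. It only reports which OS error, if any, the directory creation hits.

```python
import errno
import os
import tempfile


def try_create_dir(path):
    """Try to create path and any missing parents, roughly like `mkdir -p`.

    Returns (True, "") on success, or (False, reason) on failure, where
    reason names the errno reported by the operating system.
    """
    try:
        os.makedirs(path, exist_ok=True)
    except OSError as exc:
        name = errno.errorcode.get(exc.errno, "UNKNOWN")
        return False, f"{name}: {exc.strerror}"
    return True, ""


if __name__ == "__main__":
    # Placeholder layout under a temp dir; substitute the directory named in
    # your own error message and run this on the node that reported it.
    base = tempfile.mkdtemp()
    target = os.path.join(base, "ompi_global_snapshot_test.ckpt", "0",
                          "opal_snapshot_0.ckpt")
    ok, reason = try_create_dir(target)
    print("created" if ok else f"failed ({reason})", target)
```

If the call fails on the remote node but succeeds on the head node, that would point at ownership, permissions, or the way the shared directory is mounted rather than at ompi-checkpoint itself.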

cheers
fengguang

On Tue, Mar 23, 2010 at 10:37 AM, Fernando Lemos <fernandotcl_at_[hidden]> wrote:

> On Tue, Mar 23, 2010 at 12:24 PM, fengguang tian <fernyabc_at_[hidden]>
> wrote:
> > Hi
> >
> > I am using Open MPI and BLCR in a cluster of 3 machines. Checkpoint and
> > restart work fine on a single machine, but when checkpointing in the
> > cluster environment, ompi-checkpoint hangs.
>
> Besides what has been said in another thread (regarding 1.4 and
> checkpointing to shared directories), you might want to make sure your
> app terminates when you send it a SIGTERM. Some apps might ignore
> SIGTERM or handle it in a way that does not cause them to quit.
>
> ompi-checkpoint --term is simply ompi-checkpoint + sending SIGTERM to
> the application (not sure whether SIGTERM is sent to each process
> individually or not).
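
The SIGTERM check suggested above can be sketched for a single local process as follows. This is an illustration under stated assumptions, not how ompi-checkpoint works internally: `exits_on_sigterm` is a hypothetical helper, the `startup` and `grace` timeouts are arbitrary, and it tests one process started directly rather than ranks launched through mpirun.

```python
import subprocess
import sys
import time


def exits_on_sigterm(cmd, startup=1.0, grace=5.0):
    """Start cmd, send it SIGTERM, and report whether it exits in time.

    Returns True if the process is gone within `grace` seconds of the
    signal (including the case where it had already exited on its own),
    and False if it had to be killed with SIGKILL.
    """
    proc = subprocess.Popen(cmd)
    time.sleep(startup)      # give the program time to install its handlers
    proc.terminate()         # on POSIX this sends SIGTERM
    try:
        proc.wait(timeout=grace)
        return True
    except subprocess.TimeoutExpired:
        proc.kill()          # SIGKILL cannot be caught or ignored
        proc.wait()
        return False


if __name__ == "__main__":
    # Stand-in for the real application: a child that just sleeps.
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    print("exits on SIGTERM:", exits_on_sigterm(sleeper))
```

A result of False for the real application would suggest it ignores or swallows SIGTERM, which is the situation described above where `--term` may not cause it to quit.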