Open MPI User's Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI users] ompi-checkpoint hangs when using in multiple clusters
From: fengguang tian (fernyabc_at_[hidden])
Date: 2010-03-23 12:25:20

Now I have set $HOME as a shared directory, but when I run ompi-checkpoint,
it shows the following (nimbus1 is the remote machine in my cluster):

[nimbus1:12630] opal_os_dirpath_create: Error: Unable to create the
sub-directory (/home/mpiu/ompi_global_snapshot_1662.ckpt/0) of
(/home/mpiu/ompi_global_snapshot_1662.ckpt/0/opal_snapshot_4.ckpt), mkdir
failed [1]
[nimbus1:12630] Error: No metadata filename specified!

Why is that?
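
The log above only says that mkdir failed under the snapshot directory. One way to narrow that down is to try creating a nested directory in the same place, as the same user, on each node. The following is a minimal sketch in Python, not anything shipped with Open MPI: `probe_dir_create` and the `probe_<pid>.ckpt/0` name are hypothetical and only mimic the nested layout seen in the error message.

```python
import errno
import os
import shutil


def probe_dir_create(base):
    """Try to create base/probe_<pid>.ckpt/0 and report (ok, message).

    This only mimics the nested layout from the error message; it does not
    call into Open MPI. Whatever it creates is removed again on success.
    """
    top = os.path.join(base, "probe_%d.ckpt" % os.getpid())
    target = os.path.join(top, "0")
    try:
        os.makedirs(target)
    except OSError as exc:
        # Report the errno symbolically (e.g. EACCES, ENOTDIR) when known.
        name = errno.errorcode.get(exc.errno, "unknown")
        return False, "mkdir failed: errno %s (%s): %s" % (
            exc.errno, name, exc.strerror)
    shutil.rmtree(top)  # clean up only the directory tree created above
    return True, "ok"
```

Running something like `print(probe_dir_create(os.path.expanduser("~")))` on nimbus1 and on the other nodes would show whether plain directory creation in $HOME works there, and with which errno it fails if it does not.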


On Tue, Mar 23, 2010 at 10:37 AM, Fernando Lemos <fernandotcl_at_[hidden]> wrote:

> On Tue, Mar 23, 2010 at 12:24 PM, fengguang tian <fernyabc_at_[hidden]>
> wrote:
> > Hi
> >
> > I am using open-mpi and blcr in a cluster of 3 machines, and checkpoint
> > and restart work fine on a single machine, but when checkpointing in a
> > cluster environment, ompi-checkpoint hangs
> Besides what has been said in another thread (regarding 1.4 and
> checkpointing to shared directories), you might want to make sure your
> app is terminated if you send a SIGTERM to it. Some apps might ignore
> SIGTERM or handle it in a way that does not cause the apps to quit.
> ompi-checkpoint --term is simply ompi-checkpoint + sending SIGTERM to
> the application (not sure whether SIGTERM is sent to each process
> individually or not).
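
The SIGTERM point in the quoted reply can be checked outside of MPI with a small test. This is a minimal sketch, assuming the application can be started as a standalone process; `exits_on_sigterm` is a hypothetical helper, and it says nothing about how ompi-checkpoint --term delivers the signal to the individual processes of a job.

```python
import signal
import subprocess
import sys
import time


def exits_on_sigterm(argv, timeout=5.0, startup_delay=1.0):
    """Start argv, send it SIGTERM, and return True if it exits in time.

    startup_delay gives the child a chance to install its signal handlers
    before the signal is sent.
    """
    proc = subprocess.Popen(argv)
    time.sleep(startup_delay)
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        # The process ignored or swallowed SIGTERM; SIGKILL cannot be ignored.
        proc.kill()
        proc.wait()
        return False


# Example child that ignores SIGTERM, as some applications do.
IGNORES_TERM = [
    sys.executable, "-c",
    "import signal, time; "
    "signal.signal(signal.SIGTERM, signal.SIG_IGN); time.sleep(30)",
]

# Example child with default signal handling.
DEFAULT_TERM = [sys.executable, "-c", "import time; time.sleep(30)"]
```

Replacing the example command lines with the real application's command line shows whether it quits on SIGTERM; a False result would point to the kind of signal handling described in the reply.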