Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] ompi-checkpoint hangs when using in multiple clusters
From: Fernando Lemos (fernandotcl_at_[hidden])
Date: 2010-03-23 11:37:27


On Tue, Mar 23, 2010 at 12:24 PM, fengguang tian <fernyabc_at_[hidden]> wrote:
> Hi
>
> I am using Open MPI and BLCR in a cluster of 3 machines. Checkpoint
> and restart work fine on a single machine, but when checkpointing in
> the cluster environment, ompi-checkpoint hangs

Besides what has been said in another thread (regarding 1.4 and
checkpointing to shared directories), you might want to make sure your
app actually terminates when you send it a SIGTERM. Some apps ignore
SIGTERM or handle it in a way that does not cause them to quit.

ompi-checkpoint --term is simply ompi-checkpoint followed by sending
SIGTERM to the application (I'm not sure whether SIGTERM is sent to
each process individually or not).
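
To illustrate the point about SIGTERM handling, here is a minimal
sketch of an app that installs its own SIGTERM handler but still quits
when the signal arrives. It is plain Python with no MPI calls, and it
assumes a POSIX system. The names state, handle_sigterm,
install_handler and work_loop are made up for this example and are not
part of Open MPI or BLCR.

```python
import os
import signal
import time

# Shared flag: the handler only records the request, the loop acts on it.
state = {"stop": False}


def handle_sigterm(signum, frame):
    # Do not swallow the signal; remember that termination was requested.
    state["stop"] = True


def install_handler():
    # Must be called from the main thread (a Python restriction).
    signal.signal(signal.SIGTERM, handle_sigterm)


def work_loop(max_steps=100, step_seconds=0.01):
    # Hypothetical work loop; returns the number of steps completed
    # before a stop was requested or max_steps was reached.
    steps = 0
    while not state["stop"] and steps < max_steps:
        time.sleep(step_seconds)  # stand-in for real computation
        steps += 1
    return steps


if __name__ == "__main__":
    install_handler()
    # Simulate the SIGTERM that would normally come from outside.
    os.kill(os.getpid(), signal.SIGTERM)
    done = work_loop()
    print("stopped after", done, "steps; exiting")
```

An app whose handler sets no such flag (or whose loop never checks it)
would keep running after SIGTERM, which is the situation described
above. A quick manual check is to send SIGTERM to the running app
yourself and see whether it exits.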