Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] change hosts to restart the checkpoint
From: Fernando Lemos (fernandotcl_at_[hidden])
Date: 2010-03-07 09:06:00

On Fri, Mar 5, 2010 at 12:03 PM, Josh Hursey <jjhursey_at_[hidden]> wrote:
> This type of failure is usually due to prelink'ing being left enabled on one
> or more of the systems. This has come up multiple times on the Open MPI
> list, but is actually a problem between BLCR and the Linux kernel. BLCR has
> a FAQ entry on this that you will want to check out:
> If that does not work, then we can look into other causes.

I also suggest checkpointing and restarting an app with BLCR
directly: take any simple app, run it with cr_run, checkpoint it
with cr_checkpoint, then restart it with cr_restart. Make sure the blcr
kernel module is loaded too. That way you can tell whether the problem
is related to Open MPI or not.
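
For reference, the BLCR-only round trip could look roughly like the sketch below. It is only an illustration under some assumptions: the test program (`sleep 30`), the context file path, and the helper names `blcr_ready` and `blcr_roundtrip` are made up for this example, and the `--term` and `-f` options to cr_checkpoint should be checked against the man page of your BLCR version. If the module or the tools are missing, the script just reports that and does nothing.

```shell
#!/bin/sh
# Sketch of a BLCR-only checkpoint/restart test, with no Open MPI involved.
# APP and CONTEXT are hypothetical defaults; any simple dynamically linked
# program and any writable path should do.
APP="${APP:-sleep 30}"
CONTEXT="${CONTEXT:-/tmp/blcr_test.context}"
STATUS="skipped"

# Succeeds only if the blcr kernel module is loaded and the tools are on PATH.
blcr_ready() {
    lsmod 2>/dev/null | grep -q '^blcr' || return 1
    command -v cr_run >/dev/null 2>&1 || return 1
    command -v cr_checkpoint >/dev/null 2>&1 || return 1
    command -v cr_restart >/dev/null 2>&1 || return 1
    return 0
}

# Run the app under cr_run, checkpoint it, then restart it from the file.
blcr_roundtrip() {
    # $APP is left unquoted on purpose so "sleep 30" splits into words.
    cr_run $APP &
    pid=$!
    sleep 2

    # Assumed options: --term stops the app after the checkpoint is taken,
    # -f names the context file.
    cr_checkpoint --term -f "$CONTEXT" "$pid" || return 1
    wait "$pid" 2>/dev/null

    cr_restart "$CONTEXT" &
    rpid=$!
    sleep 2

    # If cr_restart is still alive, the restart most likely worked.
    kill -0 "$rpid" 2>/dev/null || return 1

    # Best-effort cleanup; the test app also ends on its own.
    kill "$rpid" 2>/dev/null
    return 0
}

if blcr_ready; then
    if blcr_roundtrip; then
        STATUS="ok"
    else
        STATUS="failed"
    fi
fi

echo "BLCR round trip: $STATUS"
```

If this reports "failed" on a node, the problem is in the BLCR setup on that node (prelinking being one candidate), not in Open MPI. If it reports "ok" on every node involved, the Open MPI side is the next place to look.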