Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] segfault when resuming on different host
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2011-12-29 16:31:30

Often this type of problem is caused by the 'prelink' feature in Linux.
BLCR has a FAQ item that discusses this issue and how to resolve it.

I would give that a try. If that does not help, then you might want to
try checkpointing a single (non-MPI) process on one node with BLCR and
restarting it on the other node. If that fails, then the cause is likely
a BLCR/system configuration issue. If it does work, then we can dig more
into the Open MPI causes.
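
A rough sketch of that single-process test, assuming the standard BLCR
command-line tools (cr_run, cr_checkpoint, cr_restart) are installed on
both nodes. The program name ./counter and the path /shared/counter.ckpt
are hypothetical placeholders; the checkpoint file has to be on storage
that both nodes can see.

```shell
#!/bin/sh
# Sketch only: ./counter and /shared/counter.ckpt are hypothetical names.

# Print the name of each BLCR command-line tool that is not on PATH.
missing_blcr_tools() {
    for tool in cr_run cr_checkpoint cr_restart; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Run on the first node: start a plain (non-MPI) program under BLCR,
# let it run briefly, checkpoint it, then stop it.
checkpoint_on_first_node() {
    cr_run ./counter &        # assumes cr_run execs the program, so $! is its PID
    pid=$!
    sleep 5
    cr_checkpoint -f /shared/counter.ckpt "$pid"
    kill "$pid"
}

# Run on the second node: restart from the same context file.
restart_on_second_node() {
    cr_restart /shared/counter.ckpt
}

# Only report tool availability here; call the functions above by hand.
missing=$(missing_blcr_tools)
if [ -n "$missing" ]; then
    echo "BLCR tools not found:" $missing
fi
```

If restart_on_second_node also segfaults, the problem is below Open MPI.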

Let me know if disabling prelink works for you.
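
For a quick look at whether prelinking is enabled on a node, something
like the following may help. It is a sketch that assumes the
CentOS/RHEL-style config file /etc/sysconfig/prelink with a
PRELINKING=yes|no line; the function name prelink_status is made up for
this example, and other distributions may keep the setting elsewhere.

```shell
#!/bin/sh
# Sketch only: assumes a RHEL/CentOS-style prelink config file.

# Print "enabled", "disabled", or "unknown" for the given config file
# (defaults to /etc/sysconfig/prelink).
prelink_status() {
    conf=${1:-/etc/sysconfig/prelink}
    [ -r "$conf" ] || { echo "unknown"; return 0; }
    # Use the last uncommented PRELINKING= assignment in the file.
    value=$(sed -n 's/^[[:space:]]*PRELINKING=//p' "$conf" | tail -n 1)
    case "$value" in
        yes|\"yes\") echo "enabled" ;;
        no|\"no\")   echo "disabled" ;;
        *)           echo "unknown" ;;
    esac
}

echo "prelink: $(prelink_status)"

# To disable it (as root, on every node; assumption based on RHEL-style setups):
#   1. set PRELINKING=no in /etc/sysconfig/prelink
#   2. run: prelink --undo --all
```

Checkpoints taken before prelinking was undone may need to be retaken.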

-- Josh

On Thu, Dec 29, 2011 at 1:19 PM, Lloyd Brown <lloyd_brown_at_[hidden]> wrote:
> Hi, all.
> I'm in the middle of testing some of the checkpoint/restart capabilities
> of OpenMPI with BLCR on our cluster.  I've been able to checkpoint and
> restart successfully when I restart on the same nodes as it was running
> previously.  But when I try to restart on a different host, I always get
> an error like this:
>> $ ompi-restart ompi_global_snapshot_15935.ckpt
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 1 with PID 15201 on node m5stage-1-2.local exited on signal 11 (Segmentation fault).
>> --------------------------------------------------------------------------
> Now, it's very possible that I've missed something during the setup, or
> that, despite my failure to find it while searching the mailing list,
> this is already answered somewhere, but none of the threads I could
> find seemed to apply (e.g., cr_restart *is* installed, etc.).
> I'm attaching a tarball that contains the source code of the very simple
> test application, as well as some example output of "ompi_info --all"
> and "ompi_info -v ompi full --parsable".  I don't know if this will be
> useful or not.
> This is being tested on CentOS v5.4 with BLCR v0.8.4.  I've seen this
> problem with OpenMPI v1.4.2, v1.4.4, and v1.5.4.
> If anyone has any ideas on what's going on, or how to best debug this,
> I'd love to hear about it.
> I don't mind doing the legwork too, but I'm just stumped about where to
> go from here.  I have some core files, but I'm having trouble getting the
> symbols from the backtrace in gdb.  Maybe I'm doing it wrong.
> TIA,
> --
> Lloyd Brown
> Systems Administrator
> Fulton Supercomputing Lab
> Brigham Young University

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory