Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] segfault when resuming on different host
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2011-12-29 16:31:30


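This type of problem is often caused by prelinking (the 'prelink'
utility) on Linux. Prelinked libraries can end up at different addresses
on different nodes, which can break a restart on another host.
BLCR has a FAQ item that discusses this issue and how to resolve it: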
  https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink
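
As a quick check before changing anything, the sketch below reports whether
prelinking looks enabled on a node. It assumes a Red Hat-style layout (as on
CentOS) where the setting lives in /etc/sysconfig/prelink as a PRELINKING=
line; the helper name prelink_status is made up for illustration, and the
BLCR FAQ above remains the reference for the actual fix.

```shell
# Hypothetical helper: print "enabled", "disabled", or "unknown" for a
# prelink config file. Assumes the Red Hat-style PRELINKING=yes|no setting.
prelink_status() {
    conf="$1"
    if [ ! -r "$conf" ]; then
        echo "unknown"
        return 0
    fi
    # Use the last uncommented PRELINKING= assignment, as sourcing the file would.
    value=$(sed -n 's/^[[:space:]]*PRELINKING=\([^[:space:]#]*\).*/\1/p' "$conf" \
            | tail -n 1 | tr -d "\"'")
    case "$value" in
        yes) echo "enabled" ;;
        "")  echo "unknown" ;;
        *)   echo "disabled" ;;   # any value other than "yes" is treated as off here
    esac
}
```

Running `prelink_status /etc/sysconfig/prelink` on both the checkpoint node
and the restart node shows whether either one still has prelinking turned on.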

I would give that a try. If that does not help, try checkpointing a
single (non-MPI) process on one node with BLCR and restarting it on the
other node. If that fails, the cause is likely a BLCR/system
configuration issue. If it works, then we can dig further into possible
Open MPI causes.
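
The single-process test could look roughly like the sketch below. It uses
the BLCR command-line tools cr_run and cr_checkpoint, and relies on my
understanding that cr_checkpoint writes context.<pid> in the current
directory by default. The helper name checkpoint_single and the program
./counter mentioned afterwards are hypothetical, and the sketch assumes the
working directory is visible from both nodes.

```shell
# Hypothetical helper: start a non-MPI program under BLCR, checkpoint it,
# stop it, and print the name of the checkpoint file.
checkpoint_single() {
    cr_run "$@" &                 # run the program with checkpoint support loaded
    pid=$!
    sleep 1                       # give the process a moment to start
    cr_checkpoint "$pid" || return 1
    kill "$pid" 2>/dev/null || true
    echo "context.$pid"           # assumed default BLCR checkpoint file name
}
```

On the first node, something like `checkpoint_single ./counter` would print
the checkpoint file name; on the second node, `cr_restart` on that file
either resumes the process or reproduces the failure without Open MPI
involved.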

Let me know if disabling prelink works for you.

-- Josh

On Thu, Dec 29, 2011 at 1:19 PM, Lloyd Brown <lloyd_brown_at_[hidden]> wrote:
> Hi, all.
>
> I'm in the middle of testing some of the checkpoint/restart capabilities
> of OpenMPI with BLCR on our cluster.  I've been able to checkpoint and
> restart successfully when I restart on the same nodes as it was running
> previously.  But when I try to restart on a different host, I always get
> an error like this:
>
>> $ ompi-restart ompi_global_snapshot_15935.ckpt
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 1 with PID 15201 on node m5stage-1-2.local exited on signal 11 (Segmentation fault).
>> --------------------------------------------------------------------------
>
>
> Now, it's very possible that I've missed something during the setup, or
> that despite my failure to find it while searching the mailing list,
> that this is already answered somewhere, but none of the threads I could
> find seemed to apply (e.g., cr_restart *is* installed, etc.).
>
> I'm attaching a tarball that contains the source code of the very simple
> test application, as well as some example output of "ompi_info --all"
> and "ompi_info -v ompi full --parsable".  I don't know if this will be
> useful or not.
>
> This is being tested on CentOS v5.4 with BLCR v0.8.4.  I've seen this
> problem with OpenMPI v1.4.2, v1.4.4, and v1.5.4.
>
> If anyone has any ideas on what's going on, or how to best debug this,
> I'd love to hear about it.
>
> I don't mind doing the legwork too, but I'm just stumped where to go
> from here.  I have some core files, but I'm having trouble getting the
> symbols from the backtrace in gdb.  Maybe I'm doing it wrong.
>
>
> TIA,
>
> --
> Lloyd Brown
> Systems Administrator
> Fulton Supercomputing Lab
> Brigham Young University
> http://marylou.byu.edu

-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey