Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] segfault when resuming on different host
From: Lloyd Brown (lloyd_brown_at_[hidden])
Date: 2011-12-29 17:25:06


Josh,

When I use cr_{run,checkpoint,restart} to start, checkpoint, and then
restart a single-threaded, single-process app on a different host, it
works, even with prelinking enabled. That's kinda why I assumed the
problem was with the Open MPI code, and didn't look at the BLCR FAQ that
closely, to be honest.

Having said that, I did temporarily disable prelink on my two hosts, and
tried my MPI test again, and it seemed to work. I'll have to do more
tests with something more intense (xhpl, maybe), and so on, but
preliminary results look good.
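
For anyone wanting to confirm the prelink setting on each host before
repeating this kind of test, here is a minimal sketch. It assumes a Red
Hat-style system where prelink is configured in /etc/sysconfig/prelink
through a PRELINKING=yes|no line. That path and variable are assumptions
(the thread does not say which distribution was used), and the helper
names prelinking_enabled and report are made up for illustration.

```python
import re
from pathlib import Path

# Assumed config location on Red Hat-style systems; other distributions
# may keep it elsewhere or not ship prelink at all.
PRELINK_CONF = Path("/etc/sysconfig/prelink")


def prelinking_enabled(conf_text):
    """Return True/False for the last PRELINKING= line, or None if absent."""
    setting = None
    for line in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        match = re.match(r'^PRELINKING\s*=\s*["\']?(\w+)["\']?$', line)
        if match:
            setting = match.group(1).lower() == "yes"
    return setting


def report(path=PRELINK_CONF):
    """Describe the prelink setting found in the given config file."""
    if not path.is_file():
        return f"{path}: not found (prelink may not be installed)"
    state = prelinking_enabled(path.read_text())
    if state is None:
        return f"{path}: no PRELINKING setting found"
    return f"{path}: prelinking is {'enabled' if state else 'disabled'}"


if __name__ == "__main__":
    print(report())
```

Run it on both the checkpoint host and the restart host. It only reads
the configuration: it does not tell you whether libraries that are
already installed were prelinked, and it does not undo anything. The
BLCR FAQ entry linked in the quoted message below covers the actual
procedure.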

Thanks for pointing me in the right direction.

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 12/29/2011 02:31 PM, Josh Hursey wrote:
> Often this type of problem is due to the 'prelink' option in Linux.
> BLCR has a FAQ item that discusses this issue and how to resolve it:
> https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink
>
> I would give that a try. If that does not help then you might want to
> try checkpointing a single (non-MPI) process on one node with BLCR and
> restart it on the other node. If that fails, then it is likely a
> BLCR/system configuration issue that is the cause. If it does work,
> then we can dig more into the Open MPI causes.
>
> Let me know if disabling prelink works for you.
>
> -- Josh
>