Open MPI User's Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI users] segfault when resuming on different host
From: Lloyd Brown (lloyd_brown_at_[hidden])
Date: 2011-12-29 17:25:06


Josh,

When I use cr_{run,checkpoint,restart} to start, checkpoint, and restart
a single-threaded, single-process app on a different host, it works,
even with prelinking enabled. That's kinda why I assumed the problem
was with the OpenMPI code, and didn't look at the BLCR FAQ that closely,
to be honest.

Having said that, I did temporarily disable prelink on my two hosts, and
tried my MPI test again, and it seemed to work. I'll have to do more
tests with something more intense (xhpl, maybe), and so on, but
preliminary results look good.
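
For anyone repeating this test, it can help to confirm that prelinking is off on every host before checkpointing. What follows is only a rough sketch, not something taken from this thread: it assumes a Red Hat style system where the setting lives in /etc/sysconfig/prelink as a PRELINKING=yes|no line, and the function names are made up for illustration. A missing or unreadable file is treated as "not enabled", which is also an assumption.

```python
from pathlib import Path


def prelinking_enabled(config_text: str) -> bool:
    """Return True if the given config text sets PRELINKING=yes.

    The last uncommented PRELINKING= assignment wins, as it would
    when a shell sources the file.
    """
    enabled = False
    for raw in config_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line.startswith("PRELINKING="):
            continue
        value = line.split("=", 1)[1].strip().strip("'\"").lower()
        enabled = value == "yes"
    return enabled


def host_prelinking_enabled(path: str = "/etc/sysconfig/prelink") -> bool:
    """Check the local host; the default path is an assumption."""
    try:
        return prelinking_enabled(Path(path).read_text())
    except OSError:
        # Assumption: no readable config file means prelink is not set up.
        return False


if __name__ == "__main__":
    state = "enabled" if host_prelinking_enabled() else "not enabled"
    print(f"prelinking appears to be {state} on this host")
```

Running it on both the checkpoint host and the restart host gives a quick sanity check. It only reads the setting; it does not tell you whether binaries that were already prelinked have been reverted.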

Thanks for pointing me in the right direction.

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 12/29/2011 02:31 PM, Josh Hursey wrote:
> Often this type of problem is due to the 'prelink' option in Linux.
> BLCR has a FAQ item that discusses this issue and how to resolve it:
> https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink
>
> I would give that a try. If that does not help then you might want to
> try checkpointing a single (non-MPI) process on one node with BLCR and
> restart it on the other node. If that fails, then it is likely a
> BLCR/system configuration issue that is the cause. If it does work,
> then we can dig more into the Open MPI causes.
>
> Let me know if disabling prelink works for you.
>
> -- Josh
>