Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] ompi-restart failed
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2010-07-16 17:42:25


Open MPI can restart multi-threaded applications on any number of nodes (I do this routinely in testing).

If you are still experiencing this problem (sorry for the late reply), can you send me the MCA parameters that you are using, the command line, and a backtrace from the core file generated by the application?

Those details will help me narrow down what might be going wrong. You might also try testing against the v1.5 series or the development trunk to check whether the problem is specific to v1.4.
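
One possible way to gather these pieces into a single file is sketched below. It is only a rough outline: the names APP, CORE, and REPORT are placeholders rather than anything from this thread, and it assumes gdb and ompi_info are in your PATH and that core dumps were enabled (for example with "ulimit -c unlimited") before the crash was reproduced.

```shell
#!/bin/sh
# Sketch: collect launch command, MCA parameters, and a backtrace in one report.
# APP, CORE, and REPORT are placeholder names (assumptions); adjust to your setup.
APP=${APP:-./test}
CORE=${CORE:-core}
REPORT=${REPORT:-ompi-restart-report.txt}

{
  echo "== Command line =="
  # Launch command as quoted in the original report.
  echo "mpirun -am ft-enable-cr -machinefile host $APP"

  echo "== MCA parameters =="
  # ompi_info lists the MCA parameters known to this installation.
  if command -v ompi_info >/dev/null 2>&1; then
    ompi_info --param all all
  else
    echo "ompi_info not found in PATH"
  fi

  echo "== Backtrace =="
  # Print a backtrace for every thread recorded in the core file.
  if command -v gdb >/dev/null 2>&1 && [ -f "$CORE" ]; then
    gdb -batch -ex "thread apply all bt" "$APP" "$CORE"
  else
    echo "gdb or core file '$CORE' not available"
  fi
} > "$REPORT" 2>&1

echo "wrote $REPORT"
```

Add the exact ompi-checkpoint and ompi-restart command lines you ran to the report as well, since the script only records the launch command.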

-- Josh

On Jun 14, 2010, at 2:47 AM, Nguyen Toan wrote:

> Hi all,
> I finally figured out the answer. I added the parameter "-machinefile host" to the "ompi-restart" command and it restarted correctly. So is Open MPI unable to restart a multi-threaded application on 1 node?
>
> Nguyen Toan
>
> On Tue, Jun 8, 2010 at 12:07 AM, Nguyen Toan <nguyentoan1508_at_[hidden]> wrote:
> Sorry, I just want to add 2 more things:
> + I tried configuring with and without --enable-ft-thread, but nothing changed.
> + I also applied the following patch to Open MPI and reinstalled, but I got the same error:
> https://svn.open-mpi.org/trac/ompi/raw-attachment/ticket/2139/v1.4-preload-part1.diff
>
> Can somebody help? Thank you very much.
>
> Nguyen Toan
>
>
> On Mon, Jun 7, 2010 at 11:51 PM, Nguyen Toan <nguyentoan1508_at_[hidden]> wrote:
> Hello everyone,
>
> I'm using OpenMPI 1.4.2 with BLCR 0.8.2 to test checkpointing on 2 nodes but it failed to restart (Segmentation fault).
> Here are the details concerning my problem:
>
> + OS: CentOS 5.4
> + OpenMPI configure:
> ./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads \
> --with-blcr=/home/nguyen/opt/blcr --with-blcr-libdir=/home/nguyen/opt/blcr/lib \
> --prefix=/home/nguyen/opt/openmpi \
> --enable-mpirun-prefix-by-default
> + mpirun -am ft-enable-cr -machinefile host ./test
>
> I checkpointed the test program using "ompi-checkpoint -v -s PID" and the checkpoint file was created successfully. However, the restart with ompi-restart failed:
> "mpirun noticed that process rank 0 with PID 21242 on node rc014.local exited on signal 11 (Segmentation fault)"
>
> Did I miss something in the installation of OpenMPI?
>
> Regards,
> Nguyen Toan