Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] ompi-restart issue : ompi-restart doesn't work across nodes - possible installation problem or environment setting problem??
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2008-10-06 18:44:30


The installation looks OK, though I'm not sure what is causing the
segfault of the restarted process. Two things to try. First, can you
send me a backtrace from the core file generated by the
segmentation fault? That will provide insight into what is causing it.
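
A minimal sketch of collecting such a backtrace, assuming gdb is
installed on the node where the process crashed. The application path
and core file name below are hypothetical placeholders; the actual core
file name depends on the system's core pattern settings.

```shell
# Allow core files to be written by processes started from this shell
# (may be refused if the hard limit is lower).
ulimit -c unlimited 2>/dev/null || echo "could not raise core size limit" >&2

# Hypothetical names: substitute the real MPI executable and core file.
APP=./my_mpi_app
CORE=core
BT_FILE=backtrace.txt

# Once the restart has crashed and left a core file, print the stack
# non-interactively and save it so it can be attached to a reply.
if [ -f "$CORE" ] && command -v gdb >/dev/null 2>&1; then
  gdb -batch -ex "bt" "$APP" "$CORE" > "$BT_FILE" 2>&1
  echo "backtrace written to $BT_FILE"
else
  echo "no core file named $CORE found (or gdb missing)" >&2
fi
```

Building with --enable-debug, as in the original configure line, should
make the resulting stack frames easier to read.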

Second, you may try enabling the C/R thread, which allows a
checkpoint to progress when an application is in a computation loop
instead of only when it is in the MPI library. To do so, configure with
these additional flags:
   --enable-ft-thread --enable-mpi-threads
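
As a sketch, the two flags can be appended to the configure line quoted
later in this thread. The prefix and BLCR path are copied from that
message and are assumptions about the local setup; the command is only
printed here so it can be reviewed before it is run from the Open MPI
source tree.

```shell
# Flags from the original build quoted in this thread.
BASE_FLAGS="--prefix=/home/csgrad/audhakne/.openmpi --with-ft=cr --with-blcr=/usr/lib64 --enable-debug"

# Additional flags that enable the C/R thread.
FT_THREAD_FLAGS="--enable-ft-thread --enable-mpi-threads"

# Assemble the full command and print it for review.
CONFIGURE_CMD="./configure $BASE_FLAGS $FT_THREAD_FLAGS"
echo "$CONFIGURE_CMD"

# After reviewing, run it from the source directory, then rebuild:
#   $CONFIGURE_CMD && make && make install
```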

What version of Open MPI are you using? What version of BLCR?

Best,
Josh

On Oct 6, 2008, at 3:55 PM, arun dhakne wrote:

> Hi all,
>
> This is the procedure I followed to install Open MPI. Is there an
> installation or environment-setting problem here? An Open MPI program
> with 4 processes is run across 2 dual-core Intel machines, with 2
> processes running on each machine.
>
> ompi-checkpoint succeeds, but ompi-restart fails with the following
> error:
>
>
> $:> ompi-restart ompi_global_snapshot_6045.ckpt
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 6372 on node
> acl-cadi-pentd-1.cse.buffalo.edu exited on signal 11 (Segmentation
> fault).
> --------------------------------------------------------------------------
>
> Open-mpi installation steps:
> ./configure --prefix=/home/csgrad/audhakne/.openmpi --with-ft=cr \
>     --with-blcr=/usr/lib64 --enable-debug
> make
> make install
>
>
>
> export LD_LIBRARY_PATH=$HOME/.openmpi/lib/:$HOME/.openmpi/lib/openmpi:/usr/lib64
> export PATH=$HOME/.openmpi/bin:$PATH
>
> NOTE: BLCR is loaded as a kernel module:
> $:> lsmod | grep blcr
>
> blcr 117892 0
> blcr_vmadump 58264 1 blcr
> blcr_imports 46080 2 blcr,blcr_vmadump
>
> Please let me know if there is a problem with the above procedure.
> Thanks a lot for your time.
>
> Best.
>
> ---------- Forwarded message ----------
> From: arun dhakne <arundhakne_at_[hidden]>
> Date: Tue, Sep 30, 2008 at 12:52 AM
> Subject: ompi-restart issue : ompi-restart doesn't work across nodes
> To: Open MPI Users <users_at_[hidden]>
>
>
> Hi all,
>
> I went through some previous ompi-restart issues, but I couldn't
> find anything similar to this problem.
>
> I have installed BLCR and configured Open MPI 'openmpi-1.3a1r19645'.
>
> i) If the sample MPI program is run on a single machine (np 4, without
> any hostfile) and I checkpoint it, the checkpoint succeeds and
> ompi-restart works as well.
>
> ii) If the sample MPI program is run across 2 different nodes, the
> checkpoint succeeds BUT ompi-restart throws the following error:
>
> $ ompi-restart ompi_global_snapshot_7604.ckpt
> --------------------------------------------------------------------------
> mpirun noticed that process rank 3 with PID 9590 on node
> acl-cadi-pentd-1.cse.buffalo.edu exited on signal 11 (Segmentation
> fault).
> --------------------------------------------------------------------------
>
> Please let me know if more information is needed.
>
> --
> Thanks and Regards,
> Arun U. Dhakne