
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] FT problem
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-09-21 20:52:00


Not really - the person who wrote that code for his PhD thesis has since become a professor and rarely has time to respond on the mailing list, nor to maintain the code. So I'm afraid we don't have anyone who knows much about it any more.

I plan to rework the checkpoint support in upcoming months, but can't say when that will occur.

On Sep 21, 2013, at 7:51 AM, basma a.azeem <basmaabdelazeem_at_[hidden]> wrote:

> Any suggestions?
>
>
> From: basmaabdelazeem_at_[hidden]
> To: users_at_[hidden]
> Subject: FT problem
> Date: Wed, 18 Sep 2013 16:42:29 +0200
>
> I am using openmpi-1.6.1.
> I need to try checkpoint/restart (self, blcr).
> After I installed Open MPI, I had the following in my installation folder:
>
> bin\ompi-checkpoint
> bin\ompi-restart
>
> lib\openmpi\mca_crs_self.la
> lib\openmpi\mca_crs_self.so
> lib\openmpi\mca_crs_blcr.la
> lib\openmpi\mca_crs_blcr.so
>
> However, ompi_info reports:
>
> ompi_info | grep FT
> FT Checkpoint support: yes (checkpoint thread: yes)
>
> ompi_info | grep crs
> MCA crs: none (MCA v2.0, API v2.0, Component v1.6.1)
>
> When I try to take a checkpoint, it fails:
>
> basma_at_basma-Satellite-A500:~$ /OpenMP/openmpi-1.6.1/builddir/bin/mpirun -np 3 -am ft-enable-cr /home/basma/NPB3.3/NPB3.3/NPB3.3-OMP/bin/lu.A
>
>
> NAS Parallel Benchmarks (NPB3.3-OMP) - LU Benchmark
>
> Size: 64x 64x 64
> Iterations: 250
> Number of available threads: 4
>
> NAS Parallel Benchmarks (NPB3.3-OMP) - LU Benchmark
>
> Size: 64x 64x 64
> Iterations: 250
> Number of available threads: 4
>
> NAS Parallel Benchmarks (NPB3.3-OMP) - LU Benchmark
>
> Size: 64x 64x 64
> Iterations: 250
> Number of available threads: 4
>
> Time step 1
> Time step 1
> Time step 1
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 2917 on node basma-Satellite-A500 exited on signal 10 (User defined signal 1).
> --------------------------------------------------------------------------
> basma_at_basma-Satellite-A500:~$
>
> This happened when I ran the following command from a second shell:
> basma_at_basma-Satellite-A500:~$ /OpenMP/openmpi-1.6.1/builddir/bin/ompi-checkpoint 2916
>
> What did I do wrong?
>
> Thank you
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users