Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Ompi-restart failed and process migration
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2012-04-23 10:51:32


I wonder if the LD_LIBRARY_PATH is not being set properly upon
restart. In your mpirun you pass the '-x LD_LIBRARY_PATH'.
ompi-restart will not pass that variable along for you, so if you are
using that to set the BLCR path this might be your problem.

A couple of solutions:
 - set PATH and LD_LIBRARY_PATH to the same values on all nodes
 - have ompi-restart pass the -x parameter to the underlying mpirun by
using the --mpirun_opts command line switch:
   ompi-restart --mpirun_opts "-x LD_LIBRARY_PATH" ...
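
For the first option, a rough way to compare nodes is to check on each
one whether cr_restart resolves through PATH and whether a BLCR library
is visible through LD_LIBRARY_PATH. The sketch below is only an
illustration: check_blcr_env is a hypothetical helper (not part of
Open MPI or BLCR), and the libcr.so* file name pattern is an assumption
about how the BLCR library is named on your install.

```shell
#!/bin/sh
# check_blcr_env is a hypothetical helper, not part of Open MPI or BLCR.
check_blcr_env() {
    # Is cr_restart resolvable through PATH?
    if cr_path=$(command -v cr_restart 2>/dev/null); then
        echo "cr_restart: $cr_path"
    else
        echo "cr_restart: not found in PATH"
    fi

    # Does any LD_LIBRARY_PATH directory hold a libcr.so* file?
    # (The libcr.so* name pattern is an assumption.)
    cr_lib=""
    old_ifs=$IFS
    IFS=:
    for dir in $LD_LIBRARY_PATH; do
        for lib in "$dir"/libcr.so*; do
            if [ -e "$lib" ]; then
                cr_lib=$lib
                break 2
            fi
        done
    done
    IFS=$old_ifs

    if [ -n "$cr_lib" ]; then
        echo "libcr: $cr_lib"
    else
        echo "libcr: not found in LD_LIBRARY_PATH"
    fi
}

check_blcr_env
```

Running this on every node and comparing the output should show
whether one of them is missing the BLCR paths.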

As for your second question: yes, ompi-restart will let you checkpoint
a process on one node and restart it on another. You will have to
restart the whole application, since the ompi-migration operation is
not available in the 1.5 series.
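
As a sketch of what restarting on a different set of nodes could look
like: write a new hostfile and hand it to ompi-restart in place of the
original one. The helper name write_hostfile, the file name NewHosts,
and the node names are all hypothetical; the "host slots=N" line format
is the usual Open MPI hostfile syntax. The final command is only
printed, not executed.

```shell
#!/bin/sh
# write_hostfile is a hypothetical helper, not part of Open MPI.
# Usage: write_hostfile FILE HOST SLOTS [HOST SLOTS ...]
write_hostfile() {
    hostfile=$1
    shift
    : > "$hostfile"
    while [ "$#" -ge 2 ]; do
        printf '%s slots=%s\n' "$1" "$2" >> "$hostfile"
        shift 2
    done
}

# Hypothetical target layout: one process each on node1 and node3.
write_hostfile NewHosts node1 1 node3 1

# Dry run: print the restart command instead of executing it.
echo "ompi-restart -hostfile NewHosts ompi_global_snapshot_2892.ckpt"
```

The BLCR paths still have to resolve on whichever nodes end up in the
new hostfile.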

-- Josh

On Sat, Apr 21, 2012 at 4:11 AM, kidd <q19860103_at_[hidden]> wrote:
> Hi all,
> I have some problems. I want to checkpoint/restart multiple processes
> on 2 nodes.
>
> My environment:
> BLCR = 0.8.4, Open MPI = 1.5.5, OS = Ubuntu 11.04
> I have 2 nodes:
>  N05 (master; it hosts the NFS shared file system) and
>  N07 (slave; it mounts the share from the master node).
>
> My configure command:
> ./configure --prefix=/root/kidd_openMPI
> --with-ft=cr --enable-ft-thread  --with-blcr=/usr/local/BLCR
> --with-blcr-libdir=/usr/local/BLCR/lib --enable-mpirun-prefix-by-default
> --enable-static --enable-shared --enable-opal-multi-threads
>
> I have also set the following in ~/.openmpi/mca-params.conf:
>     crs_base_snapshot_dir=/root/kidd_openMPI/Tmp
>     snapc_base_global_snapshot_dir=/root/kidd_openMPI/checkpoints
>
> The directory kidd_openMPI is my NFS shared directory.
>
> My commands:
> 1. mpicc -o TEST -DDEFSIZE=3000 -DDEF_PROC=2 -fopenmp MPIMatrix.c
>
> 2. mpirun -hostfile Hosts -am ft-enable-cr -x LD_LIBRARY_PATH
> -np 2 ./TEST
>
> I can restart process 0 on the master, but restarting process 1 on
> N07 failed.
>
> I checked my nodes: prelink is not installed on them,
> so the restart failure must have some other cause.
>
> Error message:
> --------------------------------------------------------------------------
>  root@cuda05:~/kidd_openMPI/checkpoints#
> ompi-restart -hostfile Hosts ompi_global_snapshot_2892.ckpt/
> --------------------------------------------------------------------------
>    Error: BLCR was not able to restart the process because exec failed.
>     Check the installation of BLCR on all of the machines in your
>    system. The following information may be of help:
>  Return Code : -1
>  BLCR Restart Command : cr_restart
>  Restart Command Line : cr_restart
> /root/kidd_openMPI/checkpoints/ompi_global_snapshot_2892.ckpt/0/
> opal_snapshot_1.ckpt/ompi_blcr_context.2704
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> Error: Unable to obtain the proper restart command to restart from the
>        checkpoint file (opal_snapshot_1.ckpt). Returned -1.
>        Check the installation of the blcr checkpoint/restart service
>        on all of the machines in your system.
> ###########################################################################
> Problem 2: I want to be able to migrate MPI processes to another node.
> Once ompi-restart works across multiple nodes,
> can I restart on a new node rather than on the original node?
> Example:
> checkpoint on (node1, node2, node3), then restart on (node1, node3, node4),
> or just restart on (node1, node3 (2 processes)).
>
>    Please help me, thanks.

-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey