Open MPI User's Mailing List Archives

Subject: [OMPI users] Ompi-restart failed and process migration
From: kidd (q19860103_at_[hidden])
Date: 2012-04-21 04:11:59


Hi all,

I have some problems. I want to checkpoint/restart multiple processes across 2 nodes.

My environment: BLCR = 0.8.4, Open MPI = 1.5.5, OS = Ubuntu 11.04.

I have 2 nodes: N05 (master, which hosts the NFS shared file system) and N07 (slave, which mounts the master's share).

My configure command:

    ./configure --prefix=/root/kidd_openMPI --with-ft=cr --enable-ft-thread --with-blcr=/usr/local/BLCR --with-blcr-libdir=/usr/local/BLCR/lib --enable-mpirun-prefix-by-default --enable-static --enable-shared --enable-opal-multi-threads

I have also set the following in ~/.openmpi/mca-params.conf:

    crs_base_snapshot_dir=/root/kidd_openMPI/Tmp
    snapc_base_global_snapshot_dir=/root/kidd_openMPI/checkpoints

The directory kidd_openMPI is my NFS shared directory.

My commands:

    1. mpicc -o TEST -DDEFSIZE=3000 -DDEF_PROC=2 -fopenmp MPIMatrix.c
    2. mpirun -hostfile Hosts -am ft-enable-cr -x LD_LIBRARY_PATH -np 2 ./TEST

I can restart process 0 on the master, but restarting process 1 on N07 fails. I checked my nodes: prelink is not installed on them, so the restart failure must have some other cause.

Error message:

    root@cuda05:~/kidd_openMPI/checkpoints# ompi-restart -hostfile Hosts ompi_global_snapshot_2892.ckpt/
    --------------------------------------------------------------------------
    Error: BLCR was not able to restart the process because exec failed.
           Check the installation of BLCR on all of the machines in your
           system. The following information may be of help:
      Return Code : -1
      BLCR Restart Command : cr_restart
      Restart Command Line : cr_restart /root/kidd_openMPI/checkpoints/ompi_global_snapshot_2892.ckpt/0/opal_snapshot_1.ckpt/ompi_blcr_context.2704
    --------------------------------------------------------------------------
    --------------------------------------------------------------------------
    Error: Unable to obtain the proper restart command to restart from the
           checkpoint file (opal_snapshot_1.ckpt). Returned -1.
           Check the installation of the blcr checkpoint/restart service
           on all of the machines in your system.

###########################################################################

Problem 2:

I want MPI processes to be able to migrate to another node. If ompi-restart across multiple nodes can be made to work, can the processes be restarted on a new node rather than on the original one?

Example: checkpoint on (node1, node2, node3), then restart on (node1, node3, node4), or restart on just (node1, node3) with 2 processes on node3.

Please help me. Thanks.