Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Ompi-restart failed and process migration
From: kidd (q19860103_at_[hidden])
Date: 2012-04-23 14:45:05


Hi,

Thank you for your reply. I have some more questions:

(1) On my platform, all nodes now have the same PATH and LD_LIBRARY_PATH.
    I set them in .bashrc:

    /----------------------------------------/
    #BLCR
    export PATH=$PATH:/usr/local/BLCR/bin
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/BLCR/lib
    #openMPI
    export PATH=$PATH:/root/kidd_openMPI/bin
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/root/kidd_openMPI/lib
    /----------------------------------------/

    But when I run mpirun, I still have to add "-x LD_LIBRARY_PATH", or the program will not run.
    Example:
      mpirun -hostfile hosts -np 2 ./TEST
    Error message ==>
      ./TEST: error while loading shared libraries: libcr.so.0: cannot open shared object file: No such file or directory

(2) Does BLCR need the same Linux kernel on all of the nodes?
    I have now reset all nodes (using Ubuntu 10.04).

(3) My program now uses DLLs. I implemented some DLLs, and the MPI program calls them.
    Can Open MPI checkpoint/restart a program that uses DLLs?

________________________________
From: Josh Hursey <jjhursey_at_[hidden]>
To: Open MPI Users <users_at_[hidden]>
Sent: 2012/4/23 (Mon) 10:51 PM
Subject: Re: [OMPI users] Ompi-restart failed and process migration

I wonder if the LD_LIBRARY_PATH is not being set properly upon restart. In your mpirun you pass the '-x LD_LIBRARY_PATH'. ompi-restart will not pass that variable along for you, so if you are using that to set the BLCR path this might be your problem.

A couple solutions:
 - have the PATH and LD_LIBRARY_PATH set the same on all nodes
 - have ompi-restart pass the -x parameter to the underlying mpirun by using the -mpirun_opts command line switch:
     ompi-restart --mpirun_opts "-x LD_LIBRARY_PATH" ...

Yes. ompi-restart will let you checkpoint a process on one node and restart it on another.
You will have to restart the whole application since the ompi-migration operation is not available in the 1.5 series.

-- Josh

On Sat, Apr 21, 2012 at 4:11 AM, kidd <q19860103_at_[hidden]> wrote:
> Hi all,
> I have some problems. I want to checkpoint/restart multiple processes on 2 nodes.
>
> My environment:
>  BLCR = 0.8.4, openMPI = 1.5.5, OS = ubuntu 11.04
> I have 2 nodes:
>  N05 (master, it has the NFS shared file system), N07 (slave,
>  mounts the master node).
>
> My configure line:
>  ./configure --prefix=/root/kidd_openMPI
>  --with-ft=cr --enable-ft-thread --with-blcr=/usr/local/BLCR
>  --with-blcr-libdir=/usr/local/BLCR/lib --enable-mpirun-prefix-by-default
>  --enable-static --enable-shared --enable-opal-multi-threads
>
> I have also set in ~/.openmpi/mca-params.conf:
>     crs_base_snapshot_dir=/root/kidd_openMPI/Tmp
>     snapc_base_global_snapshot_dir=/root/kidd_openMPI/checkpoints
>
> The directory kidd_openMPI is my NFS shared directory.
>
> My commands:
>  1. mpicc -o TEST -DDEFSIZE=3000 -DDEF_PROC=2 -fopenmp MPIMatrix.c
>
>  2. mpirun -hostfile Hosts -am ft-enable-cr -x LD_LIBRARY_PATH
>      -np 2 ./TEST
>
> I can restart process 0 on the master, but restarting process 1 on N07 failed.
>
> I checked my nodes; prelink is not installed,
> so the restart failure is caused by something else.
>
> Error message -->
> --------------------------------------------------------------------------
>  root_at_cuda05:~/kidd_openMPI/checkpoints#
>  ompi-restart -hostfile Hosts ompi_global_snapshot_2892.ckpt/
> --------------------------------------------------------------------------
>     Error: BLCR was not able to restart the process because exec failed.
>      Check the installation of BLCR on all of the machines in your
>      system.
>     The following information may be of help:
>   Return Code : -1
>   BLCR Restart Command : cr_restart
>   Restart Command Line : cr_restart
>  /root/kidd_openMPI/checkpoints/ompi_global_snapshot_2892.ckpt/0/
>  opal_snapshot_1.ckpt/ompi_blcr_context.2704
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
>  Error: Unable to obtain the proper restart command to restart from the
>         checkpoint file (opal_snapshot_1.ckpt). Returned -1.
>         Check the installation of the blcr checkpoint/restart service
>         on all of the machines in your system.
> ###########################################################################
> Problem 2: I want MPI processes to be able to migrate to another node.
>         If ompi-restart across multiple nodes succeeds,
>         can the processes be restarted on a new node rather than the original one?
>         Example:
>         checkpoint (node1, node2, node3), then restart (node1, node3, node4),
>         or just restart (node1, node3 (2 processes)).
>
> Please help me, thanks.
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
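
Editorial sketch: the first suggestion in Josh's reply (the same PATH and LD_LIBRARY_PATH on all nodes) can be checked with a few lines of shell. The snippet below is only an illustration under stated assumptions. The helper name path_contains is hypothetical and is not part of Open MPI or BLCR, and the BLCR library directory is the one from the .bashrc quoted in the thread. It reports whether that directory is present in the LD_LIBRARY_PATH seen by the shell that runs it.

```shell
#!/bin/sh
# Sketch: check whether a directory is listed in a colon-separated search path.
# path_contains is a hypothetical helper name, not an Open MPI or BLCR command.
path_contains() {
    # $1 = directory to look for, $2 = colon-separated path list
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# BLCR library directory as given in the .bashrc quoted in the thread.
BLCR_LIB=/usr/local/BLCR/lib

if path_contains "$BLCR_LIB" "${LD_LIBRARY_PATH:-}"; then
    echo "LD_LIBRARY_PATH contains $BLCR_LIB"
else
    echo "LD_LIBRARY_PATH is missing $BLCR_LIB"
fi
```

Saved under a file name such as check_path.sh (also hypothetical), it could be fed to a non-interactive session on the second node, for example with `ssh N07 'sh -s' < check_path.sh`. That shows the environment a remotely launched process would see, which may differ from an interactive login where .bashrc has been fully read. If the directory is reported missing there, that would be consistent with mpirun needing "-x LD_LIBRARY_PATH" and with ompi-restart needing the -mpirun_opts switch from Josh's reply.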