Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Ompi-restart failed and process migration
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2012-04-23 15:23:18


On Mon, Apr 23, 2012 at 2:45 PM, kidd <q19860103_at_[hidden]> wrote:
> Hi, thank you for your reply.
>
> I have some problems:
> (1)
> On my platform, all nodes now have the same PATH and LD_LIBRARY_PATH.
> I set them in .bashrc:
> /--------------------------------------------------------------------------------/
> #BLCR
> export PATH=$PATH:/usr/local/BLCR/bin
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/BLCR/lib
> #openMPI
> export PATH=$PATH:/root/kidd_openMPI/bin
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/root/kidd_openMPI/lib
> /-------------------------------------------------------------------------------------------/
> However, when I run mpirun I have to add "-x LD_LIBRARY_PATH", or
> it will not run.
> Example: mpirun -hostfile hosts -np 2 ./TEST .
> Error message ==>
> ./TEST: error while loading shared libraries: libcr.so.0: cannot open shared
> object file: No such file or directory

It sounds like something is still not quite right with your
environment and system setup. If you have set the PATH and
LD_LIBRARY_PATH appropriately on all nodes then you should not have to
pass the "-x LD_LIBRARY_PATH" option to mpirun. Additionally, the
error you are seeing comes from the dynamic loader failing to find
BLCR's libcr.so.0. That seems to indicate that BLCR is not installed
correctly, or is not on the library search path, on all nodes.
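
One way to narrow this down is to test whether the BLCR library directory is really on the search path that a remotely launched process sees. The following is only a minimal sketch: the helper name path_has_dir is made up for illustration, and /usr/local/BLCR/lib is simply the directory from the .bashrc quoted above.

```shell
#!/bin/bash
# path_has_dir LIST DIR
# Succeeds if DIR is one of the colon-separated entries in LIST.
# (Hypothetical helper, for illustration only.)
path_has_dir() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Directory taken from the .bashrc in the quoted message (an assumption
# about where libcr.so.0 lives on every node).
BLCR_LIB=/usr/local/BLCR/lib

if path_has_dir "${LD_LIBRARY_PATH:-}" "$BLCR_LIB"; then
    echo "LD_LIBRARY_PATH contains $BLCR_LIB"
else
    echo "LD_LIBRARY_PATH is missing $BLCR_LIB"
fi
```

To be meaningful, the check should run in a non-interactive shell on each node (for example by feeding the script to ssh) rather than in a login session, since the two may not read the same startup files. Running "ldd ./TEST" on a node also lists any shared libraries the loader cannot resolve there as "not found".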

Some things to look into (in this order):
 1) Make sure that you have BLCR and Open MPI installed in the same
location on all machines.
 2) Make sure that BLCR works on all machines by checkpointing and
restarting a single-process program.
 3) Make sure that Open MPI works on all machines -without-
checkpointing, and without passing the -x option.
 4) Checkpoint/restart an MPI job

>  (2)  Does BLCR need a unified Linux kernel on all the nodes?
>        I have now reset all nodes (using Ubuntu 10.04).

I do not understand what you are trying to ask here. Please rephrase.

>  (3)
>       My program now uses DLLs. I implemented some DLLs, and the
> MPI program calls them.
>       Can ompi checkpoint/restart a program that contains DLLs?

I do not understand what you are trying to ask here. Please rephrase.

-- Josh

> ________________________________
> From: Josh Hursey <jjhursey_at_[hidden]>
> To: Open MPI Users <users_at_[hidden]>
> Sent: 2012/4/23 (Mon) 10:51 PM
> Subject: Re: [OMPI users] Ompi-restart failed and process migration
>
> I wonder if the LD_LIBRARY_PATH is not being set properly upon
> restart. In your mpirun you pass the '-x LD_LIBRARY_PATH'.
> ompi-restart will not pass that variable along for you, so if you are
> using that to set the BLCR path this might be your problem.
>
> A couple solutions:
> - have the PATH and LD_LIBRARY_PATH set the same on all nodes
> - have ompi-restart pass the -x parameter to the underlying mpirun by
> using the -mpirun_opts command line switch:
>   ompi-restart --mpirun_opts "-x LD_LIBRARY_PATH" ...
>
> Yes. ompi-restart will let you checkpoint a process on one node and
> restart it on another. You will have to restart the whole application
> since the ompi-migration operation is not available in the 1.5 series.
>
> -- Josh
>
> On Sat, Apr 21, 2012 at 4:11 AM, kidd <q19860103_at_[hidden]> wrote:
>> Hi all,
>> I have some problems. I want to checkpoint/restart multiple processes
>> on 2 nodes.
>>
>>  My environment:
>>  BLCR = 0.8.4, Open MPI = 1.5.5, OS = Ubuntu 11.04
>> I have 2 nodes:
>>  N05 (master; it has the NFS shared file system) and N07 (slave;
>>  mounts the master node).
>>
>>  My configure command: ./configure --prefix=/root/kidd_openMPI
>>  --with-ft=cr --enable-ft-thread  --with-blcr=/usr/local/BLCR
>>  --with-blcr-libdir=/usr/local/BLCR/lib --enable-mpirun-prefix-by-default
>>  --enable-static --enable-shared --enable-opal-multi-threads;
>>
>>  I also set the following in ~/.openmpi/mca-params.conf:
>>     crs_base_snapshot_dir=/root/kidd_openMPI/Tmp
>>     snapc_base_global_snapshot_dir=/root/kidd_openMPI/checkpoints
>>
>> The directory kidd_openMPI is my NFS shared directory.
>>
>>  My commands:
>>  1. mpicc -o TEST -DDEFSIZE=3000 -DDEF_PROC=2 -fopenmp MPIMatrix.c
>>
>>   2. mpirun -hostfile Hosts -am ft-enable-cr -x LD_LIBRARY_PATH
>>      -np 2 ./TEST .
>>
>>  I can restart process 0 on the master, but restarting process 1 on
>>  N07 failed.
>>
>>  I checked my nodes; prelink is not installed,
>>  so the restart failure must be caused by something else.
>>
>>  Error message -->
>>
>> --------------------------------------------------------------------------
>>   root_at_cuda05:~/kidd_openMPI/checkpoints#
>>  ompi-restart -hostfile Hosts ompi_global_snapshot_2892.ckpt/
>>
>> --------------------------------------------------------------------------
>>     Error: BLCR was not able to restart the process because exec failed.
>>      Check the installation of BLCR on all of the machines in your
>>      system. The following information may be of help:
>>   Return Code : -1
>>   BLCR Restart Command : cr_restart
>>   Restart Command Line : cr_restart
>>  /root/kidd_openMPI/checkpoints/ompi_global_snapshot_2892.ckpt/0/
>>  opal_snapshot_1.ckpt/ompi_blcr_context.2704
>>
>> --------------------------------------------------------------------------
>>
>> --------------------------------------------------------------------------
>>  Error: Unable to obtain the proper restart command to restart from the
>>         checkpoint file (opal_snapshot_1.ckpt). Returned -1.
>>         Check the installation of the blcr checkpoint/restart service
>>         on all of the machines in your system.
>>
>> ###########################################################################
>>  Problem 2: I want to let MPI processes migrate to another node.
>>          If ompi-restart across multiple nodes succeeds,
>>          can I restart on a new node rather than the original node?
>>          Example:
>>          checkpoint (node1, node2, node3), then restart (node1, node3, node4),
>>          or just restart (node1, node3 (2 processes)).
>>
>>    Please help me, thanks.
>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
>

-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey