Open MPI User's Mailing List Archives

Subject: [OMPI users] error in checkpointing in open mpi
From: Mallikarjuna Shastry (pmmshastry_at_[hidden])
Date: 2009-09-25 07:10:52


Dear sir,
 
 I am sending the details as follows:
 
 
 1. I am using openmpi-1.3.3 and blcr 0.8.2.
 2. I installed blcr 0.8.2 first, under /root/MS.
 3. Then I installed openmpi 1.3.3 under /root/MS.
 4. I configured and installed Open MPI as follows:
 
 # ./configure --with-ft=cr --enable-mpi-threads \
       --with-blcr=/usr/local/bin \
       --with-blcr-libdir=/usr/local/lib
 # make
 # make install
 
 Then I added the following to .bash_profile in my home
 directory (I went to the home directory with cd ~):
 
  /sbin/insmod /usr/local/lib/blcr/2.6.23.1-42.fc8/blcr_imports.ko
  /sbin/insmod /usr/local/lib/blcr/2.6.23.1-42.fc8/blcr.ko
  PATH=$PATH:/usr/local/bin
  MANPATH=$MANPATH:/usr/local/man
  LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
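
 One detail worth checking in the fragment above: a plain NAME=value
 assignment only reaches child processes such as mpirun if the variable
 is already exported. PATH normally is; LD_LIBRARY_PATH and MANPATH may
 not be. A minimal sketch of the same three settings with an explicit
 export, assuming the /usr/local prefix used above (the insmod lines are
 left out because they need root and are unaffected):

```shell
# Same three settings as above, with explicit export so that child
# processes (mpirun, ompi-checkpoint) inherit them.
# Assumption: BLCR and Open MPI live under /usr/local, as in the paths above.
export PATH="$PATH:/usr/local/bin"

# Keep the leading colon when MANPATH was empty, matching the original line.
export MANPATH="${MANPATH:-}:/usr/local/man"

# Avoid a leading colon here: an empty entry in LD_LIBRARY_PATH means
# "current directory", which is usually not intended.
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}/usr/local/lib"
```

 After logging in again, `env | grep LD_LIBRARY_PATH` should list
 /usr/local/lib if the export took effect.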
 
 Then I compiled and ran the file arr_add.c as follows:
 
 [root_at_localhost examples]# mpicc -o res arr_add.c
 [root_at_localhost examples]# mpirun -np 2 -am ft-enable-cr ./res
 
 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2
 --------------------------------------------------------------------------
 Error: The process with PID 5790 is not checkpointable.
        This could be due to one of the following:
         - An application with this PID doesn't currently exist
         - The application with this PID isn't checkpointable
         - The application with this PID isn't an OPAL application.
        We were looking for the named files:
          /tmp/opal_cr_prog_write.5790
          /tmp/opal_cr_prog_read.5790
 --------------------------------------------------------------------------
 [localhost.localdomain:05788] local) Error: Unable to initiate the handshake with peer [[7788,1],1]. -1
 [localhost.localdomain:05788] [[7788,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 567
 [localhost.localdomain:05788] [[7788,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 1054
 

2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2
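
 The error message above names two files it looked for under /tmp for
 PID 5790. A small sketch for checking whether those files exist for a
 given process while it is still running; the file name pattern is taken
 from the message itself, while the function name check_cr_files and its
 optional directory argument are only illustrative:

```shell
# Check for the per-process checkpoint files named in the error message.
# The opal_cr_prog_{write,read}.<pid> names come from that message; the
# function name and the optional directory argument are hypothetical.
check_cr_files() {
    pid="$1"
    dir="${2:-/tmp}"
    missing=0
    for kind in write read; do
        f="$dir/opal_cr_prog_$kind.$pid"
        if [ -e "$f" ]; then
            echo "found:   $f"
        else
            echo "missing: $f"
            missing=1
        fi
    done
    # Exit status 0 only when both files are present.
    return "$missing"
}
```

 For the run above this would be called as `check_cr_files 5790` from a
 second terminal. It only reports what is on disk and says nothing about
 why a file is absent.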

 
 NOTE: The PID of mpirun is 5788.
 
 I gave the following command to take the checkpoint:
 
 [root_at_localhost examples]# ompi-checkpoint -s 5788
 
 I got the following output, and then the command hung:
 
 [localhost.localdomain:05796]           Requested - Global Snapshot Reference: (null)
 [localhost.localdomain:05796]             Pending - Global Snapshot Reference: (null)
 [localhost.localdomain:05796]             Running - Global Snapshot Reference: (null)
 
 
 
 Kindly help me rectify this.
 
 With regards,
 
 Mallikarjuna Shastry