Open MPI User's Mailing List Archives

Subject: [OMPI users] problems with checkpointing an mpi job
From: Hui Jin (hjin6_at_[hidden])
Date: 2009-10-30 15:35:35


Hi all,
I ran into a problem while trying to checkpoint an MPI job,
and I would really appreciate any help in fixing it.
The BLCR package was installed successfully on the cluster.
I configured Open MPI with the following flags:
./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads \
    --with-blcr=/usr/local --with-blcr-libdir=/usr/local/lib/
The installation looked correct. The Open MPI version is 1.3.3.

I got the following output when running ompi_info:

root_at_hec:/export/home/hjin/test# ompi_info | grep ft
                 MCA rml: ftrm (MCA v2.0, API v2.0, Component v1.3.3)
root_at_hec:/export/home/hjin/test# ompi_info | grep crs
                 MCA crs: none (MCA v2.0, API v2.0, Component v1.3.3)
It seems the BLCR crs component is missing (only "none" is listed), but I have no idea how to get it.
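
The check above can be scripted. The following is only a sketch: the helper name has_blcr_crs is hypothetical, and it assumes that a build with working BLCR support lists a line of the form "MCA crs: blcr (...)" in the ompi_info output, next to the "none" component shown above.

```shell
# Sketch: succeed only if ompi_info-style text on stdin lists a BLCR crs component.
# Assumption: a BLCR-enabled build reports a line like "MCA crs: blcr (...)".
# has_blcr_crs is a hypothetical helper name, not part of Open MPI.
has_blcr_crs() {
    grep -Eq 'MCA crs: +blcr'
}

# Usage (not run here):
#   ompi_info | has_blcr_crs && echo "BLCR crs component present"
```

With the output shown above, where only "none" is listed, the helper would return a non-zero status.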

To run a checkpointable application, I run:
 mpirun -np 2 --host hec -am ft-enable-cr test_mpi

However, when I try to checkpoint from another terminal on the same host,
I get the following:
root_at_hec:~# ompi-checkpoint -v 29234
[hec:29243] orte_checkpoint: Checkpointing...
[hec:29243] PID 29234
[hec:29243] Connected to Mpirun [[46621,0],0]
[hec:29243] orte_checkpoint: notify_hnp: Contact Head Node Process PID 29234
[hec:29243] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[hec:29243] orte_checkpoint: hnp_receiver: Receive a command message.
[hec:29243] orte_checkpoint: hnp_receiver: Status Update.
[hec:29243] Requested - Global Snapshot Reference: (null)
[hec:29243] orte_checkpoint: hnp_receiver: Receive a command message.
[hec:29243] orte_checkpoint: hnp_receiver: Status Update.
[hec:29243] Pending - Global Snapshot Reference: (null)
[hec:29243] orte_checkpoint: hnp_receiver: Receive a command message.
[hec:29243] orte_checkpoint: hnp_receiver: Status Update.
[hec:29243] Running - Global Snapshot Reference: (null)

An error message also appears on the terminal of the running application:
--------------------------------------------------------------------------
Error: The process with PID 29236 is not checkpointable.
       This could be due to one of the following:
        - An application with this PID doesn't currently exist
        - The application with this PID isn't checkpointable
        - The application with this PID isn't an OPAL application.
       We were looking for the named files:
         /tmp/opal_cr_prog_write.29236
         /tmp/opal_cr_prog_read.29236
--------------------------------------------------------------------------
[hec:29234] local) Error: Unable to initiate the handshake with peer [[46621,1],1]. -1
[hec:29234] [[46621,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 567
[hec:29234] [[46621,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 1054

Does anyone have a hint on how to fix this problem?

Thanks,
Hui Jin