
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] problems with checkpointing an mpi job
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2009-11-06 08:53:56


On Oct 30, 2009, at 1:35 PM, Hui Jin wrote:

> Hi All,
> I ran into a problem when trying to checkpoint an MPI job.
> I would really appreciate it if you could help me fix the problem.
> The BLCR package was installed successfully on the cluster.
> I configured Open MPI with the flags:
> ./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads --
> with-blcr=/usr/local --with-blcr-libdir=/usr/local/lib/
> The installation looks correct. The Open MPI version is 1.3.3
>
> I got the following output when issuing ompi_info:
>
> root_at_hec:/export/home/hjin/test# ompi_info | grep ft
> MCA rml: ftrm (MCA v2.0, API v2.0, Component v1.3.3)
> root_at_hec:/export/home/hjin/test# ompi_info | grep crs
> MCA crs: none (MCA v2.0, API v2.0, Component v1.3.3)
> It seems the MCA crs component is missing, but I have no idea how to get it.

This is an artifact of the way ompi_info searches for components. This
came up before on the users list:
   http://www.open-mpi.org/community/lists/users/2009/09/10667.php

I filed a bug about this, if you want to track its progress:
   https://svn.open-mpi.org/trac/ompi/ticket/2097
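(A quick sanity check, not from the thread: Open MPI installs each component as a `mca_<framework>_<component>` shared object under `$prefix/lib/openmpi/`, so you can confirm the BLCR checkpointer was actually built and installed even though `ompi_info` does not list it. The prefix below is an assumption; adjust it to your `--prefix`.)

```shell
# Assumed install prefix; change to match your configure --prefix.
OMPI_PREFIX=/usr/local

# If the BLCR CRS component was built, a mca_crs_blcr shared object
# should be present in the component directory:
ls "$OMPI_PREFIX/lib/openmpi/" | grep mca_crs
```

If nothing matches, the crs/blcr component was likely not built, which would point back at configure rather than at the runtime.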

>
> To run a checkpointable application, I run:
> mpirun -np 2 --host hec -am ft-enable-cr test_mpi
>
> however, when trying to checkpoint from another terminal on the same
> host, I get the following:
> root_at_hec:~# ompi-checkpoint -v 29234
> [hec:29243] orte_checkpoint: Checkpointing...
> [hec:29243] PID 29234
> [hec:29243] Connected to Mpirun [[46621,0],0]
> [hec:29243] orte_checkpoint: notify_hnp: Contact Head Node Process
> PID 29234
> [hec:29243] orte_checkpoint: notify_hnp: Requested a checkpoint of
> jobid [INVALID]
> [hec:29243] orte_checkpoint: hnp_receiver: Receive a command message.
> [hec:29243] orte_checkpoint: hnp_receiver: Status Update.
> [hec:29243] Requested - Global Snapshot Reference:
> (null)
> [hec:29243] orte_checkpoint: hnp_receiver: Receive a command message.
> [hec:29243] orte_checkpoint: hnp_receiver: Status Update.
> [hec:29243] Pending - Global Snapshot Reference:
> (null)
> [hec:29243] orte_checkpoint: hnp_receiver: Receive a command message.
> [hec:29243] orte_checkpoint: hnp_receiver: Status Update.
> [hec:29243] Running - Global Snapshot Reference:
> (null)
>
> There are some error messages at the terminal of the running
> application:
> --------------------------------------------------------------------------
> Error: The process with PID 29236 is not checkpointable.
> This could be due to one of the following:
> - An application with this PID doesn't currently exist
> - The application with this PID isn't checkpointable
> - The application with this PID isn't an OPAL application.
> We were looking for the named files:
> /tmp/opal_cr_prog_write.29236
> /tmp/opal_cr_prog_read.29236
> --------------------------------------------------------------------------
> [hec:29234] local) Error: Unable to initiate the handshake with peer
> [[46621,1],1]. -1
> [hec:29234] [[46621,0],0] ORTE_ERROR_LOG: Error in file
> snapc_full_global.c at line 567
> [hec:29234] [[46621,0],0] ORTE_ERROR_LOG: Error in file
> snapc_full_global.c at line 1054

This means that either the MPI application did not respond to the
checkpoint request in time, or that the application was not
checkpointable for some other reason.
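(One quick check, my addition rather than part of the thread: the error above lists the named pipes the checkpointer expected to find for the application process. While the job is running, you can verify whether they were created at all, using the PID from the error message.)

```shell
# The checkpoint handshake uses per-PID named pipes under /tmp.
# For the application process from the error (PID 29236), check
# whether both pipes exist while the job is still running:
ls -l /tmp/opal_cr_prog_write.29236 /tmp/opal_cr_prog_read.29236
```

If the pipes are absent, the application process never initialized its checkpoint thread, which is consistent with the handshake failure above.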

Some options to try:
  - Set the 'snapc_full_max_wait_time' MCA parameter to, say, 60; the
default is 20 seconds before giving up. You can also set it to 0,
which tells the runtime to wait indefinitely.
    shell$ mpirun -mca snapc_full_max_wait_time 60
  - Try cleaning out the /tmp directory on all of the nodes; maybe
this has something to do with the disks being full (though usually we
would see other symptoms).
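(Putting the first suggestion together with the launch command from the original post, the run might look like the sketch below. The application name `test_mpi` and host `hec` come from the thread; `<mpirun_pid>` is a placeholder for the PID reported by the launching terminal.)

```shell
# Launch with checkpoint support enabled and a longer checkpoint
# wait time (default is 20 seconds; 0 would mean wait indefinitely):
mpirun -np 2 --host hec -am ft-enable-cr \
       -mca snapc_full_max_wait_time 60 test_mpi

# Then, from another terminal on the same host, checkpoint the
# mpirun process by its PID:
ompi-checkpoint -v <mpirun_pid>
```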

If that doesn't help, can you send me the config.log from your build
of Open MPI? If none of that works, I would suspect that something in
the configure of Open MPI went wrong.

-- Josh

>
> Does anyone have a hint to fix this problem?
>
> Thanks,
> Hui Jin
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users