(Sorry for the delay, I missed the C/R question in the mail)
On May 25, 2010, at 9:35 AM, Jeff Squyres wrote:
> On May 24, 2010, at 2:02 PM, Michael E. Thomadakis wrote:
>> | > 2) I have installed BLCR V0.8.2, but when I try to build OMPI and
>> | > point to the full installation it complains it cannot find it.
>> | > Note that I built BLCR with GCC but I am building OMPI with Intel
>> | > compilers (V11.1)
>> | Can you be more specific here?
>> I pointed to the installation path for BLCR but config complained
>> that it couldn't find it. If BLCR is only needed for checkpoint /
>> restart then we can live without it. Is BLCR needed for
>> suspend/resume of MPI jobs?
> You mean suspend with ctrl-Z? If so, correct -- BLCR is *only* used
> for checkpoint/restart. Ctrl-Z just uses the SIGTSTP functionality.
So BLCR is used for the checkpoint/restart functionality in Open MPI.
We have a webpage with some more details and examples at the link below:
You should be able to suspend/resume an Open MPI job using
SIGSTOP/SIGCONT without the C/R functionality. We have an FAQ item
that talks about how to enable this functionality:
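
Independent of the FAQ entry, the underlying signal mechanics can be
sketched in a few lines of Python. This is only an illustration on a
single local child process, not Open MPI code: the helper name
suspend_and_resume and its pause_seconds parameter are made up for
the example, and it assumes a POSIX system where the target is a
child of the calling process.

```python
import os
import signal
import subprocess
import sys
import time


def suspend_and_resume(pid, pause_seconds=0.1):
    """Stop a child process with SIGSTOP, pause, then resume it with SIGCONT.

    Hypothetical helper for illustration only. Returns True if the
    child was observed in the stopped state.
    """
    os.kill(pid, signal.SIGSTOP)
    # WUNTRACED makes waitpid() return when the child stops, not only
    # when it exits. This works only for children of this process.
    _, status = os.waitpid(pid, os.WUNTRACED)
    was_stopped = os.WIFSTOPPED(status)
    time.sleep(pause_seconds)
    os.kill(pid, signal.SIGCONT)
    return was_stopped


if __name__ == "__main__":
    # Stand-in for a long-running job: a child that just sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    try:
        print("stopped:", suspend_and_resume(child.pid))
    finally:
        child.terminate()
        child.wait()
```

For a real MPI job the signal would go to the launcher rather than to
one local process, and whether it is forwarded to the remote ranks
depends on how the job was started, which is what the FAQ item covers.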
You can combine the C/R and the SIGSTOP/SIGCONT functionality so that
when you 'suspend' a job, a checkpoint is taken and the process is
stopped. You can continue the job by sending SIGCONT as normal.
Additionally, if the job needs to be terminated for some reason
(e.g., memory footprint, maintenance), it can be safely terminated
and restarted from the checkpoint. I have an example of how this
works at the link below:
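
Separately from the linked example, the "checkpoint first, then stop"
ordering can be shown with a toy signal handler in Python. This is
not how Open MPI or BLCR implement it; make_suspend_handler, the
checkpoint callback, and the stop parameter are all hypothetical
names used only to show the sequence of events in one process.

```python
import os
import signal


def _stop_self():
    # SIGSTOP cannot be caught, so this really suspends the process
    # until someone sends it SIGCONT.
    os.kill(os.getpid(), signal.SIGSTOP)


def make_suspend_handler(checkpoint, stop=None):
    """Build a handler that saves state and then stops the process.

    `checkpoint` is a user-supplied callable (hypothetical) that writes
    whatever state is needed to restart later. `stop` defaults to
    stopping the current process and can be replaced for testing.
    """
    stop_fn = stop if stop is not None else _stop_self

    def handler(signum, frame):
        checkpoint()  # take the checkpoint while still running
        stop_fn()     # then suspend; execution resumes here on SIGCONT

    return handler


# Example wiring (not executed here):
#   signal.signal(signal.SIGTSTP, make_suspend_handler(save_state))
# where save_state is your own function.
```

Because the checkpoint is written before the process stops, a
suspended job can either be continued with SIGCONT or killed and
later restarted from the saved state.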
As far as C/R integration with schedulers/resource managers, I know
that the BLCR folks have been working with Torque to better integrate
Open MPI+BLCR+Torque. If this is of interest, you might want to check
with them on the progress of that project.