Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Some Questions on Building OMPI on Linux Em64t
From: Michael E. Thomadakis (miket7777_at_[hidden])
Date: 2010-05-26 15:21:16


Hi Josh

Thanks for the reply. Please see below.

On 05/26/10 09:24, Josh Hursey wrote:
> (Sorry for the delay, I missed the C/R question in the mail)
>
> On May 25, 2010, at 9:35 AM, Jeff Squyres wrote:
>
>> On May 24, 2010, at 2:02 PM, Michael E. Thomadakis wrote:
>>
>>> | > 2) I have installed blcr V0.8.2 but when I try to build OMPI and
>>> | > I point to the full installation it complains it cannot find it.
>>> | > Note that I built BLCR with GCC but I am building OMPI with Intel
>>> | > compilers (V11.1)
>>> |
>>> | Can you be more specific here?
>>>
>>> I pointed to the installation path for BLCR but config complained
>>> that it couldn't find it. If BLCR is only needed for checkpoint /
>>> restart then we can live without it. Is BLCR needed for
>>> suspend/resume of MPI jobs?
>>
>> You mean suspend with ctrl-Z? If so, correct -- BLCR is *only* used
>> for checkpoint/restart. Ctrl-Z just uses the SIGTSTP functionality.
>
> So BLCR is used for the checkpoint/restart functionality in Open MPI.
> We have a webpage with some more details and examples at the link below:
> http://osl.iu.edu/research/ft/ompi-cr/
>
> You should be able to suspend/resume an Open MPI job using
> SIGSTOP/SIGCONT without the C/R functionality. We have an FAQ item that
> talks about how to enable this functionality:
> http://www.open-mpi.org/faq/?category=running#suspend-resume
>
> You can combine the C/R and the SIGSTOP/SIGCONT functionality so that
> when you 'suspend' a job a checkpoint is taken and the process is
> stopped. You can continue the job by sending SIGCONT as normal.
> Additionally, this way if the job needs to be terminated for some
> reason (e.g., memory footprint, maintenance), it can be safely
> terminated and restarted from the checkpoint. I have an example of how
> this works at the link below:
> http://osl.iu.edu/research/ft/ompi-cr/examples.php#uc-ckpt-stop
>
> As far as C/R integration with schedulers/resource managers, I know
> that the BLCR folks have been working with Torque to better integrate
> Open MPI+BLCR+Torque. If this is of interest, you might want to check
> with them on the progress of that project.
>
So suspend/resume of Open MPI jobs does not require BLCR. OK, so I will
proceed without it.
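
For the record, here is a minimal sketch of the plain SIGSTOP/SIGCONT
mechanism, with no BLCR involved. It is only an illustration on a
stand-in child process, not an Open MPI job; the helper names (suspend,
resume, demo) are made up for this example. For a real job, my
understanding of the FAQ item above is that the signal goes to mpirun,
which has to be told to forward job-control signals to the ranks, so
please treat that part as an assumption and check the FAQ.

```python
import os
import signal
import subprocess
import sys


def suspend(pid: int) -> None:
    """Stop a process. SIGSTOP cannot be caught or ignored by the target."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid: int) -> None:
    """Let a previously stopped process continue."""
    os.kill(pid, signal.SIGCONT)


def demo() -> bool:
    """Suspend and resume a stand-in child; return True if it really stopped."""
    # Stand-in for a long-running job (hypothetical, not an MPI program).
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    try:
        suspend(child.pid)
        # WUNTRACED makes waitpid report a stopped (not only exited) child.
        _, status = os.waitpid(child.pid, os.WUNTRACED)
        stopped = os.WIFSTOPPED(status)
        resume(child.pid)
        return stopped
    finally:
        # Clean up the stand-in process whatever happened above.
        child.kill()
        child.wait()


if __name__ == "__main__":
    print("child was stopped:", demo())
```

This only shows that stopping and continuing needs nothing beyond
ordinary POSIX signals; checkpointing to disk is a separate feature.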

best regards,

Michael

> -- Josh

-- 
% -------------------------------------------------------------------- \
% Michael E. Thomadakis, Ph.D.  Senior Lead Supercomputer Engineer/Res \
% E-mail: miket AT tamu DOT edu                   Texas A&M University \
% web:    http://alphamike.tamu.edu              Supercomputing Center \
% Voice:  979-862-3931                    Teague Research Center, 104B \
% FAX:    979-847-8643                  College Station, TX 77843, USA \
% -------------------------------------------------------------------- \