Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] OPEN-MPI Fault-Tolerance for GASNet
From: Paul H. Hargrove (PHHargrove_at_[hidden])
Date: 2009-12-01 16:49:07


In connection with Josh's question about mpirun arguments, I suggest you
try setting
     MPIRUN_CMD='mpirun -am ft-enable-cr -np %N %P %A'
in your environment before launching the GASNet application. This will
instruct GASNet's wrapper around mpirun to include the flag Josh mentioned.
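
As a rough sketch of what that setting does: the %N, %P and %A fields are placeholders that GASNet's wrapper fills in at launch (assumed here to stand for the process count, the program, and its arguments). The process count, program name, and arguments below are hypothetical, and the sed substitution only imitates the wrapper to preview the resulting command line; it does not launch anything.

```shell
# Template from the message above. %N, %P and %A are placeholders
# (assumed: process count, program, program arguments).
export MPIRUN_CMD='mpirun -am ft-enable-cr -np %N %P %A'

# Hypothetical values, for illustration only.
nprocs=2
prog=./my_gasnet_app
args='--steps 100'

# Imitate the wrapper's substitution to preview the final command line.
expanded=$(printf '%s\n' "$MPIRUN_CMD" |
    sed -e "s|%N|$nprocs|" -e "s|%P|$prog|" -e "s|%A|$args|")

echo "$expanded"
# prints: mpirun -am ft-enable-cr -np 2 ./my_gasnet_app --steps 100
```

If the previewed line does not contain '-am ft-enable-cr', the flag is not reaching mpirun.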


Josh Hursey wrote:
> Thomas,
> I have not tried to use the checkpoint/restart feature with GASNet
> over MPI, so I cannot comment directly on how they interact. However,
> the combination should work as long as the proper arguments (-am
> ft-enable-cr) are passed along to the mpirun command, and Open MPI is
> configured properly.
> The error message that you copied seems to indicate that the local
> daemon on one of the nodes failed to start a checkpoint of the target
> application. Often this is caused by one of two things:
> - Open MPI was not configured with the fault tolerance thread, and
> the application is waiting for a long time in a computation loop (not
> entering the MPI library).
> - The '-am ft-enable-cr' flag was not provided to the mpirun process,
> so the MPI application did not activate the C/R specific code paths
> and is therefore denying the request to checkpoint.
> Can you send me a bit more information:
> - What version of Open MPI are you using?
> - How did you configure Open MPI?
> - What arguments are being passed to 'mpirun' when running with GASNet?
> - Do you have any environment variables/MCA parameters set for Open MPI?
> -- Josh
> On Nov 22, 2009, at 7:13 PM, Thomas CI Yoon wrote:
>> Dear all.
>> Thanks to developers of OPEN-MPI for Fault-Tolerance, I can use the
>> checkpoint/restart function very well for my MPI applications.
>> But checkpointing does not work for my GASNet applications, which use
>> the MPI conduit.
>> Is there anyone who can help me?
>> I wrote some code with GASNet API (Global-Address Space Networking:
>> and used MPI conduit for my gasnet
>> application, so my program ran well with open-mpirun. Thus I thought
>> that I could also use the transparent checkpoint/restart function
>> supported by BLCR in Open MPI. Contrary to my expectation, it does not
>> work and shows the following error message.
>> --------------------------------------------------------------------------
>> Error: The process with PID 13896 is not checkpointable.
>> This could be due to one of the following:
>> - An application with this PID doesn't currently exist
>> - The application with this PID isn't checkpointable
>> - The application with this PID isn't an OPAL application.
>> We were looking for the named files:
>> /tmp/opal_cr_prog_write.13896
>> /tmp/opal_cr_prog_read.13896
>> --------------------------------------------------------------------------
>> 1 more process has sent help message help-opal-checkpoint.txt
>> Set MCA parameter "orte_base_help_aggregate" to 0 to see all help
>> 0] 13896) Step 53
>> 0] 15100) Step 53
>> 0] 13896) Step 54
>> 0] 15100) Step 54
>> 0] 13896) Step 55
>> In my application, MPI_Initialized() reports that MPI is initialized.
>> Thank you for reading, and have a great day.
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
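
To help answer the questions in Josh's quoted message about environment variables, MCA parameters, and the Open MPI version, something like the following could gather the relevant settings in one place. This is only a sketch: report_ompi_env is a hypothetical helper name, it relies on MCA parameters being settable through OMPI_MCA_-prefixed environment variables, and the ompi_info labels it greps for ('Open MPI:' and 'FT Checkpoint') are assumed and may differ between versions.

```shell
# Hypothetical helper: print Open MPI-related settings visible in this shell.
report_ompi_env() {
    # MCA parameters supplied through the environment use the OMPI_MCA_ prefix.
    env | grep '^OMPI_MCA_' || echo "no OMPI_MCA_* variables set"

    # The template used by GASNet's mpirun wrapper, if one is set.
    echo "MPIRUN_CMD=${MPIRUN_CMD:-<unset>}"

    # Version and checkpoint support lines, only if ompi_info is on PATH.
    # The grep patterns are assumptions about ompi_info's output labels.
    if command -v ompi_info >/dev/null 2>&1; then
        ompi_info | grep -i -e 'Open MPI:' -e 'FT Checkpoint' || true
    fi
}

report_ompi_env
```

Running it on the node where the GASNet application is launched shows whether the wrapper template and any MCA settings are actually present in that environment.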

Paul H. Hargrove                          PHHargrove_at_[hidden]
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory