Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] OPEN-MPI Fault-Tolerance for GASNet
From: Paul H. Hargrove (PHHargrove_at_[hidden])
Date: 2009-12-01 16:49:07


Thomas,

In connection with Josh's question about mpirun arguments, I suggest you
try setting
     MPIRUN_CMD='mpirun -am ft-enable-cr -np %N %P %A'
in your environment before launching the GASNet application. This will
instruct GASNet's wrapper around mpirun to include the flag Josh mentioned.
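
As a rough sketch of the setting above: the snippet exports the template and previews how it would expand for one run. It assumes the placeholders mean %N = process count, %P = program, and %A = program arguments. The helper expand_mpirun_cmd and the program name ./my_gasnet_app are hypothetical, used only for illustration; they are not part of GASNet or Open MPI, and the real substitution is done by GASNet's wrapper.

```shell
# Template suggested above; GASNet's mpirun wrapper reads it from the environment.
MPIRUN_CMD='mpirun -am ft-enable-cr -np %N %P %A'
export MPIRUN_CMD

# expand_mpirun_cmd is a hypothetical helper, not part of GASNet or Open MPI.
# It only previews the substitution (assumed meanings):
#   %N -> process count, %P -> program, %A -> program arguments.
expand_mpirun_cmd() {
    nprocs=$1
    prog=$2
    shift 2
    printf '%s\n' "$MPIRUN_CMD" |
        sed -e "s|%N|$nprocs|g" -e "s|%P|$prog|g" -e "s|%A|$*|g"
}

# Preview the command line for 2 processes of a hypothetical application.
expand_mpirun_cmd 2 ./my_gasnet_app --steps 100
```

If the previewed command line contains '-am ft-enable-cr', the flag should reach mpirun when the application is launched through the wrapper.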

-Paul

Josh Hursey wrote:
> Thomas,
>
> I have not tried to use the checkpoint/restart feature with GASNet
> over MPI, so I cannot comment directly on how they interact. However,
> the combination should work as long as the proper arguments (-am
> ft-enable-cr) are passed along to the mpirun command, and Open MPI is
> configured properly.
>
> The error message that you copied seems to indicate that the local
> daemon on one of the nodes failed to start a checkpoint of the target
> application. Often this is caused by one of two things:
> - Open MPI was not configured with the fault tolerance thread, and
> the application is waiting for a long time in a computation loop (not
> entering the MPI library).
> - The '-am ft-enable-cr' flag was not provided to the mpirun process,
> so the MPI application did not activate the C/R specific code paths
> and is therefore denying the request to checkpoint.
>
> Can you send me a bit more information:
> - What version of Open MPI are you using?
> - How did you configure Open MPI?
> - What arguments are being passed to 'mpirun' when running with GASNet?
> - Do you have any environment variables/MCA parameters set for Open MPI?
>
> -- Josh
>
> On Nov 22, 2009, at 7:13 PM, Thomas CI Yoon wrote:
>
>> Dear all.
>>
>> Thanks to developers of OPEN-MPI for Fault-Tolerance, I can use the
>> checkpoint/restart function very well for my MPI applications.
>> But its checkpoint does not work for my GASNet applications which use
>> the MPI conduit.
>> Is there anyone who can help me?
>> I wrote some code with GASNet API (Global-Address Space Networking:
>> http://gasnet.cs.berkeley.edu/) and used MPI conduit for my gasnet
>> application, so my program ran well with open-mpirun. Thus I thought
>> that I could also use the transparent checkpoint/restart function
>> supported by BLCR in Open-mpi. Contrary to my expectation, it does not
>> work and shows the following error message.
>> --------------------------------------------------------------------------
>>
>> Error: The process with PID 13896 is not checkpointable.
>> This could be due to one of the following:
>> - An application with this PID doesn't currently exist
>> - The application with this PID isn't checkpointable
>> - The application with this PID isn't an OPAL application.
>> We were looking for the named files:
>> /tmp/opal_cr_prog_write.13896
>> /tmp/opal_cr_prog_read.13896
>> --------------------------------------------------------------------------
>>
>> 1 more process has sent help message help-opal-checkpoint.txt
>> Set MCA parameter "orte_base_help_aggregate" to 0 to see all help
>> 0] 13896) Step 53
>> 0] 15100) Step 53
>> 0] 13896) Step 54
>> 0] 15100) Step 54
>> 0] 13896) Step 55
>>
>> In my application, MPI_Initialized() reports that MPI is initialized.
>>
>> Thank you for reading, and have a great day.
>>
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>

-- 
Paul H. Hargrove                          PHHargrove_at_[hidden]
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory