Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] OPEN-MPI Fault-Tolerance for GASNet
From: Chang IL Yoon (workciyoon_at_[hidden])
Date: 2010-01-19 19:14:15


Dear Josh

First of all, thank you for your continued attention to this issue.

About the problem: even though I followed your suggestion, quoted below,
the checkpoint still did not work.

So append this value to your $HOME/.openmpi/mca-params.conf file
#-----------------
mca_base_param_file_prefix=ft-enable-cr
#-----------------
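
For reference, the append step can be sketched in shell. This is a minimal sketch, assuming a POSIX shell with grep; `append_mca_param` is a hypothetical helper name and not part of Open MPI. It adds the line only when an identical line is not already in the file, so running it twice does not duplicate the setting.

```shell
#!/bin/sh
# Hypothetical helper (not an Open MPI tool): append one MCA parameter
# line to a params file unless an identical line is already present.
append_mca_param() {
    conf_file=$1
    param_line=$2
    # Create the directory and file if they do not exist yet.
    mkdir -p "$(dirname "$conf_file")"
    touch "$conf_file"
    # -x matches the whole line, -F treats the pattern as a fixed string.
    if ! grep -qxF -- "$param_line" "$conf_file"; then
        printf '%s\n' "$param_line" >> "$conf_file"
    fi
}

# Usage with the file and value from this thread (left commented out so
# that sourcing this sketch does not modify anything):
# append_mca_param "$HOME/.openmpi/mca-params.conf" \
#     "mca_base_param_file_prefix=ft-enable-cr"
```

As noted in the quoted message, the file only takes effect on every node if $HOME is mounted on all of the machines.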

Sincerely
Thomas

On Mon, Jan 11, 2010 at 2:21 PM, Josh Hursey <jjhursey_at_[hidden]> wrote:

> (Sorry for the delay in replying. I am still sorting through a backlog of
> holiday email buildup).
>
>
> On Dec 10, 2009, at 7:32 PM, Chang IL Yoon wrote:
>
>> Dear Josh.
>>
>> Thank you for your continued attention to this problem.
>>
>>
>> On Wed, Dec 9, 2009 at 8:40 AM, Josh Hursey <jjhursey_at_[hidden]>
>> wrote:
>>
>> On Dec 3, 2009, at 2:01 PM, Chang IL Yoon wrote:
>>
>> Dear Josh and Paul.
>>
>> First of all, thank you very much for your interest in my problem.
>>
>> 1) I tested it again with MPIRUN_CMD as 'mpirun -am ft-enable-cr -np %N
>> %P'
>> But the checkpoint did not work.
>>
>> Is it giving the same error?
>>
>> Can you send me information on how you configured Open MPI on your system?
>>
>> Yes, it gives the same error.
>>
>> When I was installing open-mpi-1.3.3, I used the following
>> configuration.
>>
>> ./configure --enable-ft-thread --with-ft=cr --enable-mpi-threads
>> --with-blcr={BLCR_DIR} --with-blcr-libdir={BLCR_LIBDIR}
>> --prefix={OPENMPI_DIR}
>>
>> What kind of configuration information do you need?
>>
>
> This looks fine to me.
>
>
>
>> 2) Here is more information on my MPI configuration.
>> - What version of Open MPI are you using?
>> >> I am using Open-MPI ver 1.3.3 with BLCR ver 0.8.2
>>
>> - How did you configure Open MPI?
>> >> ./configure --enable-ft-thread --with-ft=cr --enable-mpi-threads
>> --with-blcr={BLCR_DIR} --with-blcr-libdir={BLCR_LIBDIR}
>> --prefix={OPENMPI_DIR}
>>
>> - What arguments are being passed to 'mpirun' when running with GASNet?
>> >> mpirun -am ft-enable-cr --machinefile ./machinefile -np 1 ./personal
>>
>> The '-np 1' argument is a bit puzzling to me; don't you normally want this
>> to be >1? GASNet does not use any MPI dynamic process management interfaces
>> (e.g., MPI_Comm_spawn), does it?
>>
>> Sorry, I actually do not know whether GASNet uses MPI dynamic process
>> management or not.
>>
>>
> It probably does not (not many applications do), but it could be a problem
> if they do.
>
>
>
>> >> personal is the same program as my-app.c, except that it uses
>> gasnet_init() and gasnet_exit() instead of MPI_Init() and MPI_Finalize().
>> >> my-app.c is in http://osl.iu.edu/research/ft/ompi-cr/examples.php.
>> >> gasnet_init() and gasnet_exit() use MPI_Init() and MPI_Finalize().
>>
>> So you are using the program from the SELF checkpoint example? If Open MPI
>> detects that the application has the appropriate function callbacks to use
>> the SELF CRS (which this example does) then it will -not- use the BLCR
>> component, but instead select the SELF component.
>>
>> Try using a simple counting program instead of that particular example.
>> You could also just remove the opal_crs_self_user_* and my_personal_*
>> functions from the example program to reduce it to one.
>>
>> I'm not sure why the checkpoint would not work even with the SELF CRS.
>> I'll have to check on that.
>>
>> Even when I used a simple counting program, the checkpoint did not
>> work.
>>
>
> Hmm... Everything seems to be set up correctly, and the application is
> still behaving like it is not getting the '-am ft-enable-cr' parameter. The
> only other thing I can think of to try is to set this value in the
> $HOME/.openmpi/mca-params.conf file. It looks a bit different but if you add
> the following line it should work (as long as $HOME is mounted on all of the
> machines).
>
> So append this value to your $HOME/.openmpi/mca-params.conf file and see if
> that helps.
> #-----------------
> mca_base_param_file_prefix=ft-enable-cr
> #-----------------
>
> If that doesn't work, I'll have to think a bit more about what might be
> going wrong here.
>
> -- Josh
>
>
>
>> - Do you have any environment variables/MCA parameters set for Open MPI?
>> >> yes
>> $HOME/.openmpi/mca-params.conf
>> # Local snapshot directory (not used in this scenario)
>> crs_base_snapshot_dir=${HOME}/temp
>>
>> # Remote snapshot directory (globally mounted file system)
>> snapc_base_global_snapshot_dir=${HOME}/checkpoints
>>
>> - My network interconnect is InfiniBand/OpenIB (IP over IB).
>>
>> These all look fine to me.
>>
>>
>>
>> 3) If there is anything I can do to help solve this problem, please let me
>> know without any hesitation.
>>
>> Thank you again for reading.
>>
>> Sincerely
>>
>>
>> On Tue, Dec 1, 2009 at 1:49 PM, Paul H. Hargrove <PHHargrove_at_[hidden]>
>> wrote:
>> Thomas,
>>
>> In connection with Josh's question about mpirun arguments, I suggest you
>> try setting
>> MPIRUN_CMD='mpirun -am ft-enable-cr -np %N %P %A'
>> in your environment before launching the GASNet application. This will
>> instruct GASNet's wrapper around mpirun to include the flag Josh mentioned.
>>
>> -Paul
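
To make the MPIRUN_CMD suggestion concrete, here is a rough shell sketch of how such a template might be expanded into a final command line. It assumes that %N, %P, and %A stand for the process count, the program, and the program's arguments; that reading comes from GASNet's mpi-conduit documentation, not from this thread, so treat it as an assumption. `expand_mpirun_cmd` is a hypothetical helper written only for illustration and is not GASNet's actual wrapper code.

```shell
#!/bin/sh
# Hypothetical illustration of template expansion; the real GASNet
# wrapper may substitute these placeholders differently.
# Limitation: values containing '|' or '&' would confuse sed here.
expand_mpirun_cmd() {
    template=$1   # e.g. 'mpirun -am ft-enable-cr -np %N %P %A'
    nprocs=$2     # assumed meaning of %N: number of processes
    program=$3    # assumed meaning of %P: program to launch
    args=$4       # assumed meaning of %A: arguments for the program
    printf '%s\n' "$template" |
        sed -e "s|%N|$nprocs|g" -e "s|%P|$program|g" -e "s|%A|$args|g"
}

# The setting suggested in the message above:
MPIRUN_CMD='mpirun -am ft-enable-cr -np %N %P %A'
export MPIRUN_CMD

# Show what the launcher would be asked to run for 2 processes.
expand_mpirun_cmd "$MPIRUN_CMD" 2 ./personal "input.dat"
```

Printing the expanded command is a quick way to confirm that '-am ft-enable-cr' actually reaches mpirun, which is the point of the suggestion. The file name input.dat is a placeholder.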
>>
>>
>> Josh Hursey wrote:
>> Thomas,
>>
>> I have not tried to use the checkpoint/restart feature with GASNet over
>> MPI, so I cannot comment directly on how they interact. However, the
>> combination should work as long as the proper arguments (-am ft-enable-cr)
>> are passed along to the mpirun command, and Open MPI is configured properly.
>>
>> The error message that you copied seems to indicate that the local daemon
>> on one of the nodes failed to start a checkpoint of the target application.
>> Often this is caused by one of two things:
>> - Open MPI was not configured with the fault tolerance thread, and the
>> application is waiting for a long time in a computation loop (not entering
>> the MPI library).
>> - The '-am ft-enable-cr' flag was not provided to the mpirun process, so
>> the MPI application did not activate the C/R specific code paths and is
>> therefore denying the request to checkpoint.
>>
>> Can you send me a bit more information:
>> - What version of Open MPI are you using?
>> - How did you configure Open MPI?
>> - What arguments are being passed to 'mpirun' when running with GASNet?
>> - Do you have any environment variables/MCA parameters set for Open MPI?
>>
>> -- Josh
>>
>> On Nov 22, 2009, at 7:13 PM, Thomas CI Yoon wrote:
>>
>> Dear all.
>>
>> Thanks to the developers of fault tolerance in Open MPI, I can use the
>> checkpoint/restart function very well for my MPI applications.
>> But checkpointing does not work for my GASNet applications, which use the
>> MPI conduit.
>> Is there anyone who can help me?
>> I wrote some code with the GASNet API (Global-Address Space Networking:
>> http://gasnet.cs.berkeley.edu/) and used the MPI conduit for my GASNet
>> application, so my program ran well under Open MPI's mpirun. Thus I thought
>> that I could also use the transparent checkpoint/restart function supported
>> by BLCR in Open MPI. Contrary to my expectation, it does not work and shows
>> the following error message.
>> --------------------------------------------------------------------------
>> Error: The process with PID 13896 is not checkpointable.
>> This could be due to one of the following:
>> - An application with this PID doesn't currently exist
>> - The application with this PID isn't checkpointable
>> - The application with this PID isn't an OPAL application.
>> We were looking for the named files:
>> /tmp/opal_cr_prog_write.13896
>> /tmp/opal_cr_prog_read.13896
>> --------------------------------------------------------------------------
>> 1 more process has sent help message help-opal-checkpoint.txt
>> Set MCA parameter "orte_base_help_aggregate" to 0 to see all help
>> 0] 13896) Step 53
>> 0] 15100) Step 53
>> 0] 13896) Step 54
>> 0] 15100) Step 54
>> 0] 13896) Step 55
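
The error message above names two files it looked for under /tmp. A small shell sketch can check whether those files exist for a given PID on the node where the process runs. `check_cr_files` is a hypothetical helper; it only tests for the two file names quoted in the message and makes no claim about how Open MPI creates or uses them.

```shell
#!/bin/sh
# Hypothetical helper: report whether the two files named in the
# "not checkpointable" message exist for a given PID.
# The directory argument defaults to /tmp, as in the message.
check_cr_files() {
    pid=$1
    dir=${2:-/tmp}
    status=0
    for kind in write read; do
        f="$dir/opal_cr_prog_$kind.$pid"
        if [ -e "$f" ]; then
            echo "found:   $f"
        else
            echo "missing: $f"
            status=1
        fi
    done
    # Returns 0 only if both files were found.
    return $status
}

# Usage with the PID from the message (commented out; run it on the
# node that hosts the process):
# check_cr_files 13896
```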
>>
>> In my application, MPI_Initialized() reports that MPI is initialized.
>>
>> Thank you for reading, and have a great day.
>>
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>>
>> --
>> Paul H. Hargrove PHHargrove_at_[hidden]
>> Future Technologies Group Tel: +1-510-495-2352
>> HPC Research Department Fax: +1-510-486-6900
>> Lawrence Berkeley National Laboratory
>>
>>
>>
>>
>