Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI Checkpoint/Restart is failed
From: Hideyuki Jitsumoto (jitumoto_at_[hidden])
Date: 2010-05-18 23:48:01


Hi Josh,

Thank you for your reply.
I applied the patch from Ticket #2139 to openmpi-1.4.1
and reinstalled everything from scratch.
After that, checkpoint/restart worked correctly.
There was probably a mistake in how I first prepared my environment.

# I cannot reproduce the environment in which the failure occurred.
# I'm very sorry that I cannot identify the true cause of the malfunction
# and cannot send any further information.
# I am now using openmpi-1.4.2, which works well without any patch
# (except for ompi_info).

>> In addition, when I checked the ompi_info output as in your demo movie, I got
>> "MCA crs: none (MCA v2.0, API v2.0, Component v1.4.1)" (open_info.output)
>
> This is actually a known bug with ompi_info. I have a fix in the works for
> it, which should be available soon. Until then, the ticket is linked below:
> https://svn.open-mpi.org/trac/ompi/ticket/2097
Thank you, I'll try it.

On Wed, May 19, 2010 at 3:46 AM, Josh Hursey <jjhursey_at_[hidden]> wrote:
> (Sorry for the delay in replying, more below)
>
> On Apr 12, 2010, at 6:36 AM, Hideyuki Jitsumoto wrote:
>
>> Hi Members,
>>
>> I tried to use checkpoint/restart with Open MPI,
>> but I cannot get correct checkpoint data.
>> I prepared the execution environment as follows. The strings in () are the
>> names of the output files attached to the next e-mail (because of the mail
>> size limit):
>>
>> 1. installed BLCR and checked that it works correctly with "make check"
>> 2. ran ./configure with some parameters in the Open MPI source directory
>> (config.output / config.log)
>> 3. ran make and make install (make.output.2 / install.output.2)
>> 4. confirmed that mca_crs_blcr.[la|so] and mca_crs_self.[la|so] exist in
>> /${INSTALL_DIR}/lib/openmpi
>> 5. created ~/.openmpi/mca-params.conf (mca-params.conf)
>> 6. compiled NPB and ran it with -am ft-enable-cr
>> 7. invoked ompi-checkpoint <MPIRUN_PID>
>>
>> As a result, I got the message "Checkpoint failed: no processes
>> checkpointed."
>> (cr_test_cg)
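
Because ompi_info may report "MCA crs: none" even when the components are installed, the file check in step 4 is worth scripting. The following is only a minimal sketch: check_crs_components is a hypothetical helper name (not an Open MPI tool), it looks only for the .so files, and it assumes the components live under <prefix>/lib/openmpi as in step 4.

```shell
#!/bin/sh
# Sketch: verify that the CRS component files from step 4 are present.
# check_crs_components is a hypothetical helper, not part of Open MPI.
# Argument: the installation prefix (INSTALL_DIR in the steps above).
check_crs_components() {
    libdir="$1/lib/openmpi"
    status=0
    for comp in mca_crs_blcr mca_crs_self; do
        if [ -f "$libdir/$comp.so" ]; then
            echo "found:   $libdir/$comp.so"
        else
            echo "missing: $libdir/$comp.so"
            status=1
        fi
    done
    # Non-zero return means at least one component file is missing.
    return $status
}

# Usage (INSTALL_DIR is whatever prefix was used at install time):
# check_crs_components "$INSTALL_DIR"
```

A non-zero return only says a file is absent. It does not tell whether a component that is present can actually be loaded at run time.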
>
> It is unclear from the output what caused the checkpoint to fail. Can you
> turn on some verbose arguments and send me the output?
>
> Put the following options in your ~/.openmpi/mca-params.conf:
> #---------------
> orte_debug_daemons=1
> snapc_full_verbose=20
> crs_base_verbose=10
> opal_cr_verbose=10
> #---------------
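
The four verbose settings above can be added to the parameter file without duplicating entries that are already there. This is a minimal sketch, not an official procedure: add_cr_debug_params is a hypothetical helper name, and the check for an existing entry is a simple match on the parameter name at the start of a line.

```shell
#!/bin/sh
# Sketch: append the verbose settings to an MCA parameter file.
# add_cr_debug_params is a hypothetical helper, not part of Open MPI.
# Argument: path to the parameter file, e.g. ~/.openmpi/mca-params.conf
add_cr_debug_params() {
    conf="$1"
    mkdir -p "$(dirname "$conf")"
    touch "$conf"
    for param in orte_debug_daemons=1 snapc_full_verbose=20 \
                 crs_base_verbose=10 opal_cr_verbose=10; do
        name="${param%%=*}"
        # Skip parameters that already have an entry in the file.
        if ! grep -q "^${name}[ =]" "$conf"; then
            echo "$param" >> "$conf"
        fi
    done
}

# Usage:
# add_cr_debug_params "$HOME/.openmpi/mca-params.conf"
```

Running it twice leaves the file unchanged the second time, so it is safe to call before each debugging run. An existing entry with a different value is left as it is.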
>
>
>>
>> In addition, when I checked the ompi_info output as in your demo movie, I got
>> "MCA crs: none (MCA v2.0, API v2.0, Component v1.4.1)" (open_info.output)
>
> This is actually a known bug with ompi_info. I have a fix in the works for
> it, which should be available soon. Until then, the ticket is linked below:
>  https://svn.open-mpi.org/trac/ompi/ticket/2097
>
>>
>> What should I do to get checkpointing to work?
>> Any guidance in this regard would be highly appreciated.
>
> Let's see what the verbose output tells us, and go from there. What version
> of BLCR are you using?
>
> -- Josh
>
>>
>> Thank you,
>> Hideyuki
>>
>> --
>> Sincerely Yours,
>> Hideyuki Jitsumoto (jitumoto_at_[hidden])
>> Tokyo Institute of Technology
>> Global Scientific Information and Computing center (Matsuoka Lab.)
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>

-- 
Sincerely Yours,
Hideyuki Jitsumoto (jitumoto_at_[hidden])
Tokyo Institute of Technology
Global Scientific Information and Computing center (Matsuoka Lab.)