Open MPI Development Mailing List Archives

Subject: [OMPI devel] Some questions about checkpoint/restart (6)
From: Takayuki Seki (seki_at_[hidden])
Date: 2010-03-18 05:09:38


My 6th question is as follows:

(6) About the first_continue_pass static variable in the ft_event functions.

The related frameworks are as follows.

Framework : bml
Component : r2
The source file : ompi/mca/bml/r2/bml_r2_ft.c
The function name : mca_bml_r2_ft_event

Framework : crcp
Component : bkmrk
The source file : ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
The function name : ompi_crcp_bkmrk_pml_ft_event

Framework : pml
Component : ob1
The source file : ompi/mca/pml/ob1/pml_ob1.c
The function name : mca_pml_ob1_ft_event
Component : csum
The source file : ompi/mca/pml/csum/pml_csum.c
The function name : mca_pml_csum_ft_event

I think the first_continue_pass variable exists to identify
whether mca_pml.pml_ft_event(OPAL_CRS_CONTINUE) is being called for the first time
or for the second time in the INC-continue section when ompi_cr_continue_like_restart is true.

When mca_pml.pml_ft_event(OPAL_CRS_CONTINUE) is called for the first time,
by the ompi_cr_coord_pre_continue function, the first_continue_pass variable is true.

When mca_pml.pml_ft_event(OPAL_CRS_CONTINUE) is called for the second time,
by the ompi_cr_coord_post_continue function, the first_continue_pass variable is false.

However, I think there is a problem if ompi_cr_continue_like_restart isn't true.

If ompi_cr_continue_like_restart isn't true, then on odd-numbered checkpoints (the 1st, 3rd, and so on)
the INC-continue section is executed with first_continue_pass set to true.

If ompi_cr_continue_like_restart isn't true, then on even-numbered checkpoints (the 2nd, 4th, and so on)
the INC-continue section is executed with first_continue_pass set to false.

Therefore, mca_pml.pml_ft_event(OPAL_CRS_CONTINUE) is called just once in the INC-continue section
if ompi_cr_continue_like_restart isn't true.
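
To make the sequence above concrete, here is a minimal sketch of the behavior as I understand it.
It is a simplified model, not the Open MPI source. It assumes the flag starts as false and is
flipped on every OPAL_CRS_CONTINUE call, which is what the odd/even pattern implies.
The names model_first_continue_pass, model_ft_event_continue and model_checkpoint are hypothetical.

```c
#include <stdbool.h>

/* Simplified model, not the Open MPI source. Assumption: the flag starts
 * as false and is flipped on every CONTINUE event. */
static bool model_first_continue_pass = false;

/* Hypothetical stand-in for a component's ft_event(OPAL_CRS_CONTINUE).
 * Returns the flag value that the handler body would see on this call. */
static bool model_ft_event_continue(void)
{
    model_first_continue_pass = !model_first_continue_pass;
    return model_first_continue_pass;
}

/* Model the INC-continue section of one checkpoint.
 * With continue_like_restart the handler is called twice (pre-continue and
 * post-continue); otherwise it is called only once.
 * Returns the flag value seen by the last call made for this checkpoint. */
static bool model_checkpoint(bool continue_like_restart)
{
    bool seen = model_ft_event_continue();      /* pre-continue call */
    if (continue_like_restart) {
        seen = model_ft_event_continue();       /* post-continue call */
    }
    return seen;
}
```

Calling model_checkpoint(false) repeatedly returns true, false, true, and so on, so the flag
seen in the INC-continue section depends on how many checkpoints have been taken.
Calling model_checkpoint(true) returns false every time, because the two calls cancel out.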

This behavior is incorrect.
I think that first_continue_pass should always be true if ompi_cr_continue_like_restart isn't true.
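
The behavior I am suggesting could look roughly like the sketch below. This is only an
illustration of the idea, not a patch against the real ft_event functions. The names
fixed_first_continue_pass and fixed_ft_event_continue are hypothetical, and passing
continue_like_restart as a parameter is my simplification.

```c
#include <stdbool.h>

/* Simplified model of the suggested behavior, not a patch against Open MPI.
 * All names here are hypothetical. */
static bool fixed_first_continue_pass = false;

/* Stand-in for ft_event(OPAL_CRS_CONTINUE) with the suggested rule applied.
 * Returns the flag value the handler body would see on this call. */
static bool fixed_ft_event_continue(bool continue_like_restart)
{
    if (continue_like_restart) {
        /* Two calls per checkpoint: flip the flag to tell the first
         * pass from the second one. */
        fixed_first_continue_pass = !fixed_first_continue_pass;
        return fixed_first_continue_pass;
    }
    /* One call per checkpoint: always treat it as the first pass and
     * leave the stored flag untouched. */
    return true;
}
```

With this rule, every checkpoint sees first_continue_pass as true when
ompi_cr_continue_like_restart isn't true, regardless of how many checkpoints came before.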