
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Question on MCA_BASE_METADATA_PARAM_NONE
From: Joshua Hursey (jjhursey_at_[hidden])
Date: 2010-08-23 13:20:55

Configuring with '--with-ft=cr' enables the C/R fault tolerance frameworks, tools, and code paths. One of those code paths is the component selection logic you cited below. When you run an application built against Open MPI and pass the '-am ft-enable-cr' or '-am ft-enable-cr-recovery' option, that logic is activated and selects only those components that have self-identified as 'checkpoint ready'. What 'checkpoint ready' means differs between frameworks: some need to do nothing (e.g., timer), while others require much more work (e.g., BTLs).

Some components have not been verified to work well under C/R scenarios, and they are not selected when you pass the '-am' parameters cited above. The Shared Memory BTL -is- checkpoint ready, and -will- be selected (on the current 1.4, 1.5 and trunk branches). See the code below (Line 94):

The shared memory collective module [also called 'sm'] is a different matter. It is not enabled under normal use because it has not been tested enough (Line 89 in coll_sm_component.c), and for the same reason it is -not- checkpoint ready (line 77):

So shared memory communication support has been available for checkpoint/restart functionality for a couple of years now. The shared memory collective has not matured or been tested enough to be active even under non-C/R circumstances. Once it is ready, we can consider supporting it in C/R-enabled runs.
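
To make the selection behaviour concrete, here is a minimal, self-contained sketch of the same idea. It is not the Open MPI source: the names (sketch_component, SKETCH_PARAM_CHECKPOINT, sketch_filter_components) are made up for illustration, and a plain array stands in for the real component list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical metadata flags, loosely modelled on the
 * MCA_BASE_METADATA_PARAM_* idea. Not the real definitions. */
enum {
    SKETCH_PARAM_NONE       = 0x0,
    SKETCH_PARAM_CHECKPOINT = 0x1
};

/* Hypothetical stand-in for a component descriptor. */
struct sketch_component {
    const char *framework;    /* e.g. "btl" or "coll" */
    const char *name;         /* e.g. "sm" */
    unsigned    param_field;  /* flags the component declares about itself */
};

/*
 * Keep only the components permitted by open_only_flags.
 * Compacts comps[] in place and returns the number of entries kept.
 */
size_t sketch_filter_components(struct sketch_component *comps,
                                size_t count,
                                unsigned open_only_flags)
{
    size_t kept = 0;

    assert(comps != NULL || count == 0);

    for (size_t i = 0; i < count; ++i) {
        /* Filtering applies only when a checkpoint-enabled run was requested. */
        if ((open_only_flags & SKETCH_PARAM_CHECKPOINT) &&
            !(comps[i].param_field & SKETCH_PARAM_CHECKPOINT)) {
            printf("(%s) Component %s is *NOT* Checkpointable - Disabled\n",
                   comps[i].framework, comps[i].name);
            continue;  /* drop this component */
        }
        comps[kept++] = comps[i];  /* keep this component */
    }
    return kept;
}
```

In this sketch, a "btl"/"sm" entry that sets SKETCH_PARAM_CHECKPOINT survives a call with open_only_flags set to SKETCH_PARAM_CHECKPOINT, while a "coll"/"sm" entry that sets SKETCH_PARAM_NONE is dropped. With open_only_flags set to SKETCH_PARAM_NONE nothing is filtered, which corresponds to a run without the '-am' options.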

I hope that clarifies what is going on.

-- Josh

On Aug 23, 2010, at 12:50 PM, <ananda.mudar_at_[hidden]> <ananda.mudar_at_[hidden]> wrote:

> Hi
> In the file “mca_base_components_open.c”, the following code checks which components are checkpointable. If I configure the Open MPI library with the “--enable-cr” option, I was under the assumption that all components would be checkpointable. However, I see that quite a few components are not checkpointable, and that list includes “Shared Memory (sm)”. Do I have to add any other options to the “configure” command so that all components are checkpointable?
>         /*
>          * If the user asked for a checkpoint enabled run
>          * then only load checkpoint enabled components.
>          */
>         if( MCA_BASE_METADATA_PARAM_CHECKPOINT & open_only_flags) {
>             if( MCA_BASE_METADATA_PARAM_CHECKPOINT & dummy->data.param_field) {
>                 opal_output_verbose(10, output_id,
>                                     "mca: base: components_open: "
>                                     "(%s) Component %s is Checkpointable",
>                                     type_name,
>                                     dummy->version.mca_component_name);
>             }
>             else {
>                 opal_output_verbose(10, output_id,
>                                     "mca: base: components_open: "
>                                     "(%s) Component %s is *NOT* Checkpointable - Disabled",
>                                     type_name,
>                                     dummy->version.mca_component_name);
>                 opal_list_remove_item(&components_found, item);
>             }
>         }
>     }
> }
> Thanks
> Ananda
> Ananda B Mudar, PMP
> Senior Technical Architect
> Wipro Technologies

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory