
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] usage of mca variables in orte-restart
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-03-14 17:57:31


I don't believe we support changing the value of an MCA param on the fly. You'd need to copy it into a global variable at the appropriate level, which you can then change as required.
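
A minimal sketch of the global-variable approach, under stated assumptions: it does not use any Open MPI API, and `demo_do_not_select`, `demo_select()` and `demo_usage()` are hypothetical stand-ins for a framework-level flag such as opal_crs_base_do_not_select and the code that consults it. It only illustrates that a plain global can be flipped between phases of the tool.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a framework-level global flag; the real
 * declaration in Open MPI may differ. The tool starts with selection off. */
static bool demo_do_not_select = true;

/* Counts how often the (hypothetical) selection logic actually ran. */
static int demo_select_runs = 0;

/* Hypothetical select step: returns 1 if selection ran, 0 if it was skipped. */
static int demo_select(void)
{
    if (demo_do_not_select) {
        return 0;               /* selection disabled, nothing to do */
    }
    demo_select_runs++;
    return 1;
}

/* Usage: skip selection early, then re-enable it by writing the global. */
static void demo_usage(void)
{
    assert(demo_select() == 0); /* early startup: selection is skipped */
    demo_do_not_select = false; /* later phase flips the global directly */
    assert(demo_select() == 1); /* now selection runs */
}
```

The design choice here is simply that the flag is ordinary program state after startup, so no environment or parameter lookup is involved when it changes.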

On Mar 14, 2014, at 2:05 PM, Adrian Reber <adrian_at_[hidden]> wrote:

> I am now trying to run orte-restart. As far as I understand it
> orte-restart analyzes the checkpoint metadata and then tries to exec()
> mpirun which then starts opal-restart. During the startup of
> opal-restart (during initialize()) detection of the best CRS module is
> disabled:
>
> /*
>  * Turn off the selection of the CRS component,
>  * we need to do that later
>  */
> (void) mca_base_var_env_name("crs_base_do_not_select", &tmp_env_var);
> opal_setenv(tmp_env_var,
>             "1", /* turn off the selection */
>             true, &environ);
> free(tmp_env_var);
> tmp_env_var = NULL;
>
> This seems to work. Later when actually selecting the correct CRS module
> to restart the checkpointed process the selection is enabled again:
>
> /* Re-enable the selection of the CRS component, so we can choose the right one */
> (void) mca_base_var_env_name("crs_base_do_not_select", &tmp_env_var);
> opal_setenv(tmp_env_var,
>             "0", /* turn on the selection */
>             true, &environ);
> free(tmp_env_var);
> tmp_env_var = NULL;
>
> This does not seem to have any effect, and the reason is fairly clear.
> The MCA variable crs_base_do_not_select is registered during
> opal_crs_base_register() and written to the bool variable
> opal_crs_base_do_not_select only once, at registration. Later,
> opal_crs_base_select() checks this bool to decide whether selection should
> run, and since it is only set during registration it never changes. So,
> given the code flow, this cannot work; it is probably a side effect of one
> of the rewrites since C/R was introduced.
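
The stale-value behaviour described above can be reproduced with a minimal sketch that uses only the C library, not Open MPI. Everything named `sketch_*` and the environment name `SKETCH_CRS_BASE_DO_NOT_SELECT` are hypothetical; the assumption is only that registration copies the environment value into a C variable once and that the select step reads that variable alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical environment name; in Open MPI the real name is produced by
 * mca_base_var_env_name(). */
#define SKETCH_ENV "SKETCH_CRS_BASE_DO_NOT_SELECT"

/* Hypothetical stand-in for opal_crs_base_do_not_select. */
static bool sketch_do_not_select = false;

/* Runs once: copies the current environment value into the C variable. */
static void sketch_register(void)
{
    const char *val = getenv(SKETCH_ENV);
    sketch_do_not_select = (val != NULL && strcmp(val, "1") == 0);
}

/* Consults only the C variable, never the environment. */
static bool sketch_select_would_run(void)
{
    return !sketch_do_not_select;
}

static void sketch_demo(void)
{
    setenv(SKETCH_ENV, "1", 1);          /* turn selection off ...        */
    sketch_register();                   /* ... and snapshot it once      */
    assert(!sketch_select_would_run());

    setenv(SKETCH_ENV, "0", 1);          /* second setenv comes too late  */
    assert(!sketch_select_would_run());  /* the snapshot is unchanged     */
}
```

In this sketch the second `setenv()` changes the environment but nothing re-reads it, which matches the symptom of the flag staying set.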
>
> To fix this I am trying to read the value of the MCA variable
> opal_crs_base_do_not_select during opal_crs_base_select() like this:
>
> idx = mca_base_var_find("opal", "crs", "base", "do_not_select");
> mca_base_var_get_value(idx, &value, NULL, NULL);
>
> This also seems to work, because the value read changes when I change the
> first opal_setenv() during initialize(). The problem I am seeing is that the
> second opal_setenv() (back to 0) is not visible through mca_base_var_get_value().
>
> So my question is: what is the preferred way to read and write MCA
> variables so that they can be accessed from different modules? Is the
> existing code still correct? There is also mca_base_var_set_value(); should
> I use that instead to set 'opal_crs_base_do_not_select'? I was, however, not
> able to use mca_base_var_set_value() without a segfault. There are not many
> uses of mca_base_var_set_value() in the existing code, and none of them
> uses a bool variable.
>
> I also discovered that I can simply access the global C variable 'opal_crs_base_do_not_select'
> from opal-restart.c as well as from opal_crs_base_select(). This also works,
> and it would solve my problem of setting and reading MCA variables.
>
> Adrian
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/03/14347.php