
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25248
From: George Bosilca (bosilca_at_[hidden])
Date: 2011-10-12 13:19:31


On Oct 11, 2011, at 16:56 , Ralph Castain wrote:

> We actually have a number of modules that are allowed to terminate daemons, so it really isn't that big a deal. However, I can agree that this code is unnecessary so long as any code that calls route_lost remembers to also check for daemon termination conditions. I -think- that's the case today, but will check and correct if necessary.

There are __actually__ 3 modules that use orte_quit outside the error managers. Two of the references are in the ess, and they are used in cases where everything is broken and the ess can't figure out how to proceed from there. The last one is in slurm, in the termination code, so this might also be acceptable.

> I'll remove this when I revisit the termination issue in general.

I don't think this issue has to be revisited again. Let's leave it the way it was, which was quite consistent: in other words, without adding extra calls to termination all over the code base. Moreover, in the current design the error managers are the ones deciding how to handle error conditions (including losing connections to other daemons), which means nobody has to check anything related to daemon termination conditions after calling route_lost, as the call will end up in the error manager at some point.
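
A minimal sketch of the division of responsibility described above, as a self-contained toy model. Every name here (toy_daemon_t, toy_routed_route_lost, toy_errmgr_comm_failed) is hypothetical and not part of the ORTE API, and the way the error manager is wired to the routing layer is an assumption made only for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model of a daemon; not an ORTE data structure. */
typedef struct {
    bool   terminating;     /* shutdown has already been requested */
    size_t num_children;    /* child daemons still connected */
    bool   quit_requested;  /* set only by the error manager below */
} toy_daemon_t;

/* Routing layer: updates its own bookkeeping and nothing else.
 * It makes no decision about terminating the daemon. */
static void toy_routed_route_lost(toy_daemon_t *d)
{
    assert(d != NULL);
    if (d->num_children > 0) {
        d->num_children--;
    }
}

/* Error manager: the single place where termination policy lives.
 * Assumption: a lost connection is reported here, and this function
 * is what asks the routing layer to drop the route. */
static void toy_errmgr_comm_failed(toy_daemon_t *d)
{
    assert(d != NULL);
    toy_routed_route_lost(d);
    if (d->terminating && 0 == d->num_children) {
        d->quit_requested = true;  /* stand-in for the real quit path */
    }
}
```

With this layout, callers of the routing function never need to check termination conditions themselves, because the only code that can request a quit is the error manager.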

  Regards,
    george.

>
> On Oct 11, 2011, at 11:25 AM, George Bosilca wrote:
>
>> The second part of this patch is fascinating. Why would a routed be allowed to terminate a daemon? And why such discrimination (in the sense that they are not allowed to shortcut to orte_quit) against all our routed ?
>>
>> Thanks,
>> george.
>>
>> Begin forwarded message:
>>
>>> Modified: trunk/orte/mca/routed/binomial/routed_binomial.c
>>> ==============================================================================
>>> --- trunk/orte/mca/routed/binomial/routed_binomial.c (original)
>>> +++ trunk/orte/mca/routed/binomial/routed_binomial.c 2011-10-10 17:41:49 EDT (Mon, 10 Oct 2011)
>>> @@ -32,6 +32,7 @@
>>> #include "orte/util/nidmap.h"
>>> #include "orte/runtime/orte_globals.h"
>>> #include "orte/runtime/orte_wait.h"
>>> +#include "orte/runtime/orte_quit.h"
>>> #include "orte/runtime/runtime.h"
>>> #include "orte/runtime/data_type_support/orte_dt_support.h"
>>>
>>> @@ -830,11 +831,22 @@
>>> item = opal_list_get_next(item)) {
>>> child = (orte_routed_tree_t*)item;
>>> if (child->vpid == route->vpid) {
>>> + OPAL_OUTPUT_VERBOSE((4, orte_routed_base_output,
>>> + "%s routed_binomial: removing route to child daemon %s",
>>> + ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>>> + ORTE_NAME_PRINT(route)));
>>> opal_list_remove_item(&my_children, item);
>>> OBJ_RELEASE(item);
>>> return ORTE_SUCCESS;
>>> }
>>> }
>>> + /* if we are the HNP or daemon, AND we are terminating,
>>> + * then we want to finalize if all our child daemons
>>> + * have left
>>> + */
>>> + if (orte_terminating && 0 == opal_list_get_size(&my_children)) {
>>> + orte_quit();
>>> + }
>>> }
>>>
>>> /* we don't care about this one, so return success */
>>
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel