Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r20562
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-02-16 15:07:16


George --

Will you commit?

On Feb 16, 2009, at 2:59 PM, George Bosilca wrote:

> Josh,
>
> Spending a few minutes to understand the failure would have pointed
> you to the real culprit: the tool itself!
>
> The assert in the code states that on finalize there is still a
> registered signal handler. A quick gdb session shows that this is for
> SIGCHLD. Tracking the signal addition in the tool (breakpoint in
> gdb on opal_event_queue_insert) clearly highlights the place where
> this happens, i.e. orte_wait_init in orte/runtime/orte_wait.c:274.
> So far so good: we're right to track SIGCHLD, but we're not
> supposed to leave the handler there when we're done (as the signal
> is registered with the PERSISTENT option). Leaving ... ah, there is a
> function to cleanly unregister it, right next to orte_wait_init,
> with a very clear name: orte_wait_finalize. Wonderful, except that
> in the case of a tool this is never called. Strange, isn't it, that
> no other component in the ompi tree exhibits such behavior? Maybe
> grep can help ... There we are:
>
> [bosilca_at_dancer ompi]$ find . -name "*.c" -exec grep -Hn orte_wait_finalize {} \;
> ./orte/mca/ess/hnp/ess_hnp_module.c:486: orte_wait_finalize();
> ./orte/mca/ess/base/ess_base_std_app.c:222: orte_wait_finalize();
> ./orte/mca/ess/base/ess_base_std_orted.c:310: orte_wait_finalize();
> ./orte/runtime/orte_wait.c:280:orte_wait_finalize(void)
> ./orte/runtime/orte_wait.c:872:orte_wait_finalize(void)
> ./orte/runtime/orte_wait.c:1182:orte_wait_finalize(void)
>
> This clearly shows that, with the exception of the tools, everybody
> else clears their state before leaving. And here we are: a quick
> patch that really fixes the problem without removing code that had a
> really good reason to be there.
>
> Index: orte/mca/ess/base/ess_base_std_tool.c
> ===================================================================
> --- orte/mca/ess/base/ess_base_std_tool.c (revision 20564)
> +++ orte/mca/ess/base/ess_base_std_tool.c (working copy)
> @@ -158,6 +158,8 @@
>
>  int orte_ess_base_tool_finalize(void)
>  {
> +    orte_wait_finalize();
> +
>      /* if I am a tool, then all I will have done is
>       * a very small subset of orte_init - ensure that
>       * I only back those elements out
>
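The failure mode described above (a persistent SIGCHLD handler registered at init and never removed before the event base is freed) can be modelled in a few lines of C. This is only an illustrative sketch, not the Open MPI code: every `toy_*` name is hypothetical, the "event base" is just an array of flags, and the leak is returned as a count instead of tripping an assert so that both outcomes can be observed.

```c
#include <assert.h>
#include <signal.h>

/* Hypothetical stand-in for an event base: it only remembers
 * which signals still have a handler registered. */
#define TOY_MAX_SIGNALS 64

typedef struct {
    int registered[TOY_MAX_SIGNALS]; /* 1 if a handler is installed */
} toy_base_t;

/* Register a persistent handler for signo (models the signal add). */
void toy_signal_add(toy_base_t *base, int signo)
{
    assert(signo >= 0 && signo < TOY_MAX_SIGNALS);
    base->registered[signo] = 1;
}

/* Unregister the handler for signo (models the signal delete). */
void toy_signal_del(toy_base_t *base, int signo)
{
    assert(signo >= 0 && signo < TOY_MAX_SIGNALS);
    base->registered[signo] = 0;
}

/* Count the handlers that are still registered. The check described
 * in the thread asserts that none remain when the base is freed;
 * here the count is returned instead of aborting. */
int toy_base_leaked(const toy_base_t *base)
{
    int leaked = 0;
    for (int i = 0; i < TOY_MAX_SIGNALS; i++) {
        leaked += base->registered[i];
    }
    return leaked;
}

/* Init/finalize pair, mirroring the roles of orte_wait_init and
 * orte_wait_finalize: one adds the SIGCHLD handler, the other removes it. */
void toy_wait_init(toy_base_t *base)     { toy_signal_add(base, SIGCHLD); }
void toy_wait_finalize(toy_base_t *base) { toy_signal_del(base, SIGCHLD); }

/* Model a tool's lifetime. Returns the number of handlers still
 * registered at the point where the base would be freed. */
int toy_tool_run(int call_wait_finalize)
{
    toy_base_t base = {{0}};

    toy_wait_init(&base);
    if (call_wait_finalize) {
        toy_wait_finalize(&base);
    }
    return toy_base_leaked(&base);
}
```

In this model `toy_tool_run(0)` reports one leaked handler, which corresponds to the assert seen by the tools, while `toy_tool_run(1)` reports none, which corresponds to the patched finalize path.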
>
> george.
>
>
> On Feb 16, 2009, at 12:57 , Josh Hursey wrote:
>
>> This commit seems to have broken the tools. If I use orte-ps then
>> on finalize I get an abort() with the following stack:
>>
>> shell$ orte-ps
>> ...
>> (gdb) bt
>> #0 0x00002aaaabcee155 in raise () from /lib64/libc.so.6
>> #1 0x00002aaaabcefbf0 in abort () from /lib64/libc.so.6
>> #2 0x00002aaaabce75d6 in __assert_fail () from /lib64/libc.so.6
>> #3 0x00002aaaaaf734e1 in opal_evsignal_dealloc (base=0x609f50) at signal.c:295
>> #4 0x00002aaaaaf73f36 in poll_dealloc (base=0x609f50, arg=0x60a9a0) at poll.c:390
>> #5 0x00002aaaaaf70667 in opal_event_base_free (base=0x609f50) at event.c:530
>> #6 0x00002aaaaaf70519 in opal_event_fini () at event.c:390
>> #7 0x00002aaaaaf5f624 in opal_finalize () at runtime/opal_finalize.c:117
>> #8 0x00002aaaaacd4fc4 in orte_finalize () at runtime/orte_finalize.c:84
>> #9 0x000000000040196a in main (argc=1, argv=0x7fffffffdf38) at orte-ps.c:275
>>
>> Any thoughts on why this is happening for only the tools case?
>>
>> -- Josh
>>
>> On Feb 14, 2009, at 4:51 PM, bosilca_at_[hidden] wrote:
>>
>>> Author: bosilca
>>> Date: 2009-02-14 16:51:09 EST (Sat, 14 Feb 2009)
>>> New Revision: 20562
>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/20562
>>>
>>> Log:
>>> Release the default base on finalize.
>>>
>>> Text files modified:
>>> trunk/opal/event/event.c | 4 ++++
>>> 1 files changed, 4 insertions(+), 0 deletions(-)
>>>
>>> Modified: trunk/opal/event/event.c
>>> ===================================================================
>>> --- trunk/opal/event/event.c (original)
>>> +++ trunk/opal/event/event.c 2009-02-14 16:51:09 EST (Sat, 14 Feb 2009)
>>> @@ -386,6 +386,10 @@
>>>      if (NULL != opal_event_module_include) {
>>>          opal_argv_free(opal_event_module_include);
>>>      }
>>> +    if( NULL != opal_current_base ) {
>>> +        event_base_free(opal_current_base);
>>> +        opal_current_base = NULL;
>>> +    }
>>>      return OPAL_SUCCESS;
>>>  }
>>>
>>> _______________________________________________
>>> svn mailing list
>>> svn_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/svn
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>

-- 
Jeff Squyres
Cisco Systems