Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r20562
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-02-16 15:07:41


Never mind -- you just did. Thanks! :-)

On Feb 16, 2009, at 3:07 PM, Jeff Squyres wrote:

> George --
>
> Will you commit?
>
> On Feb 16, 2009, at 2:59 PM, George Bosilca wrote:
>
>> Josh,
>>
>> Spending a few minutes to understand the failure could have pointed
>> you to the real culprit: the tool itself!
>>
>> The assert in the code states that on finalize there is still a
>> registered signal handler. A quick gdb session shows that it is the
>> one for SIGCHLD. Tracking the signal addition in the tool (a
>> breakpoint in gdb on opal_event_queue_insert) clearly highlights
>> where this happens, i.e. orte_wait_init in
>> orte/runtime/orte_wait.c:274. So far so good: we're right to track
>> SIGCHLD, but we're not supposed to leave the handler there when
>> we're done (as the signal is registered with the PERSISTENT
>> option). Leaving ... ah, there is a function to cleanly unregister
>> it, right next to orte_wait_init, with a very clear name:
>> orte_wait_finalize. Wonderful, except that in the case of a tool it
>> is never called. Strange, isn't it, that no other component in the
>> ompi tree exhibits such behavior? Maybe grep can help ... There we
>> are:
>>
>> [bosilca_at_dancer ompi]$ find . -name "*.c" -exec grep -Hn orte_wait_finalize {} \;
>> ./orte/mca/ess/hnp/ess_hnp_module.c:486: orte_wait_finalize();
>> ./orte/mca/ess/base/ess_base_std_app.c:222: orte_wait_finalize();
>> ./orte/mca/ess/base/ess_base_std_orted.c:310: orte_wait_finalize();
>> ./orte/runtime/orte_wait.c:280:orte_wait_finalize(void)
>> ./orte/runtime/orte_wait.c:872:orte_wait_finalize(void)
>> ./orte/runtime/orte_wait.c:1182:orte_wait_finalize(void)
>>
>> This clearly shows that, with the exception of the tools, everybody
>> else clears their state before leaving. And here we are: a quick
>> patch that really fixes the problem without removing code that had
>> a really good reason to be there.
>>
>> Index: orte/mca/ess/base/ess_base_std_tool.c
>> ===================================================================
>> --- orte/mca/ess/base/ess_base_std_tool.c (revision 20564)
>> +++ orte/mca/ess/base/ess_base_std_tool.c (working copy)
>> @@ -158,6 +158,8 @@
>>
>>  int orte_ess_base_tool_finalize(void)
>>  {
>> +    orte_wait_finalize();
>> +
>>      /* if I am a tool, then all I will have done is
>>       * a very small subset of orte_init - ensure that
>>       * I only back those elements out
>>
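The reasoning above can be sketched as a toy model in C. This is not Open MPI code: toy_base_t, toy_wait_init, toy_wait_finalize and toy_base_can_free are hypothetical names, and the counter merely stands in for the signal events that the real event base checks with an assert when it is freed.

```c
#include <stdbool.h>

/* Hypothetical stand-in for an event base: it only counts the
 * persistent signal handlers that are still registered. */
typedef struct {
    int registered_signals;
} toy_base_t;

static toy_base_t toy_base = { 0 };

/* Models orte_wait_init: registers one persistent SIGCHLD handler. */
void toy_wait_init(void)
{
    toy_base.registered_signals++;
}

/* Models orte_wait_finalize: unregisters that handler again.
 * Calling it without a prior init is a harmless no-op. */
void toy_wait_finalize(void)
{
    if (toy_base.registered_signals > 0) {
        toy_base.registered_signals--;
    }
}

/* Models the check made while the base is torn down. The real code
 * asserts; this sketch returns false instead of aborting. */
bool toy_base_can_free(void)
{
    return 0 == toy_base.registered_signals;
}
```

In this model, calling toy_wait_init() and then toy_base_can_free() gives false, which corresponds to the abort seen in orte-ps. Adding toy_wait_finalize() before the check makes it true, which is the effect the one-line patch is meant to have for tools.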
>>
>> george.
>>
>>
>> On Feb 16, 2009, at 12:57 , Josh Hursey wrote:
>>
>>> This commit seems to have broken the tools. If I use orte-ps then
>>> on finalize I get an abort() with the following stack:
>>>
>>> shell$ orte-ps
>>> ...
>>> (gdb) bt
>>> #0 0x00002aaaabcee155 in raise () from /lib64/libc.so.6
>>> #1 0x00002aaaabcefbf0 in abort () from /lib64/libc.so.6
>>> #2 0x00002aaaabce75d6 in __assert_fail () from /lib64/libc.so.6
>>> #3 0x00002aaaaaf734e1 in opal_evsignal_dealloc (base=0x609f50) at signal.c:295
>>> #4 0x00002aaaaaf73f36 in poll_dealloc (base=0x609f50, arg=0x60a9a0) at poll.c:390
>>> #5 0x00002aaaaaf70667 in opal_event_base_free (base=0x609f50) at event.c:530
>>> #6 0x00002aaaaaf70519 in opal_event_fini () at event.c:390
>>> #7 0x00002aaaaaf5f624 in opal_finalize () at runtime/opal_finalize.c:117
>>> #8 0x00002aaaaacd4fc4 in orte_finalize () at runtime/orte_finalize.c:84
>>> #9 0x000000000040196a in main (argc=1, argv=0x7fffffffdf38) at orte-ps.c:275
>>>
>>> Any thoughts on why this is happening for only the tools case?
>>>
>>> -- Josh
>>>
>>> On Feb 14, 2009, at 4:51 PM, bosilca_at_[hidden] wrote:
>>>
>>>> Author: bosilca
>>>> Date: 2009-02-14 16:51:09 EST (Sat, 14 Feb 2009)
>>>> New Revision: 20562
>>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/20562
>>>>
>>>> Log:
>>>> Release the default base on finalize.
>>>>
>>>> Text files modified:
>>>> trunk/opal/event/event.c | 4 ++++
>>>> 1 files changed, 4 insertions(+), 0 deletions(-)
>>>>
>>>> Modified: trunk/opal/event/event.c
>>>> ===================================================================
>>>> --- trunk/opal/event/event.c (original)
>>>> +++ trunk/opal/event/event.c 2009-02-14 16:51:09 EST (Sat, 14 Feb 2009)
>>>> @@ -386,6 +386,10 @@
>>>>      if (NULL != opal_event_module_include) {
>>>>          opal_argv_free(opal_event_module_include);
>>>>      }
>>>> +    if( NULL != opal_current_base ) {
>>>> +        event_base_free(opal_current_base);
>>>> +        opal_current_base = NULL;
>>>> +    }
>>>>      return OPAL_SUCCESS;
>>>>  }
>>>>
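The release-and-reset pattern in this commit can be illustrated in isolation. The following is a minimal sketch, assuming a heap-allocated base; toy_event_base_t, toy_event_init, toy_event_fini and toy_free_count are hypothetical and not part of OPAL.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the default event base. */
typedef struct {
    int unused;
} toy_event_base_t;

static toy_event_base_t *toy_current_base = NULL;
static int toy_free_count = 0; /* counts actual releases, for illustration */

/* Allocate the default base if there is none yet.
 * Returns 0 on success, -1 if the allocation failed. */
int toy_event_init(void)
{
    if (NULL == toy_current_base) {
        toy_current_base = calloc(1, sizeof(*toy_current_base));
    }
    return (NULL != toy_current_base) ? 0 : -1;
}

/* Release the default base and reset the pointer, so that a second
 * call finds NULL and does nothing instead of freeing twice. */
int toy_event_fini(void)
{
    if (NULL != toy_current_base) {
        free(toy_current_base);
        toy_current_base = NULL;
        toy_free_count++;
    }
    return 0;
}
```

Resetting the pointer after the free is what makes a repeated finalize call safe in this sketch: the base is released exactly once no matter how many times toy_event_fini() runs.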
>>>> _______________________________________________
>>>> svn mailing list
>>>> svn_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/svn
>>>
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
> --
> Jeff Squyres
> Cisco Systems

-- 
Jeff Squyres
Cisco Systems