Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r20562
From: George Bosilca (bosilca_at_[hidden])
Date: 2009-02-16 14:59:40


Josh,

Spending a few minutes to understand the problem would have pointed you
to the real culprit: the tool itself!

The assert in the code states that on finalize there is still a
registered signal handler. A quick gdb session shows that it is the one
for SIGCHLD. Tracking the signal addition in the tool (breakpoint in gdb
on opal_event_queue_insert) clearly highlights the place where this
happens, i.e. orte_wait_init in orte/runtime/orte_wait.c:274. So far
so good: we're right to track SIGCHLD, but we're not supposed
to leave the handler there when we're done (as the signal is registered
with the PERSISTENT option). Leaving ... ah, there is a function to
cleanly unregister it, right next to orte_wait_init, with a very clear
name: orte_wait_finalize. Wonderful, except that in the case of a tool
it is never called. Strange, isn't it, that no other component in the
ompi tree exhibits such behavior? Maybe grep can help ... There we are:

[bosilca_at_dancer ompi]$ find . -name "*.c" -exec grep -Hn orte_wait_finalize {} \;
./orte/mca/ess/hnp/ess_hnp_module.c:486: orte_wait_finalize();
./orte/mca/ess/base/ess_base_std_app.c:222: orte_wait_finalize();
./orte/mca/ess/base/ess_base_std_orted.c:310: orte_wait_finalize();
./orte/runtime/orte_wait.c:280:orte_wait_finalize(void)
./orte/runtime/orte_wait.c:872:orte_wait_finalize(void)
./orte/runtime/orte_wait.c:1182:orte_wait_finalize(void)

This clearly shows that, with the exception of the tools, everybody else
clears their state before leaving. And here we are: a quick patch that
really fixes the problem without removing code that had a really good
reason to be there.

Index: orte/mca/ess/base/ess_base_std_tool.c
===================================================================
--- orte/mca/ess/base/ess_base_std_tool.c (revision 20564)
+++ orte/mca/ess/base/ess_base_std_tool.c (working copy)
@@ -158,6 +158,8 @@

  int orte_ess_base_tool_finalize(void)
  {
+     orte_wait_finalize();
+
      /* if I am a tool, then all I will have done is
       * a very small subset of orte_init - ensure that
       * I only back those elements out
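
To make the failure mode concrete, here is a minimal, self-contained sketch of the init/finalize pairing described above. It is not Open MPI or libevent code: every toy_* name is hypothetical and invented for illustration, and toy_base_free returns an error where the real code asserts.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an event base; NOT the real OPAL/libevent type. */
#define TOY_MAX_SIGNALS 64

typedef struct {
    /* true while a persistent handler is registered for that signal number */
    bool registered[TOY_MAX_SIGNALS];
} toy_event_base;

/* Reset the base so that no signal handler is registered. */
static void toy_base_init(toy_event_base *base)
{
    for (size_t i = 0; i < TOY_MAX_SIGNALS; i++) {
        base->registered[i] = false;
    }
}

/* Register a persistent handler: it stays until explicitly removed. */
static int toy_signal_add(toy_event_base *base, int signo)
{
    if (signo < 0 || signo >= TOY_MAX_SIGNALS) {
        return -1;
    }
    base->registered[signo] = true;
    return 0;
}

/* Remove a previously registered handler. */
static int toy_signal_del(toy_event_base *base, int signo)
{
    if (signo < 0 || signo >= TOY_MAX_SIGNALS) {
        return -1;
    }
    base->registered[signo] = false;
    return 0;
}

/* Count the handlers that are still registered. */
static int toy_pending_signals(const toy_event_base *base)
{
    int count = 0;
    for (size_t i = 0; i < TOY_MAX_SIGNALS; i++) {
        if (base->registered[i]) {
            count++;
        }
    }
    return count;
}

/* Release the base. Returns -1 if a handler is still registered;
 * the real code asserts at the equivalent point instead. */
static int toy_base_free(toy_event_base *base)
{
    return (toy_pending_signals(base) == 0) ? 0 : -1;
}

/* Paired subsystem hooks, playing the roles of
 * orte_wait_init and orte_wait_finalize. */
static int toy_wait_init(toy_event_base *base)
{
    return toy_signal_add(base, SIGCHLD);
}

static int toy_wait_finalize(toy_event_base *base)
{
    return toy_signal_del(base, SIGCHLD);
}
```

In this sketch, calling toy_wait_init and then toy_base_free without toy_wait_finalize in between reports the leftover handler, which is the situation the tool path was in; calling toy_wait_finalize first lets the base be released cleanly.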

   george.

On Feb 16, 2009, at 12:57 , Josh Hursey wrote:

> This commit seems to have broken the tools. If I use orte-ps then on
> finalize I get an abort() with the following stack:
>
> shell$ orte-ps
> ...
> (gdb) bt
> #0 0x00002aaaabcee155 in raise () from /lib64/libc.so.6
> #1 0x00002aaaabcefbf0 in abort () from /lib64/libc.so.6
> #2 0x00002aaaabce75d6 in __assert_fail () from /lib64/libc.so.6
> #3 0x00002aaaaaf734e1 in opal_evsignal_dealloc (base=0x609f50) at signal.c:295
> #4 0x00002aaaaaf73f36 in poll_dealloc (base=0x609f50, arg=0x60a9a0) at poll.c:390
> #5 0x00002aaaaaf70667 in opal_event_base_free (base=0x609f50) at event.c:530
> #6 0x00002aaaaaf70519 in opal_event_fini () at event.c:390
> #7 0x00002aaaaaf5f624 in opal_finalize () at runtime/opal_finalize.c:117
> #8 0x00002aaaaacd4fc4 in orte_finalize () at runtime/orte_finalize.c:84
> #9 0x000000000040196a in main (argc=1, argv=0x7fffffffdf38) at orte-ps.c:275
>
> Any thoughts on why this is happening for only the tools case?
>
> -- Josh
>
> On Feb 14, 2009, at 4:51 PM, bosilca_at_[hidden] wrote:
>
>> Author: bosilca
>> Date: 2009-02-14 16:51:09 EST (Sat, 14 Feb 2009)
>> New Revision: 20562
>> URL: https://svn.open-mpi.org/trac/ompi/changeset/20562
>>
>> Log:
>> Release the default base on finalize.
>>
>> Text files modified:
>> trunk/opal/event/event.c | 4 ++++
>> 1 files changed, 4 insertions(+), 0 deletions(-)
>>
>> Modified: trunk/opal/event/event.c
>> ==============================================================================
>> --- trunk/opal/event/event.c (original)
>> +++ trunk/opal/event/event.c 2009-02-14 16:51:09 EST (Sat, 14 Feb 2009)
>> @@ -386,6 +386,10 @@
>>      if (NULL != opal_event_module_include) {
>>          opal_argv_free(opal_event_module_include);
>>      }
>> +    if( NULL != opal_current_base ) {
>> +        event_base_free(opal_current_base);
>> +        opal_current_base = NULL;
>> +    }
>>      return OPAL_SUCCESS;
>>  }