Open MPI logo

Open MPI Development Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Development mailing list

Subject: Re: [OMPI devel] opal_init/finalize counter --> boolean
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2011-07-08 12:41:11


I see several main categories of leaks at the current head (I did not test before Ralph's changes):

- a bunch of RTE leaks
  --> which is think is what Ralph is trying to pare down

- OMPI pre-defined attribute leaks
  --> This will take a little thinking to fix; it's complicated

- OMPI pre-defined datatype leaks
  --> George says these can't be fixed, but I don't believe it :-)

- BML cleanup leaks
  --> Not sure what's happening here yet

- PML allocator leaks (ob1, csum, bfo)
  --> it looks like much is allocated (including the allocator) during component_open, but it is only freed during component_fini -- NOT during component_close. So unselected PMLs (e.g., csum, bfo) have leaks. This seems like a mistake -- should that which is freed in component_fini be moved to component_close?

- some miscellaneous leaks
  --> I just committed fixes for these in r24865

On Jul 8, 2011, at 9:37 AM, George Bosilca wrote:

> What memory leaks? My valgrind claims there are basically none at the MPI level, so I'm wondering ...
>
> First, let's make sure everyone understand why there are two initialization functions in OPAL. One has to call opal_init_util before anything else otherwise no access to utilities dealing with the command line, configurations file, output system, help and SOS stuff and so on will be available. Initializing OPAL adds all the other frameworks: the MCA base, memory, backtrace, carto and friends.
>
> Now let's look how this is used from the MPI perspective:
>
> ompi_mpi_init.c: 309 -> opal_init_util
>
> ompi_mpi_init.c:357 -> orte_init
>
> orte_init.c:81 -> opal_init
>
> opal_init.c:335 -> opal_init_util
>
> ------------------------------------
>
> ompi_mpi_finalize.c:430 -> orte_finalize
>
> orte_finalize.c:42 -> opal_finalize
>
> opal_finalize.c:163 -> opal_finalize_util
>
> ompi_mpi_finalize.c:434 -> opal_finalize_util
>
> So we have opal_init * 1 and opal_util * 2. Clearly the opal util is not a simple ON/OFF stuff. With Ralph patch the OPAL utilities will disappear as soon as the OMPI layer call orte_fini. Luckily, today there is nothing between the call to orte_fini and opal_finalize_util, so we're safe from a segfault.
>
> Moreover, from a software engineering point of view there are two choices for allowing library composition (ORTE using OPAL, OMPI using ORTE and OPAL, something else using OMPI and ORTE and OPAL). Either you do the management at the lowest level using counters, or you provide accessors to check the init/fini state of the library and do the management at the upper level (similar to the MPI library). In Open MPI and this for the last 7 years we chose the first approach. And so far there was no compelling case to switch.
>
> george.
>
> On Jul 8, 2011, at 14:42 , Jeff Squyres wrote:
>
>> Developers --
>>
>> Ralph and I talked about this one the other day; he's trying to plug some memory leaks during finalize.
>>
>> Before this commit, there was a counter that counts how many times opal_init*() is invoked. opal_finalize() decremented the counter, but didn't actually do anything else until the counter went to zero. The original intent was that we should call finalize exactly as many times as we call initialize, but it appears that we didn't do that (e.g., we call opal_init_util() and then opal_init(), but then only call opal_finalize() once).
>>
>> Does anyone have a reason NOT to convert the init/finalize counter to a simple boolean? (i.e., can you provide a reason to back this change out?)
>>
>>
>>
>> On Jul 8, 2011, at 2:43 AM, rhc_at_[hidden] wrote:
>>
>>> Author: rhc
>>> Date: 2011-07-08 02:43:19 EDT (Fri, 08 Jul 2011)
>>> New Revision: 24862
>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/24862
>>>
>>> Log:
>>> Replace a useless counter with a boolean check to see if we have already passed thru opal_finalize so we don't call finalize, and then don't pass thru it (as was happening on several tools)
>>>
>>> Text files modified:
>>> trunk/opal/runtime/opal_finalize.c | 16 ++++++----------
>>> trunk/opal/runtime/opal_init.c | 16 ++++++----------
>>> trunk/orte/tools/orterun/orterun.c | 4 ----
>>> 3 files changed, 12 insertions(+), 24 deletions(-)
>>>
>>> Modified: trunk/opal/runtime/opal_finalize.c
>>> ==============================================================================
>>> --- trunk/opal/runtime/opal_finalize.c (original)
>>> +++ trunk/opal/runtime/opal_finalize.c 2011-07-08 02:43:19 EDT (Fri, 08 Jul 2011)
>>> @@ -54,18 +54,16 @@
>>> #include "opal/runtime/opal_cr.h"
>>> #include "opal/mca/crs/base/base.h"
>>>
>>> -extern int opal_initialized;
>>> -extern int opal_util_initialized;
>>> +extern bool opal_initialized;
>>> +extern bool opal_util_initialized;
>>>
>>> int
>>> opal_finalize_util(void)
>>> {
>>> - if( --opal_util_initialized != 0 ) {
>>> - if( opal_util_initialized < 0 ) {
>>> - return OPAL_ERROR;
>>> - }
>>> + if (!opal_util_initialized) {
>>> return OPAL_SUCCESS;
>>> }
>>> + opal_util_initialized = false;
>>>
>>> /* Clear out all the registered MCA params */
>>> mca_base_param_finalize();
>>> @@ -111,12 +109,10 @@
>>> int
>>> opal_finalize(void)
>>> {
>>> - if( --opal_initialized != 0 ) {
>>> - if( opal_initialized < 0 ) {
>>> - return OPAL_ERROR;
>>> - }
>>> + if (!opal_initialized) {
>>> return OPAL_SUCCESS;
>>> }
>>> + opal_initialized = false;
>>>
>>> /* close the checkpoint and restart service */
>>> opal_cr_finalize();
>>>
>>> Modified: trunk/opal/runtime/opal_init.c
>>> ==============================================================================
>>> --- trunk/opal/runtime/opal_init.c (original)
>>> +++ trunk/opal/runtime/opal_init.c 2011-07-08 02:43:19 EDT (Fri, 08 Jul 2011)
>>> @@ -68,8 +68,8 @@
>>> #endif
>>> const char opal_version_string[] = OPAL_IDENT_STRING;
>>>
>>> -int opal_initialized = 0;
>>> -int opal_util_initialized = 0;
>>> +bool opal_initialized = false;
>>> +bool opal_util_initialized = false;
>>> int opal_cache_line_size;
>>>
>>> static int
>>> @@ -219,12 +219,10 @@
>>> int ret;
>>> char *error = NULL;
>>>
>>> - if( ++opal_util_initialized != 1 ) {
>>> - if( opal_util_initialized < 1 ) {
>>> - return OPAL_ERROR;
>>> - }
>>> + if (opal_util_initialized) {
>>> return OPAL_SUCCESS;
>>> }
>>> + opal_util_initialized = true;
>>>
>>> /* JMS See note in runtime/opal.h -- this is temporary; to be
>>> replaced with real hwloc information soon (in trunk/v1.5 and
>>> @@ -324,12 +322,10 @@
>>> int ret;
>>> char *error = NULL;
>>>
>>> - if( ++opal_initialized != 1 ) {
>>> - if( opal_initialized < 1 ) {
>>> - return OPAL_ERROR;
>>> - }
>>> + if (opal_initialized) {
>>> return OPAL_SUCCESS;
>>> }
>>> + opal_initialized = true;
>>>
>>> /* initialize util code */
>>> if (OPAL_SUCCESS != (ret = opal_init_util(pargc, pargv))) {
>>>
>>> Modified: trunk/orte/tools/orterun/orterun.c
>>> ==============================================================================
>>> --- trunk/orte/tools/orterun/orterun.c (original)
>>> +++ trunk/orte/tools/orterun/orterun.c 2011-07-08 02:43:19 EDT (Fri, 08 Jul 2011)
>>> @@ -618,10 +618,6 @@
>>> ORTE_ERROR_LOG(rc);
>>> return rc;
>>> }
>>> - /* finalize the OPAL utils. As they are opened again from orte_init->opal_init
>>> - * we continue to have a reference count on them. So we have to finalize them twice...
>>> - */
>>> - opal_finalize_util();
>>>
>>> /* check for request to report uri */
>>> if (NULL != orterun_globals.report_uri) {
>>> _______________________________________________
>>> svn-full mailing list
>>> svn-full_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/svn-full
>>
>>
>> --
>> Jeff Squyres
>> jsquyres_at_[hidden]
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/