Open MPI logo

Open MPI Development Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Development mailing list

Subject: Re: [OMPI devel] opal_init/finalize counter --> boolean
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-07-08 10:15:43


On Jul 8, 2011, at 7:37 AM, George Bosilca wrote:

> What memory leaks? My valgrind claims there are basically none at the MPI level, so I'm wondering ...

We leak nearly a megabyte, almost all of it from OPAL, when running a tool. If you look at the commit message, this wasn't about an MPI proc.

>
> First, let's make sure everyone understand why there are two initialization functions in OPAL. One has to call opal_init_util before anything else otherwise no access to utilities dealing with the command line, configurations file, output system, help and SOS stuff and so on will be available. Initializing OPAL adds all the other frameworks: the MCA base, memory, backtrace, carto and friends.

Agreed

>
> Now let's look how this is used from the MPI perspective:
>
> ompi_mpi_init.c: 309 -> opal_init_util
>
> ompi_mpi_init.c:357 -> orte_init
>
> orte_init.c:81 -> opal_init
>
> opal_init.c:335 -> opal_init_util
>
> ------------------------------------
>
> ompi_mpi_finalize.c:430 -> orte_finalize
>
> orte_finalize.c:42 -> opal_finalize
>
> opal_finalize.c:163 -> opal_finalize_util
>
> ompi_mpi_finalize.c:434 -> opal_finalize_util
>
> So we have opal_init * 1 and opal_util * 2. Clearly the opal util is not a simple ON/OFF stuff. With Ralph patch the OPAL utilities will disappear as soon as the OMPI layer call orte_fini. Luckily, today there is nothing between the call to orte_fini and opal_finalize_util, so we're safe from a segfault.

The point is that you shouldn't be calling opal_finalize_util separately. We do so now only because of the counter - there is no reason for doing it separately otherwise.

In other words, we created a counter, and then modified the code to make the counter work. There is no reason for it to exist as there is no use of the opal utilities following the call to orte_finalize.

>
> Moreover, from a software engineering point of view there are two choices for allowing library composition (ORTE using OPAL, OMPI using ORTE and OPAL, something else using OMPI and ORTE and OPAL). Either you do the management at the lowest level using counters, or you provide accessors to check the init/fini state of the library and do the management at the upper level (similar to the MPI library). In Open MPI and this for the last 7 years we chose the first approach. And so far there was no compelling case to switch.

Yes there was - we just never checked it. None of the tools were calling opal_finalize multiple times. There was an inherent understanding that calling orte_finalize would shut everything down. This wasn't the case because this hidden counter wasn't getting zero'd, and so opal_finalize never actually executed.

Now imagine there is an abnormal termination. You can't know for sure where it occurs - did we increment the counter already, or not? So how many times do I have to call opal_finalize and opal_finalize_util to get them to actually execute?

The way things sat, I could only loop over opal_finalize and opal_finalize_util until I got back an error indicating it had finally executed. That is plain ugly.

It isn't a big deal, but creates a hidden 'gotcha' that results in some ugly code to compensate if you want to cleanly terminate under all conditions. If you have a compelling case where someone needs to access the opal utils -after- having called orte_finalize or opal_finalize, then I would welcome hearing about it.

>
> george.
>
> On Jul 8, 2011, at 14:42 , Jeff Squyres wrote:
>
>> Developers --
>>
>> Ralph and I talked about this one the other day; he's trying to plug some memory leaks during finalize.
>>
>> Before this commit, there was a counter that counts how many times opal_init*() is invoked. opal_finalize() decremented the counter, but didn't actually do anything else until the counter went to zero. The original intent was that we should call finalize exactly as many times as we call initialize, but it appears that we didn't do that (e.g., we call opal_init_util() and then opal_init(), but then only call opal_finalize() once).
>>
>> Does anyone have a reason NOT to convert the init/finalize counter to a simple boolean? (i.e., can you provide a reason to back this change out?)
>>
>>
>>
>> On Jul 8, 2011, at 2:43 AM, rhc_at_[hidden] wrote:
>>
>>> Author: rhc
>>> Date: 2011-07-08 02:43:19 EDT (Fri, 08 Jul 2011)
>>> New Revision: 24862
>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/24862
>>>
>>> Log:
>>> Replace a useless counter with a boolean check to see if we have already passed thru opal_finalize so we don't call finalize, and then don't pass thru it (as was happening on several tools)
>>>
>>> Text files modified:
>>> trunk/opal/runtime/opal_finalize.c | 16 ++++++----------
>>> trunk/opal/runtime/opal_init.c | 16 ++++++----------
>>> trunk/orte/tools/orterun/orterun.c | 4 ----
>>> 3 files changed, 12 insertions(+), 24 deletions(-)
>>>
>>> Modified: trunk/opal/runtime/opal_finalize.c
>>> ==============================================================================
>>> --- trunk/opal/runtime/opal_finalize.c (original)
>>> +++ trunk/opal/runtime/opal_finalize.c 2011-07-08 02:43:19 EDT (Fri, 08 Jul 2011)
>>> @@ -54,18 +54,16 @@
>>> #include "opal/runtime/opal_cr.h"
>>> #include "opal/mca/crs/base/base.h"
>>>
>>> -extern int opal_initialized;
>>> -extern int opal_util_initialized;
>>> +extern bool opal_initialized;
>>> +extern bool opal_util_initialized;
>>>
>>> int
>>> opal_finalize_util(void)
>>> {
>>> - if( --opal_util_initialized != 0 ) {
>>> - if( opal_util_initialized < 0 ) {
>>> - return OPAL_ERROR;
>>> - }
>>> + if (!opal_util_initialized) {
>>> return OPAL_SUCCESS;
>>> }
>>> + opal_util_initialized = false;
>>>
>>> /* Clear out all the registered MCA params */
>>> mca_base_param_finalize();
>>> @@ -111,12 +109,10 @@
>>> int
>>> opal_finalize(void)
>>> {
>>> - if( --opal_initialized != 0 ) {
>>> - if( opal_initialized < 0 ) {
>>> - return OPAL_ERROR;
>>> - }
>>> + if (!opal_initialized) {
>>> return OPAL_SUCCESS;
>>> }
>>> + opal_initialized = false;
>>>
>>> /* close the checkpoint and restart service */
>>> opal_cr_finalize();
>>>
>>> Modified: trunk/opal/runtime/opal_init.c
>>> ==============================================================================
>>> --- trunk/opal/runtime/opal_init.c (original)
>>> +++ trunk/opal/runtime/opal_init.c 2011-07-08 02:43:19 EDT (Fri, 08 Jul 2011)
>>> @@ -68,8 +68,8 @@
>>> #endif
>>> const char opal_version_string[] = OPAL_IDENT_STRING;
>>>
>>> -int opal_initialized = 0;
>>> -int opal_util_initialized = 0;
>>> +bool opal_initialized = false;
>>> +bool opal_util_initialized = false;
>>> int opal_cache_line_size;
>>>
>>> static int
>>> @@ -219,12 +219,10 @@
>>> int ret;
>>> char *error = NULL;
>>>
>>> - if( ++opal_util_initialized != 1 ) {
>>> - if( opal_util_initialized < 1 ) {
>>> - return OPAL_ERROR;
>>> - }
>>> + if (opal_util_initialized) {
>>> return OPAL_SUCCESS;
>>> }
>>> + opal_util_initialized = true;
>>>
>>> /* JMS See note in runtime/opal.h -- this is temporary; to be
>>> replaced with real hwloc information soon (in trunk/v1.5 and
>>> @@ -324,12 +322,10 @@
>>> int ret;
>>> char *error = NULL;
>>>
>>> - if( ++opal_initialized != 1 ) {
>>> - if( opal_initialized < 1 ) {
>>> - return OPAL_ERROR;
>>> - }
>>> + if (opal_initialized) {
>>> return OPAL_SUCCESS;
>>> }
>>> + opal_initialized = true;
>>>
>>> /* initialize util code */
>>> if (OPAL_SUCCESS != (ret = opal_init_util(pargc, pargv))) {
>>>
>>> Modified: trunk/orte/tools/orterun/orterun.c
>>> ==============================================================================
>>> --- trunk/orte/tools/orterun/orterun.c (original)
>>> +++ trunk/orte/tools/orterun/orterun.c 2011-07-08 02:43:19 EDT (Fri, 08 Jul 2011)
>>> @@ -618,10 +618,6 @@
>>> ORTE_ERROR_LOG(rc);
>>> return rc;
>>> }
>>> - /* finalize the OPAL utils. As they are opened again from orte_init->opal_init
>>> - * we continue to have a reference count on them. So we have to finalize them twice...
>>> - */
>>> - opal_finalize_util();
>>>
>>> /* check for request to report uri */
>>> if (NULL != orterun_globals.report_uri) {
>>> _______________________________________________
>>> svn-full mailing list
>>> svn-full_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/svn-full
>>
>>
>> --
>> Jeff Squyres
>> jsquyres_at_[hidden]
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel