Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] bug in mca framework?
From: Igor Ivanov (igor.ivanov_at_[hidden])
Date: 2013-12-04 02:52:31


It is the first MCA variable of string type from btl/openib,
'device_param_files'. If you disable it, you get the same failure on the
second one.

Description of the case we see:
1. the openib MCA variables are registered during startup, at the
component selection phase;
2. the cm component wins, so the openib MCA variables are deregistered
as part of their MCA group;
3. the MCA variables are not removed from the global MCA array, but they
are marked as invalid and the memory for their strings is freed;
4. shmem needs openib for yoda and does the bml initialization;
5. the openib MCA variables are registered again in a light mode: each
one finds itself in the global array and refreshes its fields;
6. for an unknown reason, bml finalization does not clean up these
variables the way it is done in step 2;
7. mca_btl_openib.so is unloaded;
8. opal_finalize() destroys the MCA variables from the global array,
encounters the openib variables, and tries to destroy them using an
address that is no longer accessible.

So the code under discussion fixes step 6.
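
To make the lifecycle in steps 1-5 and 8 easier to follow, here is a toy model in C. It is only a sketch and not the Open MPI code: every name in it (toy_var_t, toy_find, toy_register, toy_deregister, toy_finalize) is hypothetical, and the variable name used in the usage note is illustrative. It assumes that a deregistered slot stays in the array marked invalid with its string freed, and that re-registration looks the slot up with "invalid is ok" and refreshes it. It does not reproduce step 7 (the unloaded component), which is what makes the real case crash.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define TOY_MAX_VARS   8
#define TOY_FLAG_VALID 0x1

/* Hypothetical stand-in for one entry of the global variable array. */
typedef struct {
    char *full_name; /* owned by the table, kept until finalize */
    char *value;     /* owned by the table, freed on deregister */
    int   flags;
} toy_var_t;

static toy_var_t toy_vars[TOY_MAX_VARS];
static int toy_var_count = 0;

/* Portable string copy (strdup is not part of ISO C99). */
static char *toy_strdup(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);
    if (NULL != copy) {
        memcpy(copy, s, len);
    }
    return copy;
}

/* Lookup by name. With invalidok == false, deregistered slots are skipped. */
static int toy_find(const char *full_name, bool invalidok)
{
    for (int i = 0; i < toy_var_count; ++i) {
        if (0 == strcmp(toy_vars[i].full_name, full_name) &&
            (invalidok || (toy_vars[i].flags & TOY_FLAG_VALID))) {
            return i;
        }
    }
    return -1;
}

/* Steps 1 and 5: register, reusing an existing (possibly invalid) slot. */
static int toy_register(const char *full_name, const char *value)
{
    int i = toy_find(full_name, true);
    if (i < 0) {
        if (TOY_MAX_VARS == toy_var_count) {
            return -1;
        }
        char *name = toy_strdup(full_name);
        if (NULL == name) {
            return -1;
        }
        i = toy_var_count++;
        toy_vars[i].full_name = name;
        toy_vars[i].value = NULL;
        toy_vars[i].flags = 0;
    }
    free(toy_vars[i].value); /* free(NULL) is a no-op */
    toy_vars[i].value = toy_strdup(value);
    toy_vars[i].flags |= TOY_FLAG_VALID;
    return i;
}

/* Steps 2 and 3: the slot stays in the array, but is marked invalid and
 * its string is released. Resetting the pointer keeps finalize safe. */
static void toy_deregister(int i)
{
    toy_vars[i].flags &= ~TOY_FLAG_VALID;
    free(toy_vars[i].value);
    toy_vars[i].value = NULL;
}

/* Step 8: finalize walks every slot, valid or not. */
static void toy_finalize(void)
{
    for (int i = 0; i < toy_var_count; ++i) {
        free(toy_vars[i].value);
        free(toy_vars[i].full_name);
        toy_vars[i].value = NULL;
        toy_vars[i].full_name = NULL;
    }
    toy_var_count = 0;
}
```

In this model, calling toy_register("btl_openib_device_param_files", ...), then toy_deregister(), then toy_register() again returns the same slot index both times. toy_finalize() is safe here only because every pointer it frees is owned by the table and is reset after being released. In the case above, step 8 instead ends up destroying a variable through an address that is no longer accessible.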

Igor

On 03.12.2013 23:01, Jeff Squyres (jsquyres) wrote:
> I don't think there is one -- you'll need to print it from the debugger.
>
>
> On Dec 3, 2013, at 1:38 PM, Mike Dubman <miked_at_[hidden]> wrote:
>
>> thanks
>> what magic "-mca base_verbose" param should print it?
>>
>>
>> On Tue, Dec 3, 2013 at 6:59 PM, Nathan Hjelm <hjelmn_at_[hidden]> wrote:
>> This usually happens when a string that belongs to the MCA system is freed
>> elsewhere. Can you find out the name of the variable that is being destructed
>> in frame 2?
>>
>> -Nathan Hjelm
>> Application Readiness, HPC-5, LANL
>>
>> On Tue, Dec 03, 2013 at 02:53:29PM +0200, Mike Dubman wrote:
>>> Hi,
>>> We observe a crash during shmem_finalize() (in trunk) with the new MCA
>>> framework.
>>> After investigation, we found that the MCA teardown process can access
>>> previously released memory (reproduced with the oshmem_hello_c.c test).
>>> #0 0x00007fffed3d51d0 in ?? ()
>>> #1 <signal handler called>
>>> #2 0x00007ffff710e21e in var_destructor (var=0x6fa7e0) at
>>> mca_base_var.c:1605
>>> #3 0x00007ffff710ae99 in opal_obj_run_destructors (object=0x6fa7e0) at
>>> ../../../opal/class/opal_object.h:448
>>> #4 0x00007ffff710ca18 in mca_base_var_finalize () at mca_base_var.c:954
>>> #5 0x00007ffff710a7e2 in mca_base_param_finalize () at
>>> mca_base_param.c:643
>>> #6 0x00007ffff70e08e2 in opal_finalize_util () at
>>> runtime/opal_finalize.c:77
>>> #7 0x00007ffff7aa5319 in ompi_mpi_finalize () at
>>> runtime/ompi_mpi_finalize.c:407
>>> #8 0x00007ffff7d900cc in oshmem_shmem_finalize () at
>>> runtime/oshmem_shmem_finalize.c:75
>>> #9 0x00007ffff7d91119 in shmem_finalize () at shmem_finalize.c:24
>>> #10 0x00007ffff7d89b8f in __do_global_dtors_aux () from
>>> /install/lib/libshmem.so.0
>>> #11 0x0000000000000000 in ?? ()
>>> The crash can be resolved by following patch:
>>> diff --git a/opal/mca/base/mca_base_var.c b/opal/mca/base/mca_base_var.c
>>> index 9966627..48028d8 100644
>>> --- a/opal/mca/base/mca_base_var.c
>>> +++ b/opal/mca/base/mca_base_var.c
>>> @@ -773,7 +773,7 @@ static int var_find_by_name (const char *full_name,
>>> int *index, bool invalidok)
>>>
>>> (void) var_get ((int)(uintptr_t) tmp, &var, false);
>>>
>>> - if (invalidok || VAR_IS_VALID(var[0])) {
>>> + if (VAR_IS_VALID(var[0])) {
>>> *index = (int)(uintptr_t) tmp;
>>> return OPAL_SUCCESS;
>>> }
>>> I'm not sure we understand yet why it fixes the problem and what the
>>> race is.
>>> Could someone with knowledge of the MCA flows look at it and comment?
>>> The "invalidok" was introduced by Jeff's commit.
>>> Thanks
>>> M
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>