Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] bug in mca framework?
From: Igor Ivanov (igor.ivanov_at_[hidden])
Date: 2013-12-04 09:44:34


On 04.12.2013 17:56, Jeff Squyres (jsquyres) wrote:
> On Dec 4, 2013, at 2:52 AM, Igor Ivanov <Igor.Ivanov_at_[hidden]> wrote:
>
>> It is 'device_param_files', the first string-typed MCA variable from btl/openib. In fact, you can disable it and get the failure on the second one.
>>
>> Description of the case we see:
>> 1. openib MCA variables are registered during startup, as a stage of the component selection phase;
>> 2. but the winner is the cm component, so the openib MCA variables are deregistered as part of their MCA group;
>> 3. the MCA variables are not removed from the global MCA array, but they are marked as invalid and the memory for the string is freed;
>> 4. shmem needs openib for yoda and does bml initialization;
>> 5. openib MCA variables are registered again using light mode: each one is looked up in the global array and its fields are refreshed;
> Can you explain what you mean by step 5? I.e., what does "using light mode" mean? Is the openib component register function invoked again?
That is correct, it is called twice. "Light mode" means that
mca_base_var_register() does not allocate the MCA variable object again;
it looks the variable up in the global array and, on finding it, updates
the fields of the mca_base_var_t structure (at least mbv_storage).
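A minimal sketch of this "find or refresh" behaviour, written as a toy
model rather than the real implementation. Every name here (toy_var_t,
toy_vars, toy_var_register) is hypothetical, and the real
mca_base_var_register() takes many more arguments:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-ins; NOT the real Open MPI structures. */
#define TOY_MAX_VARS 8

typedef struct {
    const char *name;     /* variable name */
    void       *storage;  /* where the component keeps the value */
    int         valid;    /* cleared on deregistration */
} toy_var_t;

static toy_var_t toy_vars[TOY_MAX_VARS];
static int toy_var_count = 0;

/* Returns the index of the variable, or -1 if the table is full. */
int toy_var_register(const char *name, void *storage)
{
    assert(name != NULL);

    for (int i = 0; i < toy_var_count; ++i) {
        if (strcmp(toy_vars[i].name, name) == 0) {
            /* "Light mode": reuse the existing slot and refresh its fields. */
            toy_vars[i].storage = storage;
            toy_vars[i].valid = 1;
            return i;
        }
    }

    if (toy_var_count == TOY_MAX_VARS) {
        return -1;
    }

    /* First registration: take a new slot in the global array. */
    toy_vars[toy_var_count].name = name;
    toy_vars[toy_var_count].storage = storage;
    toy_vars[toy_var_count].valid = 1;
    return toy_var_count++;
}
```

In this model, registering 'device_param_files' a second time returns the
same index and only swaps the storage pointer, which is why the old slot
survives a deregistration.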
>
>> 6. for an unknown reason, bml finalization does not clean up these variables as was done in step 2;
>> 7. mca_btl_openib.so is unloaded;
>> 8. opal_finalize() destroys the MCA variables from the global array, encounters the openib variable, and tries to destroy it using an address that is no longer accessible;
>>
>> So the code under discussion fixes step 6.
> Nathan: it sounds like an MCA var (and entire group) is registered, unregistered, and then registered again. Does the MCA var system get confused here when it tries to unregister the group a 2nd time?
The issue is probably related to incorrectly recognizing whether the
variable is valid or invalid during the second call to
mca_base_var_deregister().
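
To make the sequence in steps 1-8 concrete, here is a toy model of the
lifecycle. It is only a sketch under stated assumptions: all names
(sketch_component_t, sketch_var_t, sketch_run, and so on) are
hypothetical, the "unload" is simulated with a flag, and the function
counts the access that would crash instead of performing it. It does not
reproduce the real MCA var code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy component: stands in for a dlopen'ed .so that owns string storage. */
typedef struct {
    int   loaded;   /* 0 once the shared object has been "unloaded" */
    char *value;    /* string storage owned by the component */
} sketch_component_t;

/* Toy variable entry in the global array. */
typedef struct {
    const char         *name;
    sketch_component_t *owner;
    int                 valid;
} sketch_var_t;

static char *sketch_dup(const char *s)
{
    size_t n = strlen(s) + 1;
    char *p = malloc(n);
    if (p != NULL) {
        memcpy(p, s, n);
    }
    return p;
}

void sketch_register(sketch_var_t *v, const char *name,
                     sketch_component_t *c, const char *def)
{
    assert(!v->valid);          /* toy rule: never register a live var */
    v->name = name;
    v->owner = c;
    c->value = sketch_dup(def);
    v->valid = 1;
}

void sketch_deregister(sketch_var_t *v)
{
    if (!v->valid) {
        return;                 /* already deregistered */
    }
    free(v->owner->value);      /* step 3: string memory is freed */
    v->owner->value = NULL;
    v->valid = 0;               /* entry stays in the array, marked invalid */
}

/* Returns how many valid vars still point into an unloaded component. */
int sketch_finalize(sketch_var_t *vars, int n)
{
    int dangling = 0;
    for (int i = 0; i < n; ++i) {
        if (!vars[i].valid) {
            continue;
        }
        if (!vars[i].owner->loaded) {
            ++dangling;         /* real code would touch a bad address here */
            vars[i].valid = 0;
            continue;
        }
        sketch_deregister(&vars[i]);
    }
    return dangling;
}

int sketch_run(int deregister_before_unload)
{
    sketch_component_t openib = { 1, NULL };
    sketch_var_t var = { NULL, NULL, 0 };

    sketch_register(&var, "device_param_files", &openib, "default"); /* 1 */
    sketch_deregister(&var);                                         /* 2-3 */
    sketch_register(&var, "device_param_files", &openib, "default"); /* 5 */
    if (deregister_before_unload) {
        sketch_deregister(&var);                                     /* 6 */
    }
    /* Step 7: the toy frees the string to mimic the mapping going away. */
    free(openib.value);
    openib.value = NULL;
    openib.loaded = 0;
    return sketch_finalize(&var, 1);                                 /* 8 */
}
```

In this model, sketch_run(0) leaves one variable valid after the
component is gone, which corresponds to step 8, while sketch_run(1)
leaves none, which corresponds to the fix for step 6.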