It might be worthwhile to run this through valgrind and see if something is being freed incorrectly...?
On Dec 16, 2013, at 12:11 PM, Nathan Hjelm <hjelmn_at_[hidden]> wrote:
> I took a look at the stacktraces last week and could not identify where the bug
> is. I will dig deeper this week and see if I can come up with the correct fix.
> On Mon, Dec 09, 2013 at 03:17:36PM +0200, Mike Dubman wrote:
>> Could you please comment on Igor's observations?
>> On Wed, Dec 4, 2013 at 4:44 PM, Igor Ivanov <igor.ivanov_at_[hidden]>
>> On 04.12.2013 17:56, Jeff Squyres (jsquyres) wrote:
>> On Dec 4, 2013, at 2:52 AM, Igor Ivanov <Igor.Ivanov_at_[hidden]>
>> It is the first string-typed mca variable from btl/openib,
>> 'device_param_files'. In fact, you can disable it and then get the
>> failure on the second one.
>> Description of the case we see:
>> 1. openib mca variables are registered during startup, at the
>> component selection phase;
>> 2. but the winner is the cm component, and the openib mca variables
>> are deregistered as part of their mca group;
>> 3. the mca variables are not removed from the global mca array, but
>> they are marked as invalid and the memory for the string is freed;
>> 4. shmem needs openib for yoda and does bml initialization;
>> 5. openib mca variables are registered again using "light mode":
>> each one is looked up in the global array and its fields are
>> refreshed;
>> Can you explain what you mean by step 5? I.e., what does "using light
>> mode" mean? Is the openib component register function invoked again?
>> That is correct, it is called twice. "Light mode" means that
>> mca_base_var_register() does not allocate the mca variable object
>> again; it looks the variable up in the global array and, on finding
>> it, updates fields in the mca_base_var_t structure (at least
>> mbv_storage).
>> 6. for an unknown reason, bml finalization does not clean up these
>> variables the way it is done in step 2;
>> 7. mca_btl_openib.so is unloaded;
>> 8. opal_finalize() destroys the mca variables from the global array,
>> encounters openib's variable, and tries to destroy it using an
>> address that is no longer accessible;
>> So the code under discussion fixes step 6.
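The eight steps above can be modeled with a small, self-contained toy in C. This is only a sketch of the sequence as described, not Open MPI code: every `toy_*` name, the fixed-size array, and the placeholder value are assumptions made for illustration. The toy never dereferences the stale pointer; it only counts the entries that a finalize pass would still try to free through the component's storage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Toy model only: every toy_* name is hypothetical, not an Open MPI API. */
typedef struct {
    const char *name;
    char **storage;  /* address of the owning component's string pointer */
    bool valid;
} toy_var_t;

#define TOY_MAX_VARS 8
static toy_var_t toy_vars[TOY_MAX_VARS];
static int toy_nvars = 0;

static char *toy_dup(const char *s)
{
    char *copy = malloc(strlen(s) + 1);
    assert(copy != NULL);
    strcpy(copy, s);
    return copy;
}

static int toy_find(const char *name)
{
    for (int i = 0; i < toy_nvars; i++) {
        if (strcmp(toy_vars[i].name, name) == 0) {
            return i;
        }
    }
    return -1;
}

/* Returns true when an existing slot was reused ("light mode"). */
static bool toy_register(const char *name, char **storage, const char *value)
{
    int i = toy_find(name);
    bool reused = (i >= 0);
    if (!reused) {
        assert(toy_nvars < TOY_MAX_VARS);
        i = toy_nvars++;
        toy_vars[i].name = name;
    }
    toy_vars[i].storage = storage;  /* refreshed on every registration */
    toy_vars[i].valid = true;
    *storage = toy_dup(value);
    return reused;
}

/* Marks the slot invalid and frees the string; the slot itself stays. */
static bool toy_deregister(const char *name)
{
    int i = toy_find(name);
    if (i < 0 || !toy_vars[i].valid) {
        return false;
    }
    free(*toy_vars[i].storage);
    *toy_vars[i].storage = NULL;
    toy_vars[i].valid = false;
    return true;
}

/* Counts slots a finalize pass would still try to free through storage. */
static int toy_count_valid(void)
{
    int n = 0;
    for (int i = 0; i < toy_nvars; i++) {
        n += toy_vars[i].valid ? 1 : 0;
    }
    return n;
}

static int toy_scenario(bool deregister_after_bml)
{
    char *component_string = NULL;  /* stands in for a static inside the .so */
    toy_nvars = 0;

    toy_register("device_param_files", &component_string, "placeholder"); /* step 1 */
    toy_deregister("device_param_files");                                 /* steps 2-3 */
    toy_register("device_param_files", &component_string, "placeholder"); /* steps 4-5 */
    if (deregister_after_bml) {
        toy_deregister("device_param_files");  /* the cleanup missing in step 6 */
    }
    int stale = toy_count_valid();  /* what finalize would meet after step 7 */

    free(component_string);  /* toy-only cleanup; NULL if already deregistered */
    toy_nvars = 0;
    return stale;
}
```

In this model, `toy_scenario(false)` returns 1: one entry is still marked valid and points into memory that would be gone once the component is unloaded. `toy_scenario(true)` returns 0, which corresponds to performing the cleanup that step 6 skips.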
>> Nathan: it sounds like an MCA var (and entire group) is registered,
>> unregistered, and then registered again. Does the MCA var system get
>> confused here when it tries to unregister the group a 2nd time?
>> The issue is probably related to incorrectly recognizing whether the
>> variable is valid or invalid during the second call of
>> mca_base_var_deregister().
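One way this hypothesis could play out is sketched below. It is purely illustrative and does not reflect the real `mca_base_var_t` layout or the actual behavior of `mca_base_var_deregister()`: the `demo_*` names, the flag value, and the `restore_valid_flag` switch are all assumptions. The point is only that a deregister guarded by a validity flag becomes a silent no-op the second time if re-registration refreshes the fields without restoring that flag.

```c
#include <stdbool.h>

/* Hypothetical flag and record; not the real Open MPI structures. */
#define DEMO_FLAG_VALID 0x1u

typedef struct {
    unsigned flags;
    int release_count;  /* how many times the value was released */
} demo_var_t;

/* Re-registration that may or may not restore the validity flag. */
static void demo_reregister(demo_var_t *var, bool restore_valid_flag)
{
    if (restore_valid_flag) {
        var->flags |= DEMO_FLAG_VALID;
    }
}

/* Releases the value only when the variable is recognized as valid. */
static bool demo_deregister(demo_var_t *var)
{
    if ((var->flags & DEMO_FLAG_VALID) == 0) {
        return false;  /* treated as already deregistered */
    }
    var->flags &= ~DEMO_FLAG_VALID;
    var->release_count++;
    return true;
}

/* Returns 1 if the second deregister actually released the value. */
static int demo_second_deregister(bool restore_valid_flag)
{
    demo_var_t var = { DEMO_FLAG_VALID, 0 };  /* state after first register */

    demo_deregister(&var);                       /* first deregister */
    demo_reregister(&var, restore_valid_flag);   /* second registration */
    return demo_deregister(&var) ? 1 : 0;        /* second deregister */
}
```

Here `demo_second_deregister(true)` returns 1 and `demo_second_deregister(false)` returns 0. Whether the real code behaves like the second case is exactly what remains to be confirmed.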
>> devel mailing list
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/