Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] bug in mca framework?
From: Nathan Hjelm (hjelmn_at_[hidden])
Date: 2013-12-16 12:43:25


On Mon, Dec 16, 2013 at 05:21:05PM +0000, Joshua Ladd wrote:
> After speaking with Igor Ivanov about this this morning, he summarized his findings as follows:
>
> 1. Valgrind comes up clean.

That's good to hear, but unfortunate, since this really looks like a
stomping-on-memory problem.

> 2. The issue is not reproduced with a static build.

This is a red herring. The variable itself contains garbage. The
mbv_storage pointer looked like it was on the stack, the name was not
valid, etc. Not sure how we got an mca_base_var_t into that state, since
the only time we touch anything in them is in
mca_base_var_finalize. That function cleans up all of the state, so two
calls to it should be harmless.
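
To illustrate why a cleanup routine that resets all of its state tolerates
being called twice, here is a minimal sketch in C. It is a toy model, not the
Open MPI source: toy_var_t, toy_var_init and toy_var_finalize are hypothetical
names, and the thread does not show what mca_base_var_finalize actually does
internally.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a registered variable; NOT the real mca_base_var_t. */
typedef struct {
    char  *name;     /* heap-allocated copy of the variable name */
    char **storage;  /* where the owning component keeps the string value */
    int    valid;    /* nonzero while the variable is registered */
} toy_var_t;

/* Fill in a variable. Returns 0 on success, -1 if the name cannot be copied. */
static int toy_var_init(toy_var_t *var, const char *name, char **storage)
{
    size_t len = strlen(name) + 1;

    var->name = malloc(len);
    if (NULL == var->name) {
        return -1;
    }
    memcpy(var->name, name, len);
    var->storage = storage;
    var->valid = 1;
    return 0;
}

/* Release everything and reset every field, so a second call is a no-op. */
static void toy_var_finalize(toy_var_t *var)
{
    assert(NULL != var);
    if (NULL != var->storage) {
        free(*var->storage);   /* free the string value */
        *var->storage = NULL;  /* the owner no longer sees a stale pointer */
        var->storage = NULL;   /* and neither do we */
    }
    free(var->name);           /* free(NULL) is defined to do nothing */
    var->name = NULL;
    var->valid = 0;
}

/* Returns 1 if finalizing twice leaves the variable fully reset. */
static int toy_double_finalize_is_safe(void)
{
    char *value = malloc(4);
    toy_var_t var;

    if (NULL == value) {
        return 0;
    }
    memcpy(value, "abc", 4);
    if (0 != toy_var_init(&var, "device_param_files", &value)) {
        free(value);
        return 0;
    }
    toy_var_finalize(&var);
    toy_var_finalize(&var);    /* second call only sees NULL and zero fields */
    return NULL == value && NULL == var.name && 0 == var.valid;
}
```

The property only holds while the structure itself is intact. If the fields
already contain garbage, as described above, resetting them on the first call
cannot protect the second.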

> 3. A bisection study reveals that problems first appear after commit:
> https://svn.open-mpi.org/trac/ompi/changeset/28800/trunk/opal/mca/base/mca_base_var.c

Possibly also a coincidence. That commit only 1) moves the group stuff
into its own file, and 2) adds the mca_base_pvar interface. It's possible
I messed something up in the rest of the code, but unlikely. I will take
another look though.

-Nathan

>
>
> Josh
>
> -----Original Message-----
> From: devel [mailto:devel-bounces_at_[hidden]] On Behalf Of Jeff Squyres (jsquyres)
> Sent: Monday, December 16, 2013 12:15 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] bug in mca framework?
>
> It might be worthwhile to run this through valgrind and see if something is being freed incorrectly...?
>
>
> On Dec 16, 2013, at 12:11 PM, Nathan Hjelm <hjelmn_at_[hidden]> wrote:
>
> > I took a look at the stacktraces last week and could not identify
> > where the bug is. I will dig deeper this week and see if I can come up with the correct fix.
> >
> > -Nathan
> >
> > On Mon, Dec 09, 2013 at 03:17:36PM +0200, Mike Dubman wrote:
> >> Nathan,
> >> Could you please comment on Igor's observations?
> >> Thanks
> >>
> >> On Wed, Dec 4, 2013 at 4:44 PM, Igor Ivanov <igor.ivanov_at_[hidden]>
> >> wrote:
> >>
> >> On 04.12.2013 17:56, Jeff Squyres (jsquyres) wrote:
> >>
> >> On Dec 4, 2013, at 2:52 AM, Igor Ivanov <Igor.Ivanov_at_[hidden]>
> >> wrote:
> >>
> >> It is the first string-typed mca variable from btl/openib,
> >> 'device_param_files'. Actually, you can disable it and get the failure
> >> on the second one.
> >>
> >> Description of the case we see:
> >> 1. openib mca variables are registered during startup, at the
> >> component selection phase;
> >> 2. but the winner is the cm component, and the openib mca variables
> >> are deregistered as part of their mca group;
> >> 3. the mca variables are not removed from the global mca array, but
> >> they are marked as invalid and the memory for the string is freed;
> >> 4. shmem needs openib for yoda and does bml initialization;
> >> 5. openib mca variables are registered again using "light mode": each
> >> variable is looked up in the global array and its fields are
> >> refreshed;
> >>
> >> Can you explain what you mean by step 5? I.e., what does "using light
> >> mode" mean? Is the openib component register function invoked again?
> >>
> >> That is correct, it is called twice. "Light mode" means that
> >> mca_base_var_register() does not allocate the mca variable object
> >> again; it looks the variable up in the global array and, on finding
> >> it, updates the fields in the mca_base_var_t structure (at least
> >> mbv_storage).
> >>
> >> 6. for an unknown reason, bml finalization does not clean up these
> >> vars as is done in step 2;
> >> 7. mca_btl_openib.so is unloaded;
> >> 8. opal_finalize() destroys the mca variables in the global array,
> >> finds openib's variable, and tries to destroy it using an address
> >> that is no longer accessible;
> >>
> >> So the code under discussion fixes step 6.
> >>
> >> Nathan: it sounds like an MCA var (and entire group) is registered,
> >> unregistered, and then registered again. Does the MCA var system get
> >> confused here when it tries to unregister the group a 2nd time?
> >>
> >> Probably the issue relates to incorrectly recognizing whether a
> >> variable is valid or invalid during the second call of
> >> mca_base_var_deregister().
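
The register / deregister / re-register sequence described in steps 1-8 can
be sketched as a toy model in C. Everything here is hypothetical and only
illustrates the lifecycle: the toy_* names, the fixed-size table, and the
"default.ini" value are assumptions, not Open MPI code. The sketch shows the
safe ordering, where the variable is deregistered a second time before the
component's memory goes away, so the final sweep finds no variable that still
points into it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_MAX_VARS 8

/* Hypothetical table entry; NOT the real mca_base_var_t. */
typedef struct {
    const char *name;     /* a string literal in this sketch */
    char      **storage;  /* points into the owning component's memory */
    int         valid;
} toy_var_t;

static toy_var_t toy_vars[TOY_MAX_VARS];
static int toy_var_count = 0;

/* A heap allocation stands in for the data of a dynamically loaded component. */
typedef struct {
    char *device_param_files;
} toy_component_t;

static char *toy_strdup(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);

    if (NULL != copy) {
        memcpy(copy, s, len);
    }
    return copy;
}

/* Reuse the slot if the name is already known ("light mode"), else append.
 * Returns the slot index, or -1 if the table is full. */
static int toy_var_register(const char *name, char **storage, const char *value)
{
    int i;

    for (i = 0; i < toy_var_count; ++i) {
        if (0 == strcmp(toy_vars[i].name, name)) {
            break;
        }
    }
    if (i == toy_var_count) {
        if (TOY_MAX_VARS == toy_var_count) {
            return -1;
        }
        toy_vars[i].name = name;
        ++toy_var_count;
    }
    toy_vars[i].storage = storage;  /* refreshed on every registration */
    toy_vars[i].valid = 1;
    *storage = toy_strdup(value);
    return i;
}

/* Mark the variable invalid, free its string, and drop the storage pointer. */
static void toy_var_deregister(int index)
{
    toy_var_t *var = &toy_vars[index];

    if (!var->valid) {
        return;  /* already deregistered: nothing to touch */
    }
    free(*var->storage);
    *var->storage = NULL;
    var->storage = NULL;
    var->valid = 0;
}

/* Final sweep. Returns how many variables were still valid when it ran. */
static int toy_var_finalize(void)
{
    int still_valid = 0;

    for (int i = 0; i < toy_var_count; ++i) {
        if (toy_vars[i].valid) {
            ++still_valid;
            toy_var_deregister(i);
        }
    }
    toy_var_count = 0;
    return still_valid;
}

/* Walk the safe ordering; returns the finalize count, or -1 on setup failure. */
static int toy_safe_lifecycle(void)
{
    toy_component_t *openib = malloc(sizeof(*openib));
    int first, second;

    if (NULL == openib) {
        return -1;
    }
    first = toy_var_register("device_param_files",
                             &openib->device_param_files, "default.ini");
    if (first < 0) {
        free(openib);
        return -1;
    }
    toy_var_deregister(first);      /* steps 2-3: another component won */
    second = toy_var_register("device_param_files",
                              &openib->device_param_files, "default.ini");
    assert(first == second);        /* step 5: same slot, fields refreshed */
    toy_var_deregister(second);     /* the cleanup reported missing in step 6 */
    free(openib);                   /* step 7: component memory goes away */
    return toy_var_finalize();      /* step 8: no valid variable remains */
}
```

Skipping the second toy_var_deregister() call in this model would leave a
valid entry whose storage pointer refers to freed memory when the final sweep
runs, which is the situation steps 6-8 describe.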
> >>
> >> _______________________________________________
> >> devel mailing list
> >> devel_at_[hidden]
> >> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>