
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] RFC: fix leak of bml endpoints
From: Nathan Hjelm (hjelmn_at_[hidden])
Date: 2014-05-15 14:20:58


Ok figured it out. There were three problems with the del_procs code:

 1) ompi_mpi_finalize used ompi_proc_all to get the list of procs but
    never released its references to them (ompi_proc_all calls
    OBJ_RETAIN on every proc it returns). When calling del_procs at
    finalize it should suffice to call ompi_proc_world, which does not
    increment the reference count.

 2) del_procs is called BEFORE ompi_comm_finalize. This leaves the
    references to the procs taken by the pml_add_comm function still
    outstanding. The fix is to reorder the calls to do
    ompi_comm_finalize, del_procs, pml_finalize instead of del_procs,
    pml_finalize, ompi_comm_finalize.

 3) The check in del_procs in r2 checked for a reference count of
    1. This is incorrect. At this point there should be 2 references: 1
    from ompi_proc, and another from the add_procs. The fix is to change
    this check to look for 2. This check makes me extremely uncomfortable
    as nothing will call del_procs if the reference count of a proc is
    not 2 when del_procs is called. Maybe there should be an assert
    since this is a developer error IMHO.
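
For anyone following along, here is a toy model of the three
reference-count issues above. It is only a sketch: every toy_* name is
made up for illustration and none of this is the actual ompi_proc, pml
or bml code. It assumes each proc starts with one reference (the proc
list), add_procs takes a second, and each communicator takes one more.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted proc object.
 * This is NOT Open MPI's ompi_proc_t; every toy_* name is invented. */
typedef struct {
    int refcount;       /* starts at 1: the reference held by the proc list */
    int endpoint_live;  /* 1 while a (toy) bml endpoint is attached */
} toy_proc_t;

/* Lookup that retains every proc it returns; the caller must release. */
static toy_proc_t *toy_proc_all(toy_proc_t *procs, size_t n)
{
    for (size_t i = 0; i < n; i++) procs[i].refcount++;
    return procs;
}

/* Lookup that hands back the procs without touching the counts. */
static toy_proc_t *toy_proc_world(toy_proc_t *procs, size_t n)
{
    (void)n;
    return procs;
}

/* add_procs attaches an endpoint and takes one reference per proc. */
static void toy_add_procs(toy_proc_t *procs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        procs[i].endpoint_live = 1;
        procs[i].refcount++;
    }
}

/* A communicator takes (and later drops) one reference per proc. */
static void toy_comm_add(toy_proc_t *procs, size_t n)
{
    for (size_t i = 0; i < n; i++) procs[i].refcount++;
}

static void toy_comm_finalize(toy_proc_t *procs, size_t n)
{
    for (size_t i = 0; i < n; i++) procs[i].refcount--;
}

/* del_procs only tears down a proc holding exactly 2 references
 * (proc list + add_procs). Returns how many procs were skipped. */
static size_t toy_del_procs(toy_proc_t *procs, size_t n)
{
    size_t skipped = 0;
    for (size_t i = 0; i < n; i++) {
        if (2 != procs[i].refcount) { skipped++; continue; }
        procs[i].endpoint_live = 0;
        procs[i].refcount--;
    }
    return skipped;
}

/* Teardown in the corrected order: communicators first, then del_procs.
 * The assert turns a silently leaked endpoint into a loud failure. */
static void toy_finalize(toy_proc_t *procs, size_t n)
{
    toy_comm_finalize(procs, n);
    size_t skipped = toy_del_procs(toy_proc_world(procs, n), n);
    assert(0 == skipped);
    (void)skipped;  /* silences the unused warning when NDEBUG is set */
}
```

In this model, calling toy_del_procs while a communicator still holds
its reference, or on a list obtained through toy_proc_all, skips every
proc and leaves endpoint_live set, which is the leak. Returning the
skipped count rather than silently continuing is one way to make the
assert suggested in item 3 possible.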

Committing a patch to fix all three of these issues.

-Nathan

On Thu, May 15, 2014 at 11:52:27AM -0600, Nathan Hjelm wrote:
> On Thu, May 15, 2014 at 11:44:05AM -0600, Nathan Hjelm wrote:
> > On Thu, May 15, 2014 at 01:33:31PM -0400, George Bosilca wrote:
> > > The solution you propose here is definitely not OK. It 1) is ugly and 2) breaks the separation barrier that we hold dear.
> >
> > Which is why I asked :)
> >
> > > Regarding your other suggestion I don’t see any reasons not to call the delete_proc on MPI_COMM_WORLD as the last action we do before tearing down everything else.
> >
> > I spoke too soon. It looks like we *are* calling del_procs but I am not
> > seeing the call reach the bml.... I will try and track this down.
>
> /bml/btl/ .. I see what is happening. The proc reference counts are all
> larger than 1 when we call del_procs:
>
>
> [1,2]<stderr>:Deleting proc 0x7b83190 with reference count 5
> [1,1]<stderr>:Deleting proc 0x7b83180 with reference count 5
> [1,2]<stderr>:Deleting proc 0x7b832b0 with reference count 5
> [1,1]<stderr>:Deleting proc 0x7b832a0 with reference count 7
> [1,2]<stderr>:Deleting proc 0x7b83360 with reference count 7
> [1,1]<stderr>:Deleting proc 0x7b833a0 with reference count 5
> [1,0]<stderr>:Deleting proc 0x7b83190 with reference count 7
> [1,0]<stderr>:Deleting proc 0x7b83300 with reference count 5
> [1,0]<stderr>:Deleting proc 0x7b833b0 with reference count 5
>
>
> I will track that down.
>
> -Nathan

> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/05/14812.php


