Ok figured it out. There were three problems with the del_procs code:
1) ompi_mpi_finalize used ompi_proc_all to get the list of procs but
never released the reference to them (ompi_proc_all called
OBJ_RETAIN on all the procs returned). When calling del_procs at
finalize it should suffice to call ompi_proc_world which does not
increment the reference count.
2) del_procs is called BEFORE ompi_comm_finalize. This leaves the
references to the procs from calling the pml_add_comm function. The
fix is to reorder the calls to do ompi_comm_finalize, del_procs,
pml_finalize instead of del_procs, pml_finalize, ompi_comm_finalize.
3) The check in del_procs in r2 checked for a reference count of
1. This is incorrect. At this point there should be 2 references: 1
from ompi_proc, and another from the add_procs. The fix is to change
this check to look for 2. This check makes me extremely uncomfortable
as nothing will call del_procs if the reference count of a proc is
not 2 when del_procs is called. Maybe there should be an assert
since this is a developer error IMHO.
Committing a patch to fix all three of these issues.
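
To make the three problems above concrete, here is a minimal toy model of
the reference counting. It is only a sketch and not the actual Open MPI
code: every toy_* name is hypothetical, the struct is not ompi_proc_t, and
the "exactly N references" check just mirrors the check described in 3).
toy_list_retained stands in for a lookup that retains what it returns (as
ompi_proc_all is described above) and toy_list_borrowed for one that does
not (as ompi_proc_world is described above).

```c
#include <stddef.h>

/* Toy stand-in for a reference-counted proc (NOT the real ompi_proc_t). */
typedef struct {
    int refcount;   /* number of outstanding references */
    int torn_down;  /* set once toy_del_procs really deletes the proc */
} toy_proc_t;

#define TOY_NPROCS 3
toy_proc_t toy_procs[TOY_NPROCS];

/* The proc table itself holds one reference to each proc. */
void toy_init(void) {
    for (int i = 0; i < TOY_NPROCS; i++) {
        toy_procs[i].refcount = 1;
        toy_procs[i].torn_down = 0;
    }
}

/* add_procs takes a second reference on every proc. */
void toy_add_procs(void) {
    for (int i = 0; i < TOY_NPROCS; i++) toy_procs[i].refcount++;
}

/* A communicator takes a further reference until it is finalized. */
void toy_add_comm(void) {
    for (int i = 0; i < TOY_NPROCS; i++) toy_procs[i].refcount++;
}

void toy_comm_finalize(void) {
    for (int i = 0; i < TOY_NPROCS; i++) toy_procs[i].refcount--;
}

/* Lookup that retains every proc it returns; the caller must release. */
toy_proc_t *toy_list_retained(void) {
    for (int i = 0; i < TOY_NPROCS; i++) toy_procs[i].refcount++;
    return toy_procs;
}

/* Lookup that returns borrowed pointers and leaves the counts alone. */
toy_proc_t *toy_list_borrowed(void) {
    return toy_procs;
}

/* Delete only procs with exactly `expected` references; return how many. */
int toy_del_procs(toy_proc_t *list, int expected) {
    int deleted = 0;
    for (int i = 0; i < TOY_NPROCS; i++) {
        if (list[i].refcount == expected) {
            list[i].torn_down = 1;
            list[i].refcount--;  /* drop the add_procs reference */
            deleted++;
        }
    }
    return deleted;
}

/* Run one finalize sequence and report how many procs were deleted. */
int toy_finalize(int retain_list, int comm_first, int expected) {
    toy_init();
    toy_add_procs();
    toy_add_comm();
    if (comm_first) toy_comm_finalize();
    toy_proc_t *list = retain_list ? toy_list_retained() : toy_list_borrowed();
    int deleted = toy_del_procs(list, expected);
    if (!comm_first) toy_comm_finalize();
    return deleted;
}
```

In this model only toy_finalize(0, 1, 2) deletes all three procs: a
borrowed list, communicators finalized first, and a check for 2
references. Flipping any one argument models one of the three problems,
and toy_del_procs then silently deletes nothing, which is the situation an
assert would catch.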
On Thu, May 15, 2014 at 11:52:27AM -0600, Nathan Hjelm wrote:
> On Thu, May 15, 2014 at 11:44:05AM -0600, Nathan Hjelm wrote:
> > On Thu, May 15, 2014 at 01:33:31PM -0400, George Bosilca wrote:
> > > The solution you propose here is definitely not OK. It is 1) ugly and 2) breaks the separation barrier that we hold dear.
> > Which is why I asked :)
> > > Regarding your other suggestion I don't see any reasons not to call the delete_proc on MPI_COMM_WORLD as the last action we do before tearing down everything else.
> > I spoke too soon. It looks like we *are* calling del_procs but I am not
> > seeing the call reach the bml.... I will try and track this down.
> /bml/btl/ .. I see what is happening. The proc reference counts are all
> larger than 1 when we call del_procs:
> [1,2]<stderr>:Deleting proc 0x7b83190 with reference count 5
> [1,1]<stderr>:Deleting proc 0x7b83180 with reference count 5
> [1,2]<stderr>:Deleting proc 0x7b832b0 with reference count 5
> [1,1]<stderr>:Deleting proc 0x7b832a0 with reference count 7
> [1,2]<stderr>:Deleting proc 0x7b83360 with reference count 7
> [1,1]<stderr>:Deleting proc 0x7b833a0 with reference count 5
> [1,0]<stderr>:Deleting proc 0x7b83190 with reference count 7
> [1,0]<stderr>:Deleting proc 0x7b83300 with reference count 5
> [1,0]<stderr>:Deleting proc 0x7b833b0 with reference count 5
> I will track that down.