On Jan 9, 2014, at 11:00 AM, Joshua Ladd <joshual_at_[hidden]> wrote:
> Hcoll uses the PML as an "OOB" to bootstrap itself. When a communicator is destroyed, by the time we destroy the hcoll module, the communicator context is no longer valid and any pending operations that rely on its existence will fail. In particular, we have a non-blocking synchronization barrier that may be in progress when the communicator is destroyed.
Can you explain this a little more? Do you mean you have a pending MPI_Ibarrier running on that communicator (i.e., the ibarrier has started but not completed)? Or do you have some started-but-not-completed MPI_Isends/MPI_Irecvs?
(Using the PML/coll equivalents of these, of course -- not the top-level MPI_* functions.)
Or are you saying that you need the destruction of the hcoll module on a given communicator to be synchronous between all processes in that communicator?
> Registering the delete callback allows us to finish these operations because the context is still valid inside of this callback. The commented-out code is the "prototype" protocol that attempted to handle this scenario in an entirely different (and more complex) way. It is not needed now. We don't want to introduce solutions that are OMPI-specific, because we need to be able to integrate hcoll into other runtimes. We considered approaching the community about changing the comm destroy flow in OMPI to keep the context alive long enough to complete our synchronization barriers, but then the solution is tied to a particular MPI
I'm not quite sure I understand -- the hcoll module (where this code is located) is completely OMPI-specific. I thought that libhcoll was your independent-of-MPI-implementations portion of this code...?
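
To make sure we are talking about the same ordering, here is a rough toy model of the teardown sequence as I read it from the description above. It is only a sketch in plain Python, not Open MPI or hcoll code: the names Communicator, PendingBarrier, delete_callbacks, module_destructors and run are all hypothetical, and the only thing it assumes is what was stated in the quoted text, namely that delete callbacks fire while the context is still valid and that the module is destroyed afterwards.

```python
class Communicator:
    """Toy stand-in for a communicator; not Open MPI's real data structure."""

    def __init__(self):
        self.context_valid = True
        self.delete_callbacks = []    # assumed to run first, context still valid
        self.module_destructors = []  # assumed to run last, context already gone

    def free(self):
        # Step 1: delete callbacks see a valid context.
        for callback in self.delete_callbacks:
            callback(self)
        # Step 2: the context is torn down.
        self.context_valid = False
        # Step 3: per-communicator modules are destroyed.
        for destructor in self.module_destructors:
            destructor(self)


class PendingBarrier:
    """Toy non-blocking barrier that needs a valid context to complete."""

    def __init__(self):
        self.done = False

    def complete(self, comm):
        if not comm.context_valid:
            raise RuntimeError("communicator context is no longer valid")
        self.done = True


def run(use_delete_callback):
    """Return True if the pending barrier completes during comm.free()."""
    comm = Communicator()
    barrier = PendingBarrier()
    if use_delete_callback:
        comm.delete_callbacks.append(barrier.complete)
    else:
        comm.module_destructors.append(barrier.complete)
    try:
        comm.free()
    except RuntimeError:
        return False
    return barrier.done
```

In this model, run(True) finishes the barrier inside the delete callback, while run(False) tries to finish it from the module destructor and fails because the context has already been invalidated. If that is not the ordering you are seeing, that would help me understand the problem.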