There was a long thread of discussion on why we must use an rte_barrier and
not an mpi_barrier during finalize. Basically, as long as we have
connectionless unreliable BTLs, we need an external mechanism to ensure
complete tear-down of the entire infrastructure. Thus, we need to rely on
an rte_barrier not because it guarantees the correctness of the code, but
because it gives all processes enough time to flush all HPC traffic.
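
For reference, here is a minimal sketch of the finalize ordering proposed in
the quoted message, so the placement of the two barriers is easy to see. All
sketch_* names are hypothetical stand-ins that only record the order in which
they are called; they are not the real Open MPI functions, and the sequence is
the proposal under discussion, not necessarily what the tree does today.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins: each stub only logs its name so that the
 * ordering can be inspected. None of this is actual Open MPI code. */
#define MAX_STEPS 8
static const char *step_log[MAX_STEPS];
static int step_count = 0;

static void record(const char *name)
{
    assert(step_count < MAX_STEPS);
    step_log[step_count++] = name;
}

/* 1. MPI-level barrier on the world communicator, so that no rank starts
 *    disconnecting while a peer is still working. */
static void sketch_mpi_barrier_world(void) { record("mpi_barrier"); }

/* 2. Disconnect from all peers. */
static void sketch_del_procs(void) { record("del_procs"); }

/* 3. Out-of-band (runtime) barrier, so that no rank reaches pml_finalize
 *    before its peers have finished disconnecting from it. */
static void sketch_rte_barrier(void) { record("rte_barrier"); }

/* 4. Tear down the PML. */
static void sketch_pml_finalize(void) { record("pml_finalize"); }

/* Runs the four steps in the proposed order; returns the number of steps. */
int sketch_finalize(void)
{
    step_count = 0;
    sketch_mpi_barrier_world();
    sketch_del_procs();
    sketch_rte_barrier();
    sketch_pml_finalize();
    return step_count;
}

/* Prints the recorded order, one step per line. */
void sketch_print_order(void)
{
    for (int i = 0; i < step_count; i++) {
        printf("%d. %s\n", i + 1, step_log[i]);
    }
}
```

Calling sketch_finalize() and then sketch_print_order() from a main() lists
the four steps in the order they ran.
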
On Mon, Jul 21, 2014 at 1:10 PM, Yossi Etigin <yosefe_at_[hidden]> wrote:
> I see. But in branch v1.8, in 31869, Ralph reverted the commit which moved
> del_procs after the barrier:
> "Revert r31851 until we can resolve how to close these leaks without
> causing the usnic BTL to fail during disconnect of intercommunicators
> Refs #4643"
> Also, we need an rte barrier after del_procs - because otherwise rankA
> could call pml_finalize() before rankB finishes disconnecting from rankA.
> I think the order in finalize should be like this:
> 1. mpi_barrier(world)
> 2. del_procs()
> 3. rte_barrier()
> 4. pml_finalize()
> -----Original Message-----
> From: Nathan Hjelm [mailto:hjelmn_at_[hidden]]
> Sent: Monday, July 21, 2014 8:01 PM
> To: Open MPI Developers
> Cc: Yossi Etigin
> Subject: Re: [OMPI devel] barrier before calling del_procs
> I should add that it is an rte barrier and not an MPI barrier for
> technical reasons.
> On Mon, Jul 21, 2014 at 09:42:53AM -0700, Ralph Castain wrote:
> > We already have an rte barrier before del procs
> > On Jul 21, 2014, at 8:21 AM, Yossi Etigin <yosefe_at_[hidden]> wrote:
> > Hi,
> > We get occasional hangs with MTL/MXM during finalize, because a
> > synchronization is needed before calling del_procs.
> > E.g., rank A may call del_procs() and disconnect from rank B while
> > rank B is still working.
> > What do you think about adding an MPI barrier on COMM_WORLD before
> > calling del_procs()?
> > _______________________________________________
> > devel mailing list
> > devel_at_[hidden]
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2014/07/15204.php