
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] barrier before calling del_procs
From: George Bosilca (bosilca_at_[hidden])
Date: 2014-07-21 14:10:39


On Mon, Jul 21, 2014 at 1:41 PM, Yossi Etigin <yosefe_at_[hidden]> wrote:

> Right, but:
>
> 1. IMHO the rte_barrier is in the wrong place (in the trunk)
>

In the trunk we have the rte_barrier prior to del_procs, which is what I
would have expected: quiesce the BTLs by reaching a point where everybody
agrees that no more MPI messages will be exchanged, and then delete the
BTLs.
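
A minimal, self-contained sketch of that ordering, using hypothetical stub
functions in place of the real Open MPI internals (this is an illustration,
not the actual ompi_mpi_finalize() code):

#include <stdio.h>

/* Hypothetical stand-ins for the real internals; they only print so the
 * ordering is visible when run. */
static int rte_barrier(void) { printf("rte_barrier: all ranks agree no more MPI traffic\n"); return 0; }
static int del_procs(void)   { printf("del_procs: tearing down BTL endpoints\n"); return 0; }

int main(void)
{
    /* Quiesce first: every process agrees that no further MPI messages
     * will be exchanged, and in-flight traffic has time to drain. */
    if (0 != rte_barrier()) return 1;

    /* Only then delete the BTLs/endpoints. */
    if (0 != del_procs()) return 1;

    return 0;
}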

> 2. In addition to the rte_barrier, we also need an mpi_barrier
>
Care to provide the reasoning for this barrier? Why, and where, should it be
placed?

  George.

>
>
> From: devel [mailto:devel-bounces_at_[hidden]] On Behalf Of George Bosilca
> Sent: Monday, July 21, 2014 8:19 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] barrier before calling del_procs
>
>
>
> There was a long thread of discussion on why we must use an rte_barrier
> and not an mpi_barrier during finalize. Basically, as long as we have
> connectionless, unreliable BTLs we need an external mechanism to ensure
> complete tear-down of the entire infrastructure. Thus, we rely on an
> rte_barrier not because it guarantees the correctness of the code, but
> because it gives all processes enough time to flush all HPC traffic.
>
>
>
> George.
>
>
>
>
>
> On Mon, Jul 21, 2014 at 1:10 PM, Yossi Etigin <yosefe_at_[hidden]> wrote:
>
> I see. But in branch v1.8, in r31869, Ralph reverted the commit which moved
> del_procs after the barrier:
> "Revert r31851 until we can resolve how to close these leaks without
> causing the usnic BTL to fail during disconnect of intercommunicators
> Refs #4643"
> Also, we need an rte barrier after del_procs, because otherwise rank A
> could call pml_finalize() before rank B finishes disconnecting from rank A.
>
> I think the order in finalize should be like this:
> 1. mpi_barrier(world)
> 2. del_procs()
> 3. rte_barrier()
> 4. pml_finalize()
>
>
> -----Original Message-----
> From: Nathan Hjelm [mailto:hjelmn_at_[hidden]]
> Sent: Monday, July 21, 2014 8:01 PM
> To: Open MPI Developers
> Cc: Yossi Etigin
> Subject: Re: [OMPI devel] barrier before calling del_procs
>
> I should add that it is an rte barrier and not an MPI barrier for
> technical reasons.
>
> -Nathan
>
> On Mon, Jul 21, 2014 at 09:42:53AM -0700, Ralph Castain wrote:
> > We already have an rte barrier before del_procs
> >
> > Sent from my iPhone
> > On Jul 21, 2014, at 8:21 AM, Yossi Etigin <yosefe_at_[hidden]> wrote:
> >
> > Hi,
> >
> >
> >
> > We get occasional hangs with MTL/MXM during finalize, because a global
> > synchronization is needed before calling del_procs.
> >
> > e.g. rank A may call del_procs() and disconnect from rank B, while
> > rank B is still working.
> >
> > What do you think about adding an MPI barrier on COMM_WORLD before
> > calling del_procs()?
> >
> >
>