Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: about dynamic/intercomm_create test from ibm test suite
From: Gilles Gouaillardet (gilles.gouaillardet_at_[hidden])
Date: 2014-05-27 21:11:50


Ralph,

in the case of intercomm_create, the children free all their communicators,
then call MPI_Comm_disconnect() and MPI_Finalize(), and exit.
the parent only calls MPI_Comm_disconnect(), without freeing all its
communicators, so its MPI_Finalize() tries to disconnect from, and communicate
with, processes that have already exited.
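
to make the "still connected" notion concrete, here is a toy model in plain C. it makes no MPI calls and is not Open MPI code: link_set, link_add, link_release and still_connected are hypothetical names made up for this illustration. it only assumes the MPI definition that two processes stay connected as long as at least one communicator links them.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_COMMS 8

/* Toy model (not Open MPI code): which communicators still link a
 * parent to its spawned children. */
typedef struct {
    bool live[MAX_COMMS];
} link_set;

/* Start with no communicator between the two sides. */
void link_set_init(link_set *s)
{
    for (int i = 0; i < MAX_COMMS; i++)
        s->live[i] = false;
}

/* Model creating one more communicator that spans both sides.
 * Returns its slot index, or -1 if the table is full. */
int link_add(link_set *s)
{
    for (int i = 0; i < MAX_COMMS; i++) {
        if (!s->live[i]) {
            s->live[i] = true;
            return i;
        }
    }
    return -1;
}

/* Model MPI_Comm_free() or MPI_Comm_disconnect() on one communicator:
 * only that communicator goes away. */
void link_release(link_set *s, int id)
{
    assert(id >= 0 && id < MAX_COMMS);
    s->live[id] = false;
}

/* The two sides stay connected while any communicator links them. */
bool still_connected(const link_set *s)
{
    for (int i = 0; i < MAX_COMMS; i++)
        if (s->live[i])
            return true;
    return false;
}
```

with two communicators added (e.g. the one from the spawn plus one more created by the test) and only one of them released, still_connected() keeps returning true. that is the parent's situation when it reaches MPI_Finalize(); it becomes false only once every communicator has been released.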

my understanding is that there are two ways of seeing things:
a) the "R-way": the problem is that the parent should not try to communicate
with already exited processes
b) the "J-way": the problem is that the children should have waited in either
MPI_Comm_free() or MPI_Finalize()

i have not investigated the loop_spawn test yet; i will do so today.

as far as i am concerned, i have no opinion on which of a) or b) is the
correct/most appropriate approach.

Cheers,

Gilles

On Wed, May 28, 2014 at 9:46 AM, Ralph Castain <rhc_at_[hidden]> wrote:

> Since you ignored my response, I'll reiterate and clarify it here. The
> problem in the case of loop_spawn is that the parent process remains
> "connected" to children after the child has finalized and died. Hence, when
> the parent attempts to finalize, it tries to "disconnect" itself from
> processes that no longer exist - and that is what generates the error
> message.
>
> So the issue in that case appears to be that "finalize" is not marking the
> child process as "disconnected", thus leaving the parent thinking that it
> needs to disconnect when it finally ends.
>
>
> On May 27, 2014, at 5:33 PM, Jeff Squyres (jsquyres) <jsquyres_at_[hidden]>
> wrote:
>
> > Note that MPI says that COMM_DISCONNECT simply disconnects that
> > individual communicator. It does *not* guarantee that the processes
> > involved will be fully disconnected.
> >
> > So I think that the freeing of communicators is good app behavior, but
> > it is not required by the MPI spec.
> >
> > If OMPI is requiring this for correct termination, then something is
> > wrong. MPI_FINALIZE is supposed to be collective across all connected MPI
> > procs -- and if the parent and spawned procs in this test are still
> > connected (because they have not disconnected all communicators between
> > them), the FINALIZE is supposed to be collective across all of them.
> >
> > This means that FINALIZE is allowed to block if it needs to, such that
> > OMPI sending control messages to procs that are still "connected" (in the
> > MPI sense) should never cause a race condition.
> >
> > As such, this sounds like an OMPI bug.
> >
> >
> >
> >
> > On May 27, 2014, at 2:27 AM, Gilles Gouaillardet <gilles.gouaillardet_at_[hidden]> wrote:
> >
> >> Folks,
> >>
> >> currently, the dynamic/intercomm_create test from the ibm test suite
> >> outputs the following messages:
> >>
> >> dpm_base_disconnect_init: error -12 in isend to process 1
> >>
> >> the root cause is that task 0 tries to send messages to already exited tasks.
> >>
> >> one way of seeing things is that this is an application issue:
> >> task 0 should have MPI_Comm_free'd all its communicators before calling
> >> MPI_Comm_disconnect.
> >> This can be achieved via the attached patch.
> >>
> >> another way of seeing things is that this is a bug in OpenMPI.
> >> In this case, what would be the right approach?
> >> - automatically free communicators (if needed) when MPI_Comm_disconnect
> >> is invoked?
> >> - simply remove communicators (if needed) from ompi_mpi_communicators
> >> when MPI_Comm_disconnect is invoked?
> >> /* this causes a memory leak, but the application can be seen as
> >> responsible for it */
> >> - other?
> >>
> >> Thanks in advance for your feedback,
> >>
> >> Gilles
> >> <intercomm_create.patch>
> >> _______________________________________________
> >> devel mailing list
> >> devel_at_[hidden]
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/05/14847.php
> >
> >
> > --
> > Jeff Squyres
> > jsquyres_at_[hidden]
> > For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> >
>
>