Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: about dynamic/intercomm_create test from ibm test suite
From: George Bosilca (bosilca_at_[hidden])
Date: 2014-05-28 10:34:37


From the MPI perspective, calling MPI_Comm_free is not enough to clean
up all knowledge about remote processes, nor to sever the links between
the local and remote groups. One MUST call MPI_Comm_disconnect to
achieve this.

Look at the code in ompi/mpi/c and compare MPI_Comm_free with
MPI_Comm_disconnect. Besides performing a barrier, disconnect is the
only one of the two that calls into the DPM framework, which gives that
framework a chance to do further cleanup.

  George.

On Wed, May 28, 2014 at 10:10 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>
> On May 28, 2014, at 6:41 AM, Gilles Gouaillardet
> <gilles.gouaillardet_at_[hidden]> wrote:
>
> Ralph,
>
>
> On Wed, May 28, 2014 at 9:33 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>
>> This is definitely what happens: only some tasks call MPI_Comm_free()
>>
>>
>> Really? I don't see how that can happen in loop_spawn - every process is
>> clearly calling comm_free. Or are you referring to the intercomm_create
>> test?
>>
> Yes, I am referring to the intercomm_create test
>
>
> kewl - thanks
>
>
> About loop_spawn: I could not get any error on my single-host,
> single-socket VM.
> (I tried --mca btl tcp,sm,self and --mca btl tcp,self)
>
> MPI_Finalize ends up calling ompi_dpm_dyn_finalize, which causes the
> error message on the parent of intercomm_create.
> A necessary condition is ompi_comm_num_dyncomm > 1
> /* which, by the way, sounds odd to me; should it be 0? */
>
>
> That does sound odd
>
> which imho cannot happen if all communicators have been freed
>
> Can you detail your full mpirun command line, the number of servers you are
> using, the BTLs involved, and the OMPI release that can be used to reproduce
> the issue?
>
>
> Running on only one server, using the current head of the svn repo. My
> cluster only has Ethernet, and I let it freely choose the BTLs (so I imagine
> the candidates are sm,self,tcp,vader). The cmd line is really trivial:
>
> mpirun -n 1 ./loop_spawn
>
> I modified loop_spawn to only run 100 iterations as I am not patient enough
> to wait for 1000, and the number of iters isn't a factor so long as it is
> greater than 1. When the parent calls finalize, I get one of the following
> emitted for every iteration that was done:
>
> dpm_base_disconnect_init: error -12 in isend to process 0
>
> So in other words, the parent is attempting to isend to every child that was
> spawned during the test - it thinks that every comm_spawn'd process remains
> connected to it.
>
> I'm wondering if the issue is that the parent and child are calling
> comm_free, but neither side called comm_disconnect. So unless comm_free is
> calling disconnect under-the-covers, it might explain why the parent thinks
> all the children are still present.
>
>
>
> I will try to reproduce this myself
>
> Cheers,
>
> Gilles
>
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14890.php