Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: about dynamic/intercomm_create test from ibm test suite
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-05-28 10:45:42


On May 28, 2014, at 7:34 AM, George Bosilca <bosilca_at_[hidden]> wrote:

> Calling MPI_Comm_free is not enough from the MPI perspective to clean up
> all knowledge about remote processes, nor to sever the links between
> the local and remote groups. One MUST call MPI_Comm_disconnect in
> order to achieve this.
>
> Look at the code in ompi/mpi/c and see the difference between
> MPI_Comm_free and MPI_Comm_disconnect. In addition to the barrier, only
> disconnect calls into the DPM framework, giving it a chance to do
> further cleanup.

Good point - however, that doesn't fix it. Changing the Comm_free calls to Comm_disconnect results in the same error messages when the parent finalizes:

Parent:
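    /* Excerpt from main(); EXE_TEST is defined elsewhere in the test. */
    MPI_Comm comm, merged;
    int iter, err, rank, size;

    MPI_Init(&argc, &argv);

    for (iter = 0; iter < 100; ++iter) {
        /* Spawn one child; comm is the parent/child intercommunicator. */
        MPI_Comm_spawn(EXE_TEST, NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &comm, &err);
        printf("parent: MPI_Comm_spawn #%d return : %d\n", iter, err);

        /* Merge parent and child into a single intracommunicator. */
        MPI_Intercomm_merge(comm, 0, &merged);
        MPI_Comm_rank(merged, &rank);
        MPI_Comm_size(merged, &size);
        printf("parent: MPI_Comm_spawn #%d rank %d, size %d\n",
               iter, rank, size);
        /* Previously MPI_Comm_free. */
        MPI_Comm_disconnect(&merged);
    }

    MPI_Finalize();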

Child:
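    /* Excerpt from main() of the spawned executable. */
    MPI_Comm parent, merged;
    int rank, size;

    MPI_Init(&argc, &argv);
    printf("Child: launch\n");
    MPI_Comm_get_parent(&parent);
    /* Merge with the parent into a single intracommunicator. */
    MPI_Intercomm_merge(parent, 1, &merged);
    MPI_Comm_rank(merged, &rank);
    MPI_Comm_size(merged, &size);
    printf("Child merged rank = %d, size = %d\n", rank, size);

    /* Previously MPI_Comm_free. */
    MPI_Comm_disconnect(&merged);
    MPI_Finalize();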

Upon parent calling finalize:

dpm_base_disconnect_init: error -12 in isend to process 0
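
One thing that may be worth checking in the snippets above: both sides disconnect merged, but the intercommunicator returned by MPI_Comm_spawn (comm in the parent, parent in the child) is never freed or disconnected. The MPI standard notes that two processes can be connected through several communicators, and that several of them may have to be disconnected before the processes are independent. The following is a toy model of that bookkeeping in plain C. It is not Open MPI code: conn_table, conn_spawn, conn_merge, conn_disconnect, conn_still_connected and simulate_parent are hypothetical names introduced only for illustration. It sketches why the parent could still count one connection per spawned child at finalize, under the assumption that the leftover intercommunicator is what keeps the link alive. It is not a confirmed diagnosis.

```c
#include <assert.h>
#include <string.h>

#define MAX_CHILDREN 1024

/* Hypothetical bookkeeping, NOT Open MPI internals: number of live
 * communicators that link the parent to each spawned child. */
typedef struct {
    int live_comms[MAX_CHILDREN];
    int num_children;
} conn_table;

void conn_init(conn_table *t)
{
    memset(t, 0, sizeof(*t));
}

/* Models MPI_Comm_spawn: a new child linked by one intercommunicator.
 * Returns the index of the new child. */
int conn_spawn(conn_table *t)
{
    assert(t->num_children < MAX_CHILDREN);
    t->live_comms[t->num_children] = 1;
    return t->num_children++;
}

/* Models MPI_Intercomm_merge: a second communicator spans the same pair. */
void conn_merge(conn_table *t, int child)
{
    assert(child >= 0 && child < t->num_children);
    t->live_comms[child]++;
}

/* Models MPI_Comm_disconnect on ONE communicator linking parent and child. */
void conn_disconnect(conn_table *t, int child)
{
    assert(child >= 0 && child < t->num_children);
    assert(t->live_comms[child] > 0);
    t->live_comms[child]--;
}

/* Number of children that still have at least one live communicator. */
int conn_still_connected(const conn_table *t)
{
    int i, n = 0;
    for (i = 0; i < t->num_children; ++i) {
        if (t->live_comms[i] > 0) {
            n++;
        }
    }
    return n;
}

/* Replays the parent loop and returns how many children are still
 * connected afterwards. iters must not exceed MAX_CHILDREN. */
int simulate_parent(int iters, int also_disconnect_intercomm)
{
    conn_table t;
    int iter;

    conn_init(&t);
    for (iter = 0; iter < iters; ++iter) {
        int child = conn_spawn(&t);       /* comm */
        conn_merge(&t, child);            /* merged */
        conn_disconnect(&t, child);       /* MPI_Comm_disconnect(&merged) */
        if (also_disconnect_intercomm) {
            conn_disconnect(&t, child);   /* MPI_Comm_disconnect(&comm) */
        }
    }
    return conn_still_connected(&t);
}
```

In this model, simulate_parent(100, 0) leaves 100 children connected, one per iteration, while simulate_parent(100, 1) leaves none. Whether adding MPI_Comm_disconnect(&comm) in the parent and MPI_Comm_disconnect(&parent) in the child actually silences the dpm_base_disconnect_init messages would still need to be tested against the real code.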

>
> George.
>
>
> On Wed, May 28, 2014 at 10:10 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>
>> On May 28, 2014, at 6:41 AM, Gilles Gouaillardet
>> <gilles.gouaillardet_at_[hidden]> wrote:
>>
>> Ralph,
>>
>>
>> On Wed, May 28, 2014 at 9:33 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>
>>> This is definitely what happens: only some tasks call MPI_Comm_free()
>>>
>>>
>>> Really? I don't see how that can happen in loop_spawn - every process is
>>> clearly calling comm_free. Or are you referring to the intercomm_create
>>> test?
>>>
>> yes, i am referring to the intercomm_create test
>>
>>
>> kewl - thanks
>>
>>
>> about loop_spawn, i could not get any error on my single host single socket
>> VM.
>> (i tried --mca btl tcp,sm,self and --mca btl tcp,self)
>>
>> MPI_Finalize will end up calling ompi_dpm_dyn_finalize which causes the
>> error message on the parent of intercomm_create.
>> a necessary condition is ompi_comm_num_dyncomm > 1
>> /* which by the way sounds odd to me, should it be 0 ? */
>>
>>
>> That does sound odd
>>
>> which imho cannot happen if all communicators have been freed
>>
>> can you detail your full mpirun command line, the number of servers you are
>> using, the btl involved and the ompi release that can be used to reproduce
>> the issue ?
>>
>>
>> Running on only one server, using the current head of the svn repo. My
>> cluster only has Ethernet, and I let it freely choose the BTLs (so I imagine
>> the candidates are sm,self,tcp,vader). The cmd line is really trivial:
>>
>> mpirun -n 1 ./loop_spawn
>>
>> I modified loop_spawn to only run 100 iterations as I am not patient enough
>> to wait for 1000, and the number of iters isn't a factor so long as it is
>> greater than 1. When the parent calls finalize, I get one of the following
>> emitted for every iteration that was done:
>>
>> dpm_base_disconnect_init: error -12 in isend to process 0
>>
>> So in other words, the parent is attempting to isend to every child that was
>> spawned during the test - it thinks that every comm_spawn'd process remains
>> connected to it.
>>
>> I'm wondering if the issue is that the parent and child are calling
>> comm_free, but neither side called comm_disconnect. So unless comm_free is
>> calling disconnect under-the-covers, it might explain why the parent thinks
>> all the children are still present.
>>
>>
>>
>> i will try to reproduce this myself
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/05/14890.php