Open MPI Development Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI devel] RFC: about dynamic/intercomm_create test from ibm test suite
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-05-28 10:10:03

On May 28, 2014, at 6:41 AM, Gilles Gouaillardet <gilles.gouaillardet_at_[hidden]> wrote:

> Ralph,
> On Wed, May 28, 2014 at 9:33 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>> This is definitely what happens: only some tasks call MPI_Comm_free()
> Really? I don't see how that can happen in loop_spawn - every process is clearly calling comm_free. Or are you referring to the intercomm_create test?
> yes, i am referring to the intercomm_create test

kewl - thanks

> about loop_spawn, i could not get any error on my single host single socket VM.
> (i tried --mca btl tcp,sm,self and --mca btl tcp,self)
> MPI_Finalize will end up calling ompi_dpm_dyn_finalize which causes the error message on the parent of intercomm_create.
> a necessary condition is ompi_comm_num_dyncomm > 1
> /* which by the way sounds odd to me, should it be 0 ? */

That does sound odd

> which imho cannot happen if all communicators have been freed
> can you detail your full mpirun command line, the number of servers you are using, the btl involved and the ompi release that can be used to reproduce the issue ?

Running on only one server, using the current head of the svn repo. My cluster only has Ethernet, and I let it freely choose the BTLs (so I imagine the candidates are sm,self,tcp,vader). The cmd line is really trivial:

mpirun -n 1 ./loop_spawn

I modified loop_spawn to only run 100 iterations as I am not patient enough to wait for 1000, and the number of iters isn't a factor so long as it is greater than 1. When the parent calls finalize, I get one of the following emitted for every iteration that was done:

dpm_base_disconnect_init: error -12 in isend to process 0

So in other words, the parent is attempting to isend to every child that was spawned during the test - it thinks that every comm_spawn'd process remains connected to it.

I'm wondering if the issue is that the parent and child are calling comm_free, but neither side called comm_disconnect. So unless comm_free is calling disconnect under-the-covers, it might explain why the parent thinks all the children are still present.

> i will try to reproduce this myself
> Cheers,
> Gilles