Open MPI User's Mailing List Archives

From: Michael Kluskens (mkluskens_at_[hidden])
Date: 2006-04-26 15:25:41


Correction: this problem occurs (with OpenMPI 1.2) only when I don't
use mpirun to launch my process.

I know this seems strange to most MPI users, but when using OpenMPI
and needing only one process (because I spawn everything else I need),
I had found it quicker just to launch the executable directly.

So far I have tested only OpenMPI 1.2, where my test code works (if I
have trouble I'll test 1.1). Below is the proper output for my test of
spawning, disconnecting, and respawning:

>mpirun -np 1 parent2
parent: 0 of 1
parent: How many processes total?
2
parent: Calling MPI_Comm_spawn to start 1 subprocesses.
child starting
parent returned from Comm_Spawn call
parent: Calling MPI_BCAST with btest = 17 . child = 3
child 0 of 1: Parent 3
parent: Calling MPI_Comm_spawn to start 1 subprocesses.
child 0 of 1: Receiving 17 from parent
child calling COMM_FREE
child calling FINALIZE
child exiting
Maximum user memory allocated: 0
child starting
parent: Calling MPI_BCAST with btest = 17 . child = 3
child 0 of 1: Parent 3
child 0 of 1: Receiving 17 from parent
child calling COMM_FREE
child calling FINALIZE
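
For readers without the attached sources, the spawn, disconnect, and respawn sequence behind the output above can be sketched roughly as follows. This is a minimal, hypothetical single-file sketch, not the attached parent2.f90/child2.f90. The program name spawn_twice, the round count nrounds, and the trick of having the executable spawn itself are assumptions made for illustration. It also substitutes MPI_COMM_DISCONNECT on both sides for the MPI_COMM_FREE call that the original child makes.

```fortran
! Hypothetical sketch: the same executable acts as the parent when started
! by the user and as the child when started through MPI_COMM_SPAWN.
program spawn_twice
  use mpi
  implicit none
  integer, parameter :: nrounds = 2   ! assumed: spawn, disconnect, respawn
  integer :: ierr, parent, child, round, btest
  character(len=1024) :: self

  call MPI_INIT(ierr)
  call MPI_COMM_GET_PARENT(parent, ierr)

  if (parent == MPI_COMM_NULL) then
     ! Parent side: spawn one copy of this executable per round.
     call get_command_argument(0, self)
     btest = 17
     do round = 1, nrounds
        print *, 'parent: calling MPI_COMM_SPAWN, round', round
        call MPI_COMM_SPAWN(trim(self), MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0, &
                            MPI_COMM_SELF, child, MPI_ERRCODES_IGNORE, ierr)
        print *, 'parent: returned from MPI_COMM_SPAWN'
        ! On an intercommunicator the sending root passes MPI_ROOT.
        call MPI_BCAST(btest, 1, MPI_INTEGER, MPI_ROOT, child, ierr)
        ! Collective over the intercommunicator; the child calls it too.
        call MPI_COMM_DISCONNECT(child, ierr)
     end do
  else
     ! Child side: the root is rank 0 of the remote (parent) group.
     call MPI_BCAST(btest, 1, MPI_INTEGER, 0, parent, ierr)
     print *, 'child: received', btest
     call MPI_COMM_DISCONNECT(parent, ierr)
  end if

  call MPI_FINALIZE(ierr)
end program spawn_twice
```

Built with an MPI Fortran wrapper compiler, the sketch could be started either as "mpirun -np 1 spawn_twice" or directly as "spawn_twice", which are the two launch modes compared in this message.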

Michael

On Apr 25, 2006, at 2:57 PM, Michael Kluskens wrote:

> I'm running OpenMPI 1.1 (v9704), and when a spawned process exits
> the parent does not die (see previous discussions about
> 1.0.1/1.0.2); however, the next time the parent tries to spawn a
> process, MPI_Comm_spawn does not return.
>
> My test output below:
>
> parent: 0 of 1
> parent: How many processes total?
> 2
> parent: Calling MPI_Comm_spawn to start 1 subprocesses.
> child starting
> parent returned from Comm_Spawn call
> parent: Calling MPI_BCAST with btest = 17 . child = 3
> child 0 of 1: Parent 3
> parent: Calling MPI_Comm_spawn to start 1 subprocesses.
> child 0 of 1: Receiving 17 from parent
> child calling COMM_FREE
> child calling FINALIZE
> child exiting
>
> Notice there is no message saying "parent returned from Comm_Spawn";
> the parent just sits there, and obviously the second set of
> processes doesn't get launched.
>
> Quick note on code fixes: my child process now calls
> MPI_COMM_FREE(parent,ierr) to free the communicator to the parent
> before exiting; in an earlier version of 1.1 this crashed the code.
> I'm guessing this is the right thing to do: the Complete Reference
> book has an example without it, and the Using MPI-2 book has a more
> detailed example with it. In either case, I get the same results.
>
> Background from previous discussion on this follows. It will cost
> me less to test new versions of Open MPI that handle this than to
> work around this issue in my project.
>
> Michael
>
> On Mar 2, 2006, at 1:55 PM, Ralph Castain wrote:
>
>> We expect to have much better support for the entire comm_spawn
>> process in the next incarnation of the RTE. I don't expect that to
>> be included in a release, however, until 1.1 (Jeff may be able to
>> give you an estimate for when that will happen).
>>
>> Jeff et al may be able to give you access to an early non-release
>> version sooner, if better comm_spawn support is a critical issue
>> and you don't mind being patient with the inevitable bugs in such
>> versions.
>>
>> Ralph
>>
>>
>> Edgar Gabriel wrote:
>>> Open MPI currently does not fully support a proper disconnection
>>> of parent and child processes. Thus, if a child dies/aborts, the
>>> parents will abort as well, despite calling
>>> MPI_Comm_disconnect. (The new RTE will have better support for
>>> these operations; Ralph/Jeff can probably give a better estimate
>>> of when this will be available.) However, what should not happen
>>> is that, if the child calls MPI_Finalize (so not a violent death
>>> but a proper shutdown), the parent goes down at the same time. Let
>>> me check that as well...
>>>
>>> Brignone, Sergio wrote:
>>>> Hi everybody, I am trying to run a master/slave set. Because of
>>>> the nature of the problem I need to start and stop (kill) some
>>>> slaves. The problem is that as soon as one of the slaves dies,
>>>> the master dies also.
>
> <child2.f90>
> <parent2.f90>