Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: about dynamic/intercomm_create test from ibm test suite
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-05-28 08:30:29


You can adjust the number of iterations so the parent reaches the end. In my case, I run it in a non-managed environment, so there is no timeout. If you run it that way, you'll see the end result when the parent attempts to finalize.

On May 27, 2014, at 11:18 PM, Gilles Gouaillardet <gilles.gouaillardet_at_[hidden]> wrote:

> Ralph,
>
> i could not find anything wrong with loop_spawn, unless i am missing something obvious :
>
> from mtt http://mtt.open-mpi.org/index.php?do_redir=2196
>
> all tests run this month (both trunk and v1.8) failed with a timeout, and there was no error message such as
> dpm_base_disconnect_init: error -12 in isend to process 1
>
> loop_spawn tries to spawn 2000 tasks in 10 minutes.
> my system is not fast enough to achieve this so the iteration count is bumped
> /* if time exceeded, then bump iteration count to the end */
>
> the test would succeed in 10 minutes and a few seconds (the time required to complete the last spawn and MPI_Finalize())
>
> the slurm timeout is set to 10 minutes exactly, so the job is aborted before it has time to finish (and i believe it would have finished successfully)
>
> you can either increase the slurm timeout (10min30s looks good to me),
> decrease nseconds (570 looks good to me) in loop_spawn.c or run
> mpirun ... dynamic/loop_spawn <nseconds>
> where nseconds is "a bit less" than 600 seconds (once again, 570 looks good to me)
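The time budget described above can be sketched in plain C. This is only an illustration and not the code in loop_spawn.c: parse_nseconds, run_budgeted_loop and no_op_work are hypothetical names, the argument handling is assumed, and the MPI_Comm_spawn call is replaced by a stand-in so the logic can be read (and compiled) without MPI.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper (not from loop_spawn.c): read an optional
 * <nseconds> argument, falling back to a default when it is absent
 * or not a positive integer. */
static int parse_nseconds(int argc, char **argv, int fallback)
{
    char *end = NULL;
    long value;

    if (argc < 2) {
        return fallback;
    }
    errno = 0;
    value = strtol(argv[1], &end, 10);
    if (errno != 0 || end == argv[1] || *end != '\0' ||
        value <= 0 || value > INT_MAX) {
        return fallback;
    }
    return (int) value;
}

/* Stand-in for one spawn/disconnect cycle; the real test would call
 * MPI_Comm_spawn() here (assumption for illustration only). */
static void no_op_work(int iter)
{
    (void) iter;
}

/* Run up to max_iters iterations within a wall-clock budget and
 * return how many iterations actually ran. */
static int run_budgeted_loop(int max_iters, double budget_seconds,
                             void (*do_work)(int))
{
    time_t start = time(NULL);
    int completed = 0;
    int iter;

    assert(max_iters >= 0 && do_work != NULL);

    for (iter = 0; iter < max_iters; ++iter) {
        do_work(iter);
        ++completed;

        /* if time exceeded, then bump iteration count to the end */
        if (difftime(time(NULL), start) >= budget_seconds) {
            iter = max_iters - 1;
        }
    }
    return completed;
}
```

Because the budget is only checked after an iteration completes, the loop always ends slightly later than the budget, and finalization comes on top of that. With a scheduler limit of exactly 600 seconds, a budget of 570 leaves that margin, whereas a budget of 600 does not.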
>
> did i miss something ?
>
> Cheers,
>
> Gilles
>
>
> On Wed, May 28, 2014 at 12:53 PM, Gilles Gouaillardet <gilles.gouaillardet_at_[hidden]> wrote:
> Ralph,
>
>
> On 2014/05/28 12:10, Ralph Castain wrote:
> >> my understanding is that there are two ways of seeing things :
> >> a) the "R-way" : the problem is the parent should not try to communicate to already exited processes
> >> b) the "J-way" : the problem is the children should have waited either in MPI_Comm_free() or MPI_Finalize()
> > I don't think you can use option (b) - we can't have the children lingering around for the parent to call finalize, if I'm understanding you correctly.
> you understood me correctly.
>
> once again, i did not start investigating loop_spawn.
>
> in the case of intercomm_create, we would not run into this if the
> application had explicitly called MPI_Comm_free in the parent.
> so in this case *only*, and as explained by Jeff, b) could be an option
> to make Open MPI happy.
> (to be blunt : if the user is not happy with children lingering around,
> he can explicitly call MPI_Comm_free before calling MPI_Comm_disconnect)
>
> i will start investigating loop_spawn from now
>
> Cheers,
>
> Gilles
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/05/14879.php