MTT Devel Mailing List Archives

Subject: Re: [MTT devel] [MTT svn] GIT: MTT branch master updated. 016088f2a0831b32ab5fd6f60f4cabe67e92e594
From: Mike Dubman (miked_at_[hidden])
Date: 2014-06-25 01:19:37


Hi,
Sorry for the incomplete description. I will trace the problem more closely
next week and provide details.

M

On Mon, Jun 23, 2014 at 10:13 PM, Jeff Squyres (jsquyres) <jsquyres_at_[hidden]> wrote:

> Ok, just got in to Chicago from my flight and am back online.
>
> Mike: you are still not providing very much information. :-\
>
> Your first mails make it seem like MTT is continuing to run, but leaving
> "launchers" (presumably mpirun processes) still running even though they
> have no children. That would be very weird for mpirun to do if it has no
> children left. In this case, it could be both an MTT and an ORTE bug.
>
> But your last mail seems to imply that MTT is hanging indefinitely.
>
> Can you please provide a clear, precise description of what is happening?
>
> FWIW: Yes, we are killing the parent first now, to give mpirun a chance to
> clean up / tell remote orteds to die / kill child processes / etc.
> Killing the children first doesn't test the common case of how people
> kill MPI processes (i.e., they kill mpirun), and it also doesn't allow
> mpirun to tell remote processes to die.
>
> Do you run with --verbose output? MTT should output messages like "***
> Killing mpirun with SIGTERM", and the like. Do you see timeout messages at
> all? I.e., is MTT not entering the timeout code at all?
>
> ...etc.
>
>
>
> On Jun 23, 2014, at 12:16 PM, Dave Goodell (dgoodell) <dgoodell_at_[hidden]>
> wrote:
>
> > On Jun 23, 2014, at 8:48 AM, Mike Dubman <miked_at_[hidden]> wrote:
> >
> >> btw, I think that now, when the parent process is killed before the
> >> child, the OS marks the child as "<defunct>" and it sticks around for good.
> >
> > The grandparent should inherit the child. If the grandparent then does
> > not wait(2) on the child, then the child will remain a zombie / defunct.
> > So in our specific case, this behavior will depend on what the parent
> > process of mpirun is and whether it is waiting on child processes
> > appropriately.
> >
> > -Dave
> >
> > _______________________________________________
> > mtt-devel mailing list
> > mtt-devel_at_[hidden]
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/mtt-devel
> > Link to this post:
> > http://www.open-mpi.org/community/lists/mtt-devel/2014/06/0633.php
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> mtt-devel mailing list
> mtt-devel_at_[hidden]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/mtt-devel
> Link to this post:
> http://www.open-mpi.org/community/lists/mtt-devel/2014/06/0634.php
>
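
The two mechanisms discussed in this thread (signalling the launcher itself first, and a "<defunct>" entry lingering until someone calls wait(2)) can be sketched as follows. This is only an illustration under stated assumptions, not MTT's actual code: the function names stop_launcher, pid_exists and zombie_demo are hypothetical, the 5-second grace period is arbitrary, and a sleeping Python process stands in for mpirun. It only shows a direct parent reaping its own child; which process ends up adopting an orphan depends on the OS (on Linux that is typically init or a designated subreaper).

```python
import os
import signal
import subprocess
import sys


def stop_launcher(proc, grace_seconds=5.0):
    """Signal the launcher itself first; escalate only if it does not exit."""
    proc.send_signal(signal.SIGTERM)      # polite request, lets it clean up
    try:
        proc.wait(timeout=grace_seconds)  # wait() also reaps the process
    except subprocess.TimeoutExpired:
        proc.kill()                       # SIGKILL cannot be caught or ignored
        proc.wait()
    return proc.returncode                # negative signal number if killed by a signal


def pid_exists(pid):
    """True while the PID is in the process table (zombies included)."""
    try:
        os.kill(pid, 0)                   # signal 0 only checks existence
    except ProcessLookupError:
        return False
    return True


def zombie_demo():
    """Fork a child that exits at once; observe it before and after wait."""
    pid = os.fork()
    if pid == 0:
        os._exit(7)                       # child: exit without cleanup handlers
    before = pid_exists(pid)              # running or <defunct>: still listed
    _, status = os.waitpid(pid, 0)        # reap; the table entry is released
    after = pid_exists(pid)
    return before, os.WEXITSTATUS(status), after


if __name__ == "__main__":
    # A sleeping Python process stands in for mpirun here (assumption).
    launcher = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    print(stop_launcher(launcher))
    print(zombie_demo())
```

The point of the second half is that an exited child keeps its process-table entry until its current parent waits on it, which is why the behavior in the thread depends on what the parent of mpirun is and whether it reaps its children.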