MTT Devel Mailing List Archives

Subject: Re: [MTT devel] [MTT svn] GIT: MTT branch master updated. 016088f2a0831b32ab5fd6f60f4cabe67e92e594
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2014-06-25 06:15:38


Ok, thanks. In the meantime, please roll back to the v3.0.0 tag and you should be good. Sorry for the hassle. :-(

On Jun 25, 2014, at 12:19 AM, Mike Dubman <miked_at_[hidden]> wrote:

> Hi
> Sorry for the incomplete description. I will trace the problem more closely later next week and provide details.
>
> M
>
>
> On Mon, Jun 23, 2014 at 10:13 PM, Jeff Squyres (jsquyres) <jsquyres_at_[hidden]> wrote:
> Ok, just got into Chicago from my flight and am back online.
>
> Mike: you are still not providing very much information. :-\
>
> Your first mails make it seem like MTT is continuing to run but leaving "launchers" (presumably mpirun processes) still running, even though they have no children. That would be very weird for mpirun to do if it has no children left. In that case, this could be both an MTT and an ORTE bug.
>
> But your last mail seems to imply that MTT is hanging indefinitely.
>
> Can you please provide a clear, precise description of what is happening?
>
> FWIW: Yes, we are killing the parent first now, to give mpirun a chance to clean up / tell remote orteds to die / kill child processes / etc. Killing the children first doesn't test the common case of how people kill MPI processes (i.e., they kill mpirun), and it also doesn't allow mpirun to tell remote processes to die.
>
> Do you run with --verbose output? MTT should output messages like "*** Killing mpirun with SIGTERM", and the like. Do you see timeout messages at all? I.e., is MTT not entering the timeout code at all?
>
> ...etc.
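
A minimal sketch of the parent-first kill order described above, in Python for illustration only. This is not MTT's actual code: the function names, the grace period, and the stand-in launcher (a sleeping Python process in place of mpirun) are all assumptions. The idea is to signal the launcher itself with SIGTERM so that it gets a chance to clean up, and to escalate to SIGKILL only if it does not exit in time.

```python
import signal
import subprocess
import sys


def spawn_stand_in_launcher():
    """Hypothetical stand-in for mpirun: a process that just sleeps."""
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )


def kill_parent_first(proc, grace_seconds=5.0, verbose=True):
    """Signal the launcher itself; escalate only if SIGTERM is not enough.

    Returns the name of the last signal that was sent.
    """
    if verbose:
        print("*** Killing launcher with SIGTERM")
    proc.send_signal(signal.SIGTERM)
    try:
        # wait() also reaps the launcher, so it does not linger as <defunct>.
        proc.wait(timeout=grace_seconds)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        if verbose:
            print("*** Launcher still running; killing with SIGKILL")
        proc.kill()   # SIGKILL cannot be caught or ignored
        proc.wait()   # reap after the forced kill as well
        return "SIGKILL"
```

Calling `kill_parent_first(spawn_stand_in_launcher())` exercises the SIGTERM path. Note that the sketch only signals the launcher, never its children, which matches the "kill mpirun and let it tell the remote processes to die" ordering discussed in this thread.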
>
>
>
> On Jun 23, 2014, at 12:16 PM, Dave Goodell (dgoodell) <dgoodell_at_[hidden]> wrote:
>
> > On Jun 23, 2014, at 8:48 AM, Mike Dubman <miked_at_[hidden]> wrote:
> >
> >> BTW, I think that now, when the parent process is killed before the child, the OS marks the child as "<defunct>", and it sticks around for good.
> >
> > The grandparent should inherit the child. If the grandparent then does not wait(2) on the child, then the child will remain a zombie / defunct. So in our specific case, this behavior will depend on what the parent process of mpirun is and whether it is waiting on child processes appropriately.
> >
> > -Dave
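
The wait(2) point above can be reproduced with a small sketch, assuming a Linux system with /proc mounted (the helper names are hypothetical). A child that has exited stays in state "Z" (shown as `<defunct>` by ps) until its current parent waits on it. One caveat worth hedging: on Linux an orphaned process is normally reparented to init (PID 1) or to the nearest "child subreaper" rather than literally to its grandparent, but the conclusion is the same: whichever process ends up as the parent has to call wait.

```python
import os
import time


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # Format is "pid (comm) state ..."; comm may contain spaces or ")".
    return stat.rsplit(")", 1)[1].split()[0]


def demo_zombie_and_reap():
    """Fork a child that exits at once; report its state before/after wait."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately without cleanup handlers

    # Parent: poll until the child has exited but has not been waited on.
    deadline = time.monotonic() + 5.0
    while proc_state(pid) != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
    before = proc_state(pid)

    os.waitpid(pid, 0)  # the wait(2) step: this is what removes the zombie
    after = proc_state(pid)
    return before, after
```

`demo_zombie_and_reap()` returns the child's state before and after the `waitpid` call. A parent that never makes that call leaves the entry in the process table for as long as the parent itself keeps running.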
> >
> > _______________________________________________
> > mtt-devel mailing list
> > mtt-devel_at_[hidden]
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/mtt-devel
> > Link to this post: http://www.open-mpi.org/community/lists/mtt-devel/2014/06/0633.php
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/