Ok, just got in to Chicago from my flight and am back online.
Mike: you are still not providing very much information. :-\
Your first mails make it seem like MTT is continuing to run, but leaving "launchers" (presumably mpirun processes) still running even though they have no children. That would be very weird for mpirun to do if it has no children left. In that case, this could be both an MTT bug and an ORTE bug.
But your last mail seems to imply that MTT is hanging indefinitely.
Can you please provide a clear, precise description of what is happening?
FWIW: Yes, we are killing the parent first now, to give mpirun a chance to clean up / tell remote orteds to die / kill child processes / etc. Killing the children first doesn't test the common case of how people kill MPI processes (i.e., they kill mpirun), and it also doesn't allow mpirun to tell remote processes to die.
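To illustrate the "kill the parent first" idea, here is a minimal Python sketch. It is not MTT's actual code (MTT is written in Perl); the function name `stop_launcher`, the 5-second grace period, and the stand-in "launcher" process are all assumptions made up for this example. It sends SIGTERM to the launcher, waits a while so the launcher can clean up, and only escalates to SIGKILL if the launcher is still alive after the grace period.

```python
import signal
import subprocess
import sys


def stop_launcher(proc, grace_seconds=5.0):
    """Send SIGTERM to the launcher first; escalate to SIGKILL only on timeout.

    Returns the name of the last signal that was sent.
    (Hypothetical helper for illustration; not part of MTT.)
    """
    # Polite request: gives the launcher a chance to clean up its children.
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        # The launcher ignored SIGTERM or is stuck; force it down.
        proc.kill()
        proc.wait()
        return "SIGKILL"


# Stand-in for mpirun: a child interpreter that would otherwise sleep for 60s.
launcher = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"]
)
last_signal = stop_launcher(launcher)
print(last_signal, launcher.returncode)
```

Note that calling `proc.wait()` in both branches also reaps the launcher, so the sketch itself does not leave a defunct process behind.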
Do you run with --verbose output? MTT should output messages like "*** Killing mpirun with SIGTERM", and the like. Do you see timeout messages at all? I.e., is MTT not entering the timeout code at all?
On Jun 23, 2014, at 12:16 PM, Dave Goodell (dgoodell) <dgoodell_at_[hidden]> wrote:
> On Jun 23, 2014, at 8:48 AM, Mike Dubman <miked_at_[hidden]> wrote:
>> btw, i think now, when parent process is killed before child, OS makes child as "<defunct>" which stick around for good.
> The grandparent should inherit the child. If the grandparent then does not wait(2) on the child, then the child will remain a zombie / defunct. So in our specific case, this behavior will depend on what the parent process of mpirun is and whether it is waiting on child processes appropriately.
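The wait(2) point can be seen in a small Python sketch. It assumes a POSIX system and relies on the Linux behavior that `kill(pid, 0)` still succeeds for a zombie; the helper name `pid_exists` is made up for this example. It only shows that an exited child keeps its process-table entry until whichever process is its parent waits on it; it does not model reparenting.

```python
import os
import time


def pid_exists(pid):
    """Return True if the PID still has a process-table entry (hypothetical helper)."""
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


pid = os.fork()
if pid == 0:
    os._exit(0)  # child: exit immediately without running any cleanup

time.sleep(0.5)  # give the child time to exit; it is now a zombie
before_wait = pid_exists(pid)  # the entry is still there until we wait

reaped_pid, status = os.waitpid(pid, 0)  # reap the child
after_wait = pid_exists(pid)  # the entry is gone once the child is reaped

print(before_wait, after_wait)
```

If the parent never makes the `waitpid` call, the child stays in the defunct state for as long as that parent keeps running.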