Open MPI logo

MTT Devel Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all MTT Devel mailing list

Subject: Re: [MTT devel] [MTT svn] GIT: MTT branch master updated. 016088f2a0831b32ab5fd6f60f4cabe67e92e594
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2014-06-23 07:18:19


There was actually quite a bit of testing before this was committed. This commit resolved a lot of hangs across multiple organizations.

Can you be more specific as to what is happening?

The prior code was killing child processes before mpirun itself, for example, which has led MTT to erroneously report that mpirun died right at the end of the timeout without being killed. This has been ongoing for many months, at a minimum.

On Jun 23, 2014, at 4:37 AM, Mike Dubman <miked_at_[hidden]> wrote:

> this commit does more harm then good.
> we experience following:
>
> - some child processes still running after timeout and mtt killed the job.
>
> before this commit - it worked fine.
> please revert and test more.
>
>
>
> On Sat, Jun 21, 2014 at 3:30 PM, MPI Team <mpiteam_at_[hidden]> wrote:
> The branch, master has been updated
> via 016088f2a0831b32ab5fd6f60f4cabe67e92e594 (commit)
> via 7fb4c6a4c9d71be127ea53bd463178510577f71f (commit)
> via 381ba177d835a54c3197d846f5a4edfc314efe27 (commit)
> via cfdd29de2012eeb7592706f00dd07a52dd48cf6b (commit)
> via 940030ca20eb1eaf256e898b83866c1cb83aca5c (commit)
> from c99ed7c7b159a2cab58a251bd7c0dad8972ff901 (commit)
>
> Those revisions listed above that are new to this repository have
> not appeared on any other notification email; so we list those
> revisions in full, below.
>
> - Log -----------------------------------------------------------------
> https://github.com/open-mpi/mtt/commit/016088f2a0831b32ab5fd6f60f4cabe67e92e594
>
> commit 016088f2a0831b32ab5fd6f60f4cabe67e92e594
> Author: Jeff Squyres <jsquyres_at_[hidden]>
> Date: Sat Jun 21 04:58:45 2014 -0700
>
> DoCommand: several fixes to kill_proc logic
>
> 1. Fix the kill(0, $pid) test to see if the process was still alive.
>
> 2. Rename _kill_proc() to _kill_proc_tree() to indicate that it's
> really killing not only the PID in question, but also all of its
> descendants.
>
> 3. In _kill_proc_tree(), change the order to kill the main PID first,
> and ''then'' kill all the descendants.
>
> The main use case is when killing mpirun: if we kill mpirun's
> descendants ''first'', mpirun will detect its childrens' deaths and
> then cleanup and exit. Later, when MTT finally gets around to killing
> mpirun, MTT will detect that mpirun is already dead and therefore emit
> a confusing "mpirun died right at end of timeout" message. This is
> misleading at best; it doesn't indicate what actually happened.
>
> However, if we kill mpirun first, it will take care of killing all of
> its descendants. MTT will therefore emit the right messages about
> killing mpirun. MTT will then redundantly try to kill a bunch of
> now-nonexistent descendant processes of mpirun, but that's ok/safe.
> We actually ''want'' this try-to-kill-mpirun's-descendants behavior to
> handle the case when mpirun is misbehaving / not cleaning up its
> descendants.
>
> 4. DoCommand() is used for more than launching mpirun, so pass down
> $argv0 so that we can print the actual command name that is being
> killed in various Verbose/Debug messages, not the hard-coded "mpirun"
> string (which, in practice, was probably almost always correct, but
> still...).
> ---
> lib/MTT/DoCommand.pm | 78 ++++++++++++++++++++++++++++++++++++----------------
> 1 file changed, 55 insertions(+), 23 deletions(-)
>
> diff --git a/lib/MTT/DoCommand.pm b/lib/MTT/DoCommand.pm
> index 02cdb94..646ca31 100644
> --- a/lib/MTT/DoCommand.pm
> +++ b/lib/MTT/DoCommand.pm
> @@ -2,7 +2,7 @@
> #
> # Copyright (c) 2005-2006 The Trustees of Indiana University.
> # All rights reserved.
> -# Copyright (c) 2006-2013 Cisco Systems, Inc. All rights reserved.
> +# Copyright (c) 2006-2014 Cisco Systems, Inc. All rights reserved.
> # Copyright (c) 2007-2008 Sun Microsystems, Inc. All rights reserved.
> # Copyright (c) 2007-2012 High Performance Computing Center Stuttgart,
> # University of Stuttgart. All rights reserved.
> @@ -57,23 +57,27 @@ sub DoCommand {
> ($time_arg, $no_execute) = @_;
> }
>
> +# This function is called for killing both mpirun and each of its
> +# descendants. We really only want to see verbose output for when we
> +# kill mpirun itself, so only show output when the caller provides us
> +# with a $argv0 value.
> sub _kill_proc_one {
> - my ($pid) = @_;
> + my ($pid, $argv0) = @_;
>
> # How long to wait after each kill()
> my $wait_time = 5;
>
> # See if the proc is alive first
> - my $kid;
> - kill(0, $pid);
> - $kid = waitpid($pid, WNOHANG);
> - return "mpirun died right at end of timeout (MTT did not have to kill it)"
> - if (-1 == $kid);
> + my $num_alive = kill(0, $pid);
> + return "$argv0 died right at end of timeout (MTT did not have to kill it)"
> + if (0 == $num_alive);
>
> # Try an easy kill
> kill("TERM", $pid);
> - Verbose("*** Killing mpirun with SIGTERM\n");
> + Verbose("*** Killing $argv0 with SIGTERM\n")
> + if (defined($argv0));
> # Give mpirun some time to cleanup before we try to reap it.
> + my $kid;
> my $i = $wait_time;
> while ($i > 0) {
> sleep(1);
> @@ -85,45 +89,74 @@ sub _kill_proc_one {
> # process no longer exists (i.e., we get -1 back from
> # waitpid), or we successfully killed it (i.e., we got the PID
> # back from waitpid).
> - return "MTT killed mpirun via SIGTERM" if (0 != $kid);
> + return "MTT killed $argv0 via SIGTERM" if (0 != $kid);
>
> --$i;
> }
> - Verbose("** Kill TERM (after $wait_time seconds) didn't work!\n");
> + Verbose("** Kill TERM (after $wait_time seconds) didn't work!\n")
> + if (defined($argv0));
>
> # That didn't work either. Try SIGINT;
> - Verbose("*** Killing mpirun with SIGINT\n");
> + Verbose("*** Killing $argv0 with SIGINT\n")
> + if (defined($argv0));
> kill("INT", $pid);
> my $i = $wait_time;
> while ($i > 0) {
> sleep(1);
> $kid = waitpid($pid, WNOHANG);
> - return "MTT killed mpirun via SIGINT" if (0 != $kid);
> + return "MTT killed $argv0 via SIGINT" if (0 != $kid);
> --$i;
> }
> - Verbose("** Kill INT (after $wait_time seconds) didn't work!\n");
> + Verbose("** Kill INT (after $wait_time seconds) didn't work!\n")
> + if (defined($argv0));
>
> # Ok, now we're mad. Be violent.
> - Verbose("*** Now I'm mad. Killing mpirun with SIGKILL\n");
> + Verbose("*** Now I'm mad. Killing $argv0 with SIGKILL\n")
> + if (defined($argv0));
> my $count = 0;
> while (1) {
> kill("KILL", $pid);
> ++$count;
> $kid = waitpid($pid, WNOHANG);
> - return "MTT killed mpirun via $count SIGKILLs" if (0 != $kid);
> - Verbose("** Kill KILL didn't work! Sleeping and trying again...\n");
> + return "MTT killed $argv0 via $count SIGKILLs" if (0 != $kid);
> + Verbose("** Kill KILL didn't work! Sleeping and trying again...\n")
> + if (defined($argv0));
> sleep(1);
> }
> }
>
>
> -sub _kill_proc {
> - my ($pid) = @_;
> - # kill the group, take the names later
> - foreach my $process (descendant_processes($pid)) {
> +sub _kill_proc_tree {
> + my ($pid, $argv0) = @_;
> +
> + # Find all descendent processes of the main PID
> + my @children = descendant_processes($pid);
> +
> + # Kill the main PID. Note that _kill_proc_one() will give the
> + # process time to react/cleanup, so there's no need for an
> + # additional delay after it returns.
> + my $ret = _kill_proc_one($pid, $argv0);
> +
> + # Now kill all the processes that descended from the base PID.
> + #
> + # Note: when mpirun is working properly (which is one of the
> + # biggest use cases for DoCommand), this is redundant -- all the
> + # children will already be dead (because killing mpirun will
> + # ensure that all descendant processes are also killed).
> + #
> + # That being said, a) DoCommand() is used to launch more than just
> + # mpirun, b) mpirun breaks sometimes and doesn't clean up its
> + # descendants, and c) it's safe to call _kill_proc_one() on a PID
> + # that is already dead.
> + foreach my $process (@children) {
> + Debug("=== killing child proc: $process->{pid}, $process->{argv0}\n");
> +
> + # Ignore the return. For example, we don't care if the child
> + # is already dead
> _kill_proc_one($process->{pid});
> }
> - return _kill_proc_one($pid);
> +
> + return $ret;
> }
>
> #--------------------------------------------------------------------------
> @@ -338,7 +371,6 @@ sub Cmd {
>
> my $tokens = _quote_escape($cmd);
>
> -
> my $pid;
> if (! $no_execute) {
>
> @@ -568,7 +600,7 @@ sub Cmd {
>
> $done = 0;
> }
> - $timeout_message = _kill_proc($pid);
> + $timeout_message = _kill_proc_tree($pid, ${$tokens}[0]);
>
> # We don't care about the exit status if we timed out
> # -- fill it with a bogus value.
>
> https://github.com/open-mpi/mtt/commit/7fb4c6a4c9d71be127ea53bd463178510577f71f
>
> commit 7fb4c6a4c9d71be127ea53bd463178510577f71f
> Author: Jeff Squyres <jsquyres_at_[hidden]>
> Date: Sat Jun 21 04:42:19 2014 -0700
>
> DoCommand: replace obscure perl code with easier-to-grok version
>
> The descendant_processes() and flatten() codes were written in a
> nicely efficient yet very difficult to understand manner. This commit
> replaces these two subs with admittedly less efficient but much easier
> to understand code.
>
> The new code also saves the first token of the command for each PID so
> that it can be used in Debug/Verbose statements (e.g., identify the
> command that is being killed, etc.).
> ---
> lib/MTT/DoCommand.pm | 63 ++++++++++++++++++++++++++++++++++++++++------------
> 1 file changed, 49 insertions(+), 14 deletions(-)
>
> diff --git a/lib/MTT/DoCommand.pm b/lib/MTT/DoCommand.pm
> index c3be8a8..02cdb94 100644
> --- a/lib/MTT/DoCommand.pm
> +++ b/lib/MTT/DoCommand.pm
> @@ -120,8 +120,8 @@ sub _kill_proc_one {
> sub _kill_proc {
> my ($pid) = @_;
> # kill the group, take the names later
> - foreach my $p (descendant_processes($pid)) {
> - _kill_proc_one($p);
> + foreach my $process (descendant_processes($pid)) {
> + _kill_proc_one($process->{pid});
> }
> return _kill_proc_one($pid);
> }
> @@ -527,7 +527,8 @@ sub Cmd {
> if (defined($timeout_backtrace_program) and !$got_backtrace) {
> $backtrace = "";
> if ( $timeout_before_backtrace_program ) {
> - foreach my $p (descendant_processes($pid)) {
> + foreach my $process (descendant_processes($pid)) {
> + my $p = $process->{pid};
> my $c = $timeout_before_backtrace_program;
> $c =~ s/%PID%/$p/g;
> Debug("_pre_backtrace cmd: $c\n");
> @@ -538,7 +539,8 @@ sub Cmd {
> $backtrace .= _get_backtrace($timeout_backtrace_program, $pid, $pre_pernode);
>
> if ( $timeout_after_backtrace_program ) {
> - foreach my $p (descendant_processes($pid)) {
> + foreach my $process (descendant_processes($pid)) {
> + my $p = $process->{pid};
> my $c = $timeout_after_backtrace_program;
> $c =~ s/%PID%/$p/g;
> Debug("_post_backtrace cmd: $c\n");
> @@ -738,19 +740,50 @@ sub _do_email_timeout_notification {
> close(TIMEOUT_SENTINEL_FILE);
> }
>
> -sub flatten {
> - map{ (ref($_) eq "ARRAY") ? map{flatten($_)}@$_ : $_ } @_;
> +sub find_children {
> + my ($pids, $start) = @_;
> +
> + my @ret;
> + # Scan the PID list and see if any other PIDs list $start as their
> + # PPID. If so, save that PID and then recursively look for *that*
> + # PID's children.
> + foreach my $pid (sort(keys(%{$pids}))) {
> + if ($pids->{$pid}->{ppid} == $start) {
> + push(@ret, {
> + pid => $pid,
> + argv0 => $pids->{$pid}->{argv0},
> + });
> +
> + # Add this PID's descendants to the return array
> + push(@ret, find_children($pids, $pid));
> + }
> + }
> +
> + return @ret;
> }
>
> sub descendant_processes {
> my ($base) = (@_, $$);
> - my %parentage = (`ps -eo pid,ppid` =~ /\d+/g);
> - my %reverse = map { ($_, [$_]) } %parentage;
> - while (my ($pid,$ppid) = each %parentage){
> - push @{$reverse{$ppid}}, $reverse{$pid};
> + open(PS, "ps -eo pid,ppid,cmd|") || die "Can't open ps";
> +
> + # Skip the title line
> + <PS>;
> +
> + # Read all the data lines
> + my $pids;
> + while (<PS>) {
> + # Grab the PID, PPID, and first token of the command
> + $_ =~ m/(\d+)\s+(\d+)\s+(\S+)/;
> + $pids->{$1} = {
> + pid => $1,
> + ppid => $2,
> + argv0 => $3,
> + };
> }
> - shift @{$reverse{$base}};
> - flatten($reverse{$base});
> + close(PS);
> +
> + # Find all the descendants of the $base PID
> + return find_children($pids, $base);
> }
>
> sub _get_backtrace {
> @@ -774,7 +807,8 @@ sub _get_backtrace {
>
> # Use ps -Af output to fetch the child pids,
> # and grab a stack trace from each one
> - foreach my $p (descendant_processes($pid)) {
> + foreach my $process (descendant_processes($pid)) {
> + my $p = $process->{pid};
> $gdb_cmd = "gdb - $p --command=$gdb_command_filename --batch";
> $ret .= "\n $gdb_cmd";
> $ret .= "\n" . `$gdb_cmd`;
> @@ -814,8 +848,9 @@ sub _get_backtrace {
> }
> }
> Debug("Stacktrace: base name $return_basename\n");
> - #foreach my $p (descendant_processes($pid))
> + #foreach my $process (descendant_processes($pid))
> #{
> + # my $p = $process->{pid};
> # my $gstack_cmd = "gstack $p";
> # $ret .= "\n $gstack_cmd";
> # $ret .= "\n" . `$gstack_cmd`;
>
> https://github.com/open-mpi/mtt/commit/381ba177d835a54c3197d846f5a4edfc314efe27
>
> commit 381ba177d835a54c3197d846f5a4edfc314efe27
> Author: Jeff Squyres <jsquyres_at_[hidden]>
> Date: Sat Jun 21 03:48:55 2014 -0700
>
> DoCommand: ensure to only kill the process once
>
> When the DoCommand timeout fires, MTT will kill the entire process
> tree of the launched command. However, the stdout/stderr sockets may
> still be draining, and therefore the main DoCommand loop may still
> execute more iterations. In cases like these, ensure that we don't
> try to backtrace/kill the process tree again.
>
> Also add a few more Debug statements for detailed output.
> ---
> lib/MTT/DoCommand.pm | 23 ++++++++++++++++-------
> 1 file changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/lib/MTT/DoCommand.pm b/lib/MTT/DoCommand.pm
> index 0e253f0..c3be8a8 100644
> --- a/lib/MTT/DoCommand.pm
> +++ b/lib/MTT/DoCommand.pm
> @@ -471,6 +471,7 @@ sub Cmd {
> my $len = sysread(OUTread, $data, 99999);
> if (0 == $len) {
> vec($rin, fileno(OUTread), 1) = 0;
> + Debug("*** Child process stdout closed\n");
> --$done;
> } else {
> _append($data, $print_timestamp ? localtime() : "",
> @@ -485,6 +486,7 @@ sub Cmd {
> my $len = sysread(ERRread, $data, 99999);
> if (0 == $len) {
> vec($rin, fileno(ERRread), 1) = 0;
> + Debug("*** Child process stderr closed\n");
> --$done;
> } else {
> _append($data, $print_timestamp ? localtime() : "",
> @@ -493,11 +495,19 @@ sub Cmd {
> }
> }
>
> - # If we're running with a timeout, bail if we're past the end
> - # time
> - if (defined($end_time) && time() > $end_time) {
> + # If we're running with a timeout, check to see if a) the
> + # process is still running (i.e., stdout/stderr is still
> + # open), and b) we're past the end time.
> + if ($done > 0 && defined($end_time) && time() > $end_time) {
> my $over = time() - $end_time;
> - if ($over > $last_over) {
> +
> + # Note that we only want to backtrace/kill the process
> + # *once*. Consider: it is possible that we kill the
> + # process tree, but then stdout/stderr are still draining,
> + # and therefore we loop around here again *even though the
> + # process tree is already dead*. So put a little
> + # do-this-only-once protection in here.
> + if (!defined($timeout_message) && $over > $last_over) {
> Verbose("*** Past timeout of $timeout seconds by $over seconds\n");
>
> # Handle timeout file
> @@ -555,10 +565,8 @@ sub Cmd {
> );
>
> $done = 0;
> - $timeout_message = _kill_proc($pid);
> - } else {
> - $timeout_message = _kill_proc($pid);
> }
> + $timeout_message = _kill_proc($pid);
>
> # We don't care about the exit status if we timed out
> # -- fill it with a bogus value.
> @@ -576,6 +584,7 @@ sub Cmd {
> }
> }
> }
> + Debug("*** Child process now dead\n");
> close OUTerr;
> close OUTread
> if (!$merge_output);
>
> https://github.com/open-mpi/mtt/commit/cfdd29de2012eeb7592706f00dd07a52dd48cf6b
>
> commit cfdd29de2012eeb7592706f00dd07a52dd48cf6b
> Author: Jeff Squyres <jsquyres_at_[hidden]>
> Date: Sat Jun 21 03:36:08 2014 -0700
>
> DoCommand: remove unused variables
> ---
> lib/MTT/DoCommand.pm | 2 --
> 1 file changed, 2 deletions(-)
>
> diff --git a/lib/MTT/DoCommand.pm b/lib/MTT/DoCommand.pm
> index 0740b28..0e253f0 100644
> --- a/lib/MTT/DoCommand.pm
> +++ b/lib/MTT/DoCommand.pm
> @@ -445,7 +445,6 @@ sub Cmd {
> $end_time = time() + $timeout;
> Debug("Timeout: $timeout - $end_time (vs. now: " . time() . ")\n");
> }
> - my $killed_status = undef;
> my $last_over = 0;
> my $partial_out;
> my $partial_err;
> @@ -900,7 +899,6 @@ sub Win_Cmd {
> $end_time = time() + $timeout;
> Debug("Timeout: $timeout - $end_time (vs. now: " . time() . ")\n");
> }
> - my $killed_status = undef;
> my $last_over = 0;
>
> while(<OUTread>) {
>
> https://github.com/open-mpi/mtt/commit/940030ca20eb1eaf256e898b83866c1cb83aca5c
>
> commit 940030ca20eb1eaf256e898b83866c1cb83aca5c
> Author: Jeff Squyres <jsquyres_at_[hidden]>
> Date: Fri Jun 20 17:13:08 2014 -0700
>
> Diskfree.pm: fix df handle leak
>
> This left a zombied "df" process in the process table. The Perl
> runtime eventually cleaned it up, but it was weird to see a zombied df
> in ps listings.
>
> There's no point in reporting this bug upstream; according to the
> Changes file, v0.06 was published in 1998. There's been no activity
> on http://search.cpan.org/dist/Filesys-DiskFree/DiskFree.pm since
> v0.06.
> ---
> lib/Filesys/DiskFree.pm | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/lib/Filesys/DiskFree.pm b/lib/Filesys/DiskFree.pm
> index 1fa655d..41a27d7 100644
> --- a/lib/Filesys/DiskFree.pm
> +++ b/lib/Filesys/DiskFree.pm
> @@ -124,8 +124,9 @@ sub df(){
> $cmd=$self->command() or
> croak "No df command known for format ".$self->{'FORMAT'};
> open(HANDLE,"$cmd|") or croak("Cannot fork \"$cmd\": $!");
> - return $self->load(\*HANDLE);
> + my $ret = $self->load(\*HANDLE);
> close(HANDLE) or croak("Cannot df $!");
> + return $ret;
> }
>
> sub load() {
>
> -----------------------------------------------------------------------
>
> Summary of changes:
> lib/Filesys/DiskFree.pm | 3 +-
> lib/MTT/DoCommand.pm | 162 +++++++++++++++++++++++++++++++++++-------------
> 2 files changed, 120 insertions(+), 45 deletions(-)
>
>
> hooks/post-receive
> --
> MTT
> _______________________________________________
> mtt-svn mailing list
> mtt-svn_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/mtt-svn
>
> _______________________________________________
> mtt-devel mailing list
> mtt-devel_at_[hidden]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/mtt-devel
> Searchable archives: http://www.open-mpi.org/community/lists/mtt-devel/2014/06/index.php

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/