MTT Devel Mailing List Archives

Subject: Re: [MTT devel] Analysis of hung jobs.
From: Ethan Mallove (ethan.mallove_at_[hidden])
Date: 2009-10-06 11:25:46


On Tue, Oct/06/2009 10:23:48AM, Ashley Pittman wrote:
>
> Further to the mail linked below, padb is able to perform diagnostics,
> including backtraces on hung jobs, and integrates well into automated
> testing environments.

Can padb get a backtrace from a non-debuggable MPI (e.g., not compiled
with -g)?

-Ethan

>
> The attached patch is a minimal change which should enable the
> functionality. However, I don't have access to a working MTT
> installation to test it.
>
> http://www.open-mpi.org/community/lists/mtt-devel/2009/06/0415.php
>
> This will require a HEAD version of padb, at least r273 to allow it to
> accept the pid of mpirun rather than a jobid assigned by the underlying
> resource manager.
>
> Yours,
>
> Ashley,
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk

> Index: lib/MTT/DoCommand.pm
> ===================================================================
> --- lib/MTT/DoCommand.pm (revision 1322)
> +++ lib/MTT/DoCommand.pm (working copy)
> @@ -359,6 +359,7 @@
> }
> my $killed_status = undef;
> my $last_over = 0;
> + my $padb_output;
> while ($done > 0) {
> my $nfound = select($rout = $rin, undef, undef, $t);
> if (vec($rout, fileno(OUTread), 1) == 1) {
> @@ -410,6 +411,8 @@
> my $timeout_email_recipient = $MTT::Globals::Values->{docommand_timeout_notify_email};
> my $timeout_notify_timeout = $MTT::Globals::Values->{docommand_timeout_notify_timeout};
>
> + $padb_output = `padb --config-option rmgr=mpirun --full-report=$pid`;
> +
> if (defined($timeout_sentinel_file)) {
>
> # Email someone, if an email address has been specified
> @@ -493,6 +496,9 @@
> # Return an anonymous hash containing the relevant data
>
> $ret->{result_stdout} = join('', @out);
> + if ( defined $padb_output ) {
> + $ret->{result_stdout} .= "\n$padb_output";
> + }
> $ret->{result_stderr} = join('', @err),
> if (!$merge_output);
> return $ret;
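
For readers without an MTT checkout, the idea behind the patch can be sketched as a standalone script: when a command exceeds its timeout, run a diagnostic against its pid while it is still alive, then kill it and append the diagnostic text to the captured stdout. This is only a rough sketch under stated assumptions, not MTT's actual code: `run_with_diagnostic`, its arguments, and the `timed_out` key are hypothetical names, and stdout goes to a temporary file rather than through the `select` loop that DoCommand.pm uses.

```perl
use strict;
use warnings;
use POSIX qw(:sys_wait_h);
use File::Temp qw(tempfile);
use Time::HiRes qw(sleep time);

# Hypothetical helper, not part of MTT. Runs @$cmd; if it is still running
# after $timeout seconds, calls $diag->($pid) while the process is alive,
# then kills it. Diagnostic output is appended to stdout, as in the patch.
sub run_with_diagnostic {
    my ($cmd, $timeout, $diag) = @_;

    # Send the child's stdout to a temp file so a chatty child cannot block
    my $fh = tempfile();

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        open(STDOUT, '>&', $fh) or POSIX::_exit(126);
        exec(@$cmd) or POSIX::_exit(127);
    }

    my ($timed_out, $diag_output) = (0, undef);
    my $deadline = time() + $timeout;
    while (waitpid($pid, WNOHANG) == 0) {
        if (time() >= $deadline) {
            $timed_out   = 1;
            $diag_output = $diag->($pid);    # inspect before killing
            kill 'KILL', $pid;
            waitpid($pid, 0);
            last;
        }
        sleep 0.05;
    }

    # Parent and child share the file offset, so rewind before reading
    seek($fh, 0, 0);
    my $stdout = do { local $/; <$fh> } // '';
    $stdout .= "\n$diag_output" if defined $diag_output;
    return { timed_out => $timed_out, result_stdout => $stdout };
}

# With padb installed, the callback could shell out as the patch does:
#   sub { my ($pid) = @_;
#         scalar qx{padb --config-option rmgr=mpirun --full-report=$pid} }
# Here a stand-in callback is used so the sketch runs without padb.
my $res = run_with_diagnostic(
    [$^X, '-e', '$| = 1; print "working\n"; sleep 60'],
    1,
    sub { my ($pid) = @_; "diagnostics for pid $pid would go here\n" },
);
print $res->{result_stdout};
```

Passing the diagnostic as a callback keeps the timeout logic independent of padb, so the same structure can be exercised on a machine where padb is not installed.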
