MTT Devel Mailing List Archives

Subject: Re: [MTT devel] MTT email timeout notification feature
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-06-22 07:37:16

Actually, I think this would be fine for the trunk. Some random notes:

1. It might be nice to move this logic out of the docommand sub itself
and into its own sub.
2. It would also be good to generalize the ps and gdb commands for
systems where those variants are not relevant.
3. It might even be good to develop the backtrace functionality more
generally; backtraces would be really good to capture in the
database.
4. How about having a[n optional] timeout with the sentinel file?
That is, MTT would send the mail, then wait for another timeout
(e.g., 1 hour), and if the sentinel file still exists, MTT would
remove the file and keep going.
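
To make note 4 concrete, here is a rough sketch of the wait-then-give-up
behavior. It is written in Python purely for illustration and is not taken
from the patch or from the MTT client; the function name, its parameters,
and the return strings are all hypothetical.

```python
import os
import time


def wait_for_sentinel(path, timeout_s, poll_s=1.0):
    """Block until the sentinel file is removed or timeout_s elapses.

    Hypothetical helper, not part of MTT. Returns "removed" if someone
    deleted the sentinel by hand, or "expired" if the timeout ran out
    and this function removed the file itself.
    """
    deadline = time.monotonic() + timeout_s
    while os.path.exists(path):
        if time.monotonic() >= deadline:
            try:
                os.remove(path)
            except FileNotFoundError:
                pass  # deleted between the existence check and the remove
            return "expired"
        time.sleep(poll_s)
    return "removed"
```

With something like this, the caller would send the notification mail, call
the helper with the optional timeout (e.g., 3600 seconds), and continue
testing whichever way it returns.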

On Jun 19, 2009, at 2:47 PM, Ethan Mallove wrote:

> Folks,
> I came up with a feature that does not seem quite appropriate for
> the MTT trunk, but may still be useful to someone other than me.
> I have posted a note about it on the MTT wiki:
> Here's the text of the Wiki page:
> We (Sun) were trying to track down a hang in an MPI test that we were
> seeing in our MTT runs which was difficult to reproduce manually. The
> problem is that MTT kills the hanging process before a developer has a
> chance to investigate the issue. To address this, I patched an MTT
> client (see attached patch file) to send out a notification email
> containing the mpirun command line and a GDB backtrace for the hanging
> test. A predefined sentinel file is touched, which can later be
> removed to force MTT to move on and continue testing. Here are the INI
> parameters to activate the timeout email notification:
> * {{{docommand_timeout_sentinel_file}}}
> * {{{docommand_timeout_email_recipient}}}
> Example usage:
> {{{
> $ client/mtt --scratch /foo/bar --file foo.ini \
>     docommand_timeout_sentinel_file=/tmp/mtt-timeout-sentinel-file-\&random_string\(10\) \
>     docommand_timeout_email_recipient=fred.flintsone_at_[hidden],barney.rubble_at_[hidden]
> }}}
> -Ethan

Jeff Squyres
Cisco Systems