MTT Devel Mailing List Archives

Subject: Re: [MTT devel] MTT email timeout notification feature
From: Ethan Mallove (ethan.mallove_at_[hidden])
Date: 2009-06-25 11:23:35


In r1300, I incorporated your notes, except for #2 and #3. I dug
around for a CPAN module for stack traces, but couldn't find anything.
Maybe MTT could bundle a stripped-down version of GDB for the sole
purpose of gathering stack traces? Or is there some other open-source
"stack trace grabber" tool out there?
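
For reference, a minimal sketch of grabbing a stack trace by running a
system-installed GDB in batch mode, instead of bundling one. This is
Python rather than MTT's Perl and is not taken from the patch; the
helper names backtrace_command and grab_backtrace are hypothetical, and
it assumes gdb is on the PATH and is allowed to attach to the process.

```python
import shutil
import subprocess


def backtrace_command(pid, gdb="gdb"):
    """Build a non-interactive GDB command that dumps every thread's stack."""
    # -batch exits after running the commands, -p attaches to a running
    # process, and -ex supplies the command to execute.
    return [gdb, "-batch", "-p", str(pid), "-ex", "thread apply all bt"]


def grab_backtrace(pid, timeout_seconds=60):
    """Return GDB's output for pid, or None if gdb is unavailable or too slow."""
    if shutil.which("gdb") is None:
        return None  # assumption: no fallback tool is configured
    try:
        result = subprocess.run(
            backtrace_command(pid),
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    return result.stdout
```

Keeping the command construction in its own function leaves room to
swap in a different tool on systems where gdb is not relevant.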

-Ethan

On Mon, Jun/22/2009 07:37:16AM, Jeff Squyres wrote:
> Actually, I think this would be fine for the trunk. Some random notes:
>
> 1. It might be nice to move this logic out of the docommand sub itself and
> into its own sub.
> 2. It would also be good to generalize the ps and gdb commands for systems
> where those variants are not relevant.
> 3. It might even be good to develop the backtrace functionality more
> generally -- backtraces would be really good to capture in the database...
> 4. How about having a[n optional] timeout with the sentinel file? That is,
> it'll send a mail, then wait another timeout (e.g., 1 hour), and if the
> sentinel file still exists, mtt will remove the file and keep going.
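
The optional sentinel timeout in note 4 could be sketched roughly as
follows. This is an illustration in Python, not the Perl from the
patch; the function name wait_on_sentinel and its parameters are
hypothetical, and the notification email is assumed to have been sent
before the function is called.

```python
import os
import time


def wait_on_sentinel(path, grace_seconds, poll_seconds=1.0):
    """Block while the sentinel file exists, for at most grace_seconds.

    Returns "removed_by_user" if someone deleted the file, or "expired"
    if the grace period ran out and the file was removed here.
    """
    deadline = time.monotonic() + grace_seconds
    while os.path.exists(path):
        if time.monotonic() >= deadline:
            try:
                os.remove(path)
            except FileNotFoundError:
                pass  # someone removed it between the check and now
            return "expired"
        time.sleep(poll_seconds)
    return "removed_by_user"
```

With grace_seconds=3600 this matches the one-hour example: a developer
can remove the file early to release the run, and otherwise the run
resumes on its own after the grace period.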
>
>
> On Jun 19, 2009, at 2:47 PM, Ethan Mallove wrote:
>
>> Folks,
>>
>> I came up with a feature that does not seem quite appropriate for the
>> MTT trunk, but may still be useful to someone other than me. I have
>> posted a note about it on the MTT wiki:
>>
>> http://svn.open-mpi.org/trac/mtt/wiki/EmailTimeoutNotification
>>
>> Here's the text of the Wiki page:
>>
>> We (Sun) were trying to track down a hang in an MPI test that we were
>> seeing in our MTT runs and that was difficult to reproduce manually. The
>> problem is that MTT kills the hanging process before a developer has a
>> chance to investigate the issue. To address this, I patched an MTT
>> client (see attached patch file) to send out a notification email
>> containing the mpirun command line and a GDB backtrace for the hanging
>> test. MTT also touches a predefined sentinel file, which can later be
>> removed to force MTT to move on and continue testing. Here are the INI
>> parameters that activate the timeout email notification:
>>
>> * {{{docommand_timeout_sentinel_file}}}
>> * {{{docommand_timeout_email_recipient}}}
>>
>> Example usage:
>>
>> {{{
>> $ client/mtt --scratch /foo/bar --file foo.ini
>>
>> docommand_timeout_sentinel_file=/tmp/mtt-timeout-sentinel-file-\&random_string\(10\)
>> docommand_timeout_email_recipient=fred.flintsone_at_[hidden],barney.rubble_at_[hidden]
>> }}}
>>
>> -Ethan
>> _______________________________________________
>> mtt-devel mailing list
>> mtt-devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/mtt-devel
>>
>
>
> --
> Jeff Squyres
> Cisco Systems