MTT Devel Mailing List Archives

From: Josh Hursey (jjhursey_at_[hidden])
Date: 2006-10-27 10:31:44

On Oct 27, 2006, at 7:39 AM, Jeff Squyres wrote:

> On Oct 25, 2006, at 10:37 AM, Josh Hursey wrote:
>> The discussion started with the bug characteristics of v1.2 versus
>> the trunk.
> Gotcha.
>> It seemed from the call that IU was the only institution that can
>> assess this via MTT as no one else spoke up. Since people were
>> interested in seeing things that were breaking I suggested that I
>> start forwarding the IU internal MTT reports (run nightly and
>> weekly) to the development list. This was met by Brian
>> insisting that it would result in "thousands" of emails to the
>> development list. I clarified that it is only 3 - 4 messages a day
>> from IU. However if all other institutions do this then it would be
>> a bunch of email (where 'a bunch' would still be less than
>> 'thousands'). That's how we got to a 'we need a single summary
>> presented to the group' comment. It should be noted that we brought
>> up IU sending to the 'testing_at_[hidden]' list as a bandaid until
>> MTT could do it better.
> How about sending them to me and Ethan?

Sure I can add you both to the list if you like.

>> This single summary can be email or a webpage that people can
>> check. Rich said that he would prefer a webpage, and no one else
>> really had a comment. That got us talking about the current summary
>> page that MTT generates. Tim M mentioned that the current website
>> is difficult to figure out how to get the answers you need. I
>> agree, it is hard [usability] for someone to go to the summary page
>> and answer the question "So what failed from IU last night, and how
>> does that differ from yesterday -- e.g., what regressed and
>> progressed yesterday at IU?". The website is flexible enough to do
>> it, but having a couple of basic summary pages would be nice for
>> basic users. What that should look like we can discuss further.
> Agreed; we aren't super-fond of the current web page, either. Do you
> guys want to have a teleconf to go over the current status of MTT,
> where you want it to go, etc.? I consider IU's input here quite
> important, since you're the ones pushing the boundaries, flexing
> MTT's muscles, etc.

In my previous email I suggested a couple of questions that I would
like a webpage to answer. A teleconf might be good to talk about some
of the various items that IU is trying to do around MTT.

>> The IU group really likes the emails that we currently generate. A
>> plain-text summary of the previous run. I posted copies on the MTT
>> bug tracker here:
>> Currently we have not put the work in to aggregate the runs, so for
>> each ini file that we run we get 1 email to the IU group. This is
>> fine for the moment, but as we add the rest of the clusters and
>> dimensions in the testing matrix we will need MTT to aggregate the
>> results for us and generate such an email.
> Ok.
> We created another ticket yesterday to make a new MTT Reporter (our
> internal plugins) that duplicates this output format. It actually
> shouldn't be that hard -- we don't have to do parsing to get the
> numbers that you're reporting; we have access to the actual data. So
> it's mostly caching the data, calculating the totals that you're
> calculating, and printing in your output format.
> Ethan has some other short tasks to do before he gets to this, but
> it's near the top of the priority list. You can see the current
> workflow on the wiki (this is a living document; it keeps changing as
> requirements, etc. change):

Awesome, thanks! :)

>> So I think the general feel of the discussion is that we need the
>> following from MTT:
>> - A 'basic' summary page providing answers to some general
>> frequently asked queries. The current interface is too advanced for
>> the current users.
> We have the summary.php page, but I personally have never found it
> too useful. :-)
> We're getting towards a full revamp of reporter.php (got some other
> tasks to complete first, but we're definitely starting to think about
> it) -- got any ideas / input? Our "haven't thought about it much
> yet" idea is to be more menu/Q-A driven with a few common queries
> easily available (rather than a huge, complicated single screen).

See previous email for some general ideas. Tim M might have a few
more that he would like to see since he is the one at IU that is
watching the nightly results the closest.

>> - A summary email [in plain-text preferably] similar to the one
>> that IU generated showing an aggregation of the previous night's
>> results for (a) all reporters (b) my institution [so I can track
>> them down and file bugs].
> For the moment, we don't have the dynamic capability for you to login
> to the web page, create a report, and say "mail this to me nightly".
> However, Ethan can make up custom reports on the server quite easily
> -- if you want some IU-specific reports, just file a ticket and
> Ethan can Make It So.

Cool. We'll talk it over and see what we would like.

>> - 1 email a day on the previous night's testing results.
> That's what we intended for the mails that are coming today, but it
> seemed to not be sufficient -- we ended up with 4 nightly mails, one
> for each relevant phase's failures and a 4th for showing stderr of MPI
> installs.
>> Some relevant bugs currently in existence:
>> The other concern is that, given the frequency of testing, as bugs
>> appear from the testing someone needs to make sure the bug tracker
>> is updated. I think the group is unclear about how this is done.
>> Meaning, when MTT identifies a test as failed, who is responsible
>> for putting the bug in the bug tracker?
> At the moment, I've been manually examining the mails every day and
> firing off e-mails to those responsible. However, due to travel last
> week and this week, I've gotten quite behind. :-(

I wonder if there is a way to do something more automated. Probably
too advanced for MTT 2.0 or 3.0, but something to think about. Maybe
tie it in with the bug tracker, so send a "Bug Master Engineer" an
aggregated list of failures that can be easily put into TRAC. Donno..
just an idea to help take the burden off of you.

>> The obvious solution is the institution that identified the bug.
>> [Warning: My opinion] But then that becomes unwieldy for IU since
>> we have a large testing matrix, and would need to commit someone to
>> doing this everyday (and it may take all day to properly track a
>> set of bugs). Also this kind of punishes an institution for testing
>> more instead of providing incentive to test.
> True. I don't know the proper answer to this, either -- I know the
> "Jeff look at e-mail" solution doesn't scale well.
>> ------ Page Break -- Context switch ------
>> In case you all want to know what we are doing here at IU. I
>> attached to this email our planned MTT testing matrix. Currently we
>> have BigRed and Odin running the complete matrix less the BLACS
>> tests. Wotan and Thor will come online as we get more resources to
>> support them.
>> In order to do such a complex testing matrix we have various .ini
>> files that we use. And since some of the dimensions in the matrix
>> are large we break some of the tests into a couple .ini files that
>> are submitted concurrently to have them run in a reasonable time.
>> <MTT-testing-matrix.txt>
> Awesome.
> I would like to schedule some phone time with you guys and Ethan and
> me to talk about what's working, what's not working, etc. One
> obvious question I have is: is the INI config file format suitable?
> Do we need to do something more complex that would allow
> consolidation of your various configurations? ...etc.

Tim M and I spent the better part of two days revamping our current
setup to do some more 'advanced' things (Parallel builds, etc...). We
are putting all of these scripts in ompi-tests/iu/mtt in case anyone
wants to see how we are doing it and use that as an example for doing
something similar.

Basically our problems are:
  - Testing results come in at various times as they complete; we
would really like a 'status report' at 8 am every day, finished or not.
  - The combinatorial nature of MTT lends itself to some obvious
parallelism. Can we harness that to reduce the time to complete the
testing cycle?
  - We will soon have 4 clusters [wotan, bigred, odin, thor] each
running 3 branches [trunk, v1.2, v1.1], 2 different builds [64 bit
gcc, 32 bit gcc] every night! That's 24 sets of the nightly tests, and
we have biweekly tests in there as well :o. That means a lot of ini
files that basically say the same thing.
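
To make the arithmetic concrete, here is a rough sketch that enumerates
the nightly matrix as a Cartesian product. The cluster and branch names
come from the list above; the build labels and the nightly_matrix()
helper are made up for illustration and are not part of MTT.

```python
from itertools import product

# Dimensions taken from the list above.
CLUSTERS = ["wotan", "bigred", "odin", "thor"]
BRANCHES = ["trunk", "v1.2", "v1.1"]
# Build labels are illustrative placeholders, not MTT identifiers.
BUILDS = ["gcc-64bit", "gcc-32bit"]


def nightly_matrix():
    """Return every (cluster, branch, build) combination run each night."""
    return list(product(CLUSTERS, BRANCHES, BUILDS))


if __name__ == "__main__":
    matrix = nightly_matrix()
    print(len(matrix), "nightly test sets")
    for cluster, branch, build in matrix:
        print(cluster, branch, build)
```

Each tuple corresponds to one set of nightly tests, which is why
near-identical ini files multiply so quickly as dimensions are added.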

What we are trying to do:
  - Generalize the INI files with default sets that can be plugged in.
  - Make the scripts more general so they can be used easily across
all clusters
  - Reduce the number of emails to at most 2 from the nightly runs
per cluster [Progress, and Final] -- we are not using SLURM, LL, and
a hostlist in our runs.
  - Increase the parallelism per stage as much as possible, in as
general a way as possible.
  - 8 am (or 10 am) status report from our script to check on the run
as it goes.
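
As a minimal sketch of the "default sets that can be plugged in" idea,
shared defaults can be layered under a small per-cluster override file.
This uses Python's standard configparser and assumes nothing about
MTT's own INI handling: the section name, the keys, and the
load_config() helper are all hypothetical.

```python
import configparser

# Hypothetical shared defaults. The section and key names are
# illustrative only and are NOT real MTT INI settings.
DEFAULTS = """
[run]
branch = trunk
bitness = 64
compiler = gcc
"""

# A small per-cluster fragment that states only what differs.
ODIN_OVERRIDES = """
[run]
bitness = 32
"""


def load_config(*ini_texts):
    """Merge INI fragments in order; later fragments override earlier ones."""
    parser = configparser.ConfigParser()
    for text in ini_texts:
        parser.read_string(text)
    return parser


if __name__ == "__main__":
    cfg = load_config(DEFAULTS, ODIN_OVERRIDES)
    for key, value in cfg["run"].items():
        print(key, "=", value)
```

With this layering, each cluster/branch/build combination only needs a
fragment listing its differences rather than a full copy of the file.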

We already have a list of refinements that we would like to add to
this new script setup, but those are a bit more advanced (e.g., using
a mgr/worker model to use allocations as they become available, using
a queue to order the tests into the most important, etc.)

One thing that would be nice for MTT to do, though it would initially
be institution-specific, is a custom trigger for aggregation from
the MTT server. The problem is that we currently get 2 emails from
each cluster every night (this does not include the weekly runs) so
that will be 8 emails a day, which can be a bit hard to parse. If we
put the aggregation code close to the server (or just had a way for
us to query the DB from the IU side via ODBC stuff) then we could
have the aggregation function generate 2 emails which include results
from all clusters. So 2 giant emails instead of 8 smaller emails.
Just an idea, but if you gave me the information so that I can send
queries to the MTT database I can mock up something and we can all
experiment with it to see if we can generalize a bit. Obviously the
'guest' user access to the DB that this aggregation function will use
would only have read access since we don't want it modifying the DB.
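
Something like the following is what I have in mind for the mock-up.
It is only a sketch under loud assumptions: the table layout, the
column names, and the numbers are placeholders invented for the
example (I do not know the real MTT schema), and an in-memory SQLite
database stands in for the server. The aggregation itself issues
SELECT statements only, matching the read-only 'guest' access.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory stand-in for the results DB (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE results "
        "(cluster TEXT, phase TEXT, passed INTEGER, failed INTEGER)"
    )
    # Placeholder numbers, not real test results.
    conn.executemany(
        "INSERT INTO results VALUES (?, ?, ?, ?)",
        [
            ("odin", "run", 120, 3),
            ("odin", "build", 10, 0),
            ("bigred", "run", 98, 7),
            ("bigred", "build", 9, 1),
        ],
    )
    return conn


def aggregate_summary(conn):
    """Return one plain-text summary covering all clusters (SELECT only)."""
    rows = conn.execute(
        "SELECT cluster, SUM(passed), SUM(failed) FROM results "
        "GROUP BY cluster ORDER BY cluster"
    ).fetchall()
    lines = [f"{'cluster':<10}{'passed':>8}{'failed':>8}"]
    for cluster, passed, failed in rows:
        lines.append(f"{cluster:<10}{passed:>8}{failed:>8}")
    total_passed = sum(row[1] for row in rows)
    total_failed = sum(row[2] for row in rows)
    lines.append(f"{'total':<10}{total_passed:>8}{total_failed:>8}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(aggregate_summary(make_demo_db()))
```

The returned string could be used as the body of a single email that
covers every cluster, instead of one email per cluster.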

So generally, yeah, I think we would like to have a teleconf to talk
about our experiences with MTT, and what we have done around it to
fit our needs. We realize that we are pushing it a bit further than
others, so we are fine with doing a home brewed solution for a while
until MTT is able to replicate the functionality.


> --
> Jeff Squyres
> Server Virtualization Business Unit
> Cisco Systems

Josh Hursey