From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2006-06-29 21:02:18


> -----Original Message-----
> From: mtt-users-bounces_at_[hidden]
> [mailto:mtt-users-bounces_at_[hidden]] On Behalf Of Andrew Friedley
> Sent: Thursday, June 29, 2006 1:06 PM
> To: General user list for the MPI Testing Tool
> Subject: Re: [MTT users] Test output to perfbase
>
> > Which efficiency? Uploads? Database storage? Querying?
>
> Primarily database - each run is stored in postgres as a table in a
> database. Fields that vary are stored as rows in the database - one
> row has all the varying fields, i.e., each field is a column. I
> think non-varying fields are stored once as a row in a special,
> separate table.

Ok, let's take a concrete example (and our first target): the intel test
suite. There are hundreds of tests that each return "pass" or "fail".
Specifically: these are correctness tests (MPICH and IBM tests fall into
this category as well).

If we lump all the intel tests into a single "run", don't we have to
have a column for each intel test, and therefore have a specific input
.xml for the intel tests? If so, then I agree that a single run should
be all the intel tests, because we'd want to have a value for every
column.
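
Just to make sure we're picturing the same thing, here's roughly what that
would look like (a sketch in Python pseudo-code; the field and test names
are made up for illustration, not anything that exists in MTT today):

    # Sketch only: one "run" = one row, with a column for every intel test.
    # The field and test names are made up for illustration.
    wide_run = {
        "mpi_install": "ompi-nightly-trunk",   # fixed identifier fields
        "arch":        "x86_64",
        "MPI_Send_basic_c":     "pass",        # one column per test...
        "MPI_Isend_overtake_c": "pass",
        "MPI_Allreduce_user_c": "fail",
        # ...and so on for hundreds of tests
    }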

But I thought that the plan for correctness tests was to have *one* XML
that would work for any correctness test (intel, IBM, MPICH, ...). This
XML would have a bunch of identifier fields (some fixed for the test
suite, like the architecture, OS, etc., and others that are varying,
such as the individual test name, etc.) and then a "pass/fail" field.
Hence, a single "run" would be a single test, and its main output data
would be "pass/fail" (and probably stdout/stderr). This is certainly the way
that inp_test_run.xml is written right now (and this is why I'm confused
that you're saying something different).

Specifically -- if a "run" is a single row, the way that
inp_test_run.xml is currently structured, I don't see any difference in
whether we submit 1 or multiple results at the same time -- they're
still going to be multiple rows in the database.
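
Concretely, a single result the way I read inp_test_run.xml would be shaped
roughly like this (again just a sketch; the field names are placeholders,
not necessarily the exact ones in the .xml):

    # Sketch only: one row per individual test result; the field names are
    # placeholders, not necessarily what inp_test_run.xml defines.
    one_result = {
        # fixed for the whole suite run
        "mpi_install": "ompi-nightly-trunk",
        "arch":        "x86_64",
        "os":          "Linux",
        # varying per test
        "test_suite":  "intel",
        "test_name":   "MPI_Send_basic_c",
        "mpirun_cmd":  "mpirun -np 4 --mca btl tcp,self ...",
        # result data
        "result":      "pass",
        "stdout":      "",
        "stderr":      "",
    }
    # Submitting N of these in one shot or one at a time still ends up as
    # N rows in the database.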

> > In a conversation with Sun, it turns out that we both want to have
> > the ability to see partial results (e.g., running the entire Intel
> > suite may take many hours -- it would be good to be able to see
> > results more-or-less as they occur). Is there a technical issue
> > that would prevent submitting 1 (or small batches of) result(s) at
> > a time?
>
> I think we're getting into the realm of 'too much' here. Both the

Why? There is certainly [a lot of] value in being able to check on the
intermediate status of a test suite if the entire test suite takes many
hours to run. Keep in mind that we're only using this for nightly runs
in the beginning -- it's likely that MTT is going to run *more often*
than nightly at times (e.g., during release cycles). Starting a run at 8am
and then checking on it at 10am can certainly be useful (even if it's
got 6 more hours to run).
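
To be concrete about what I mean, the client could submit in small batches
as tests complete, something like the following (a sketch only -- the URL,
form fields, and run_intel_tests() are placeholders, not the real MTT
interfaces):

    # Sketch only: report results in small batches as they finish instead
    # of one giant submit at the end.  URL / field names are placeholders.
    import json
    import urllib.parse
    import urllib.request

    SUBMIT_URL = "http://example.org/mtt/submit/"     # placeholder

    def run_intel_tests():
        """Placeholder: yield one result dict per test as it completes."""
        yield {"test_name": "MPI_Send_basic_c", "result": "pass"}

    def submit_batch(results):
        """POST one small batch of results to the server."""
        form = urllib.parse.urlencode(
            {"results": json.dumps(results)}).encode()
        with urllib.request.urlopen(SUBMIT_URL, form) as reply:
            return reply.status

    batch = []
    for result in run_intel_tests():
        batch.append(result)
        if len(batch) >= 25:        # e.g., report every 25 tests
            submit_batch(batch)
            batch = []
    if batch:
        submit_batch(batch)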

> current design and especially your new proposed design are batch
> oriented, not stream oriented. Heck, MTT in general.

MTT does large groups of tasks at a given time, but it can report at any
time (as it does now).

> This is doable, but when we start hammering/scaling this system,
> getting as much information as possible in a perfbase run is going to
> be very important. I remember Brian agreeing with me that
> tens/hundreds of thousands of tables in postgres is a bad idea.

Based on inp_test_run.xml, I don't see how this implies any more tables.
 
> >>>1. send all results from the above intel run in a single submit
> >>>(i.e., all tcp and all openib results). Since we submit the MCA
> >>>params as part of the data, our queries later can distinguish tcp
> >>>vs. openib data.
> >>This is what I want. We can easily construct queries to only
> >
> > So I guess I'm still not clear on *why* you want this. :-) Can you
> > specify the reasons?
>
> Well, I think having a test suite run with all its variations
> interpreted as a single perfbase run makes sense. We could certainly
> draw the line elsewhere, but I think it's appropriate that a test
> suite run with a particular mpi install on a particular system makes
> a suitable base unit. It matches both the MTT and perfbase
> architectures well - we can support this easily in MTT, it scales
> well in perfbase, doesn't compromise our query ability, and just
> plain gets the job done.

I think we can make MTT do whatever we want it to do (indeed, right now,
it reports each individual test, so the argument of "get the job done"
could work more in my favor than yours ;-) ). And I think perfbase will
do whatever we define it to do as well -- indeed, the way that
int_mpi_test.xml is written, I think we have defined it to take individual
correctness tests.
 
> > Yes, that information (tcp vs. openib) is in one of the fields that
> > we send back (it has to be, otherwise the results are somewhat
> > meaningless). It's not a standalone "btl" field, though -- it's
> > more of a "here's the MCA parameters that were specified" field.
> > So queries for tcp results will probably need to search for "tcp"
> > in the MCA parameters field.
> >
> > But this is the same issue regardless of whether we submit 1 result
> > at a time or all at once, right? I guess I don't see the
> > difference for selecting "tcp" vs. "openib" results based on
> > whether we submit 1 result at a time or all at once -- can you
> > clarify? I think I must be missing something...
>
> I think you missed what I was saying - picking which BTL was used for
> any kind of storage differentiation just seems completely arbitrary
> to me. Not only that, it's Open MPI specific, or do we not care
> about being MPI agnostic any more?

The "here's the MCA parameters that were specified" field is the mpirun
command line. So this is MPI agnostic.
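
In other words, picking out "tcp" vs. "openib" results is just a substring
match on that command-line field, regardless of how the results were
submitted -- something like this (Python pseudo-query, not perfbase syntax;
the field names are placeholders):

    # Sketch only: filter results by what was on the mpirun command line.
    # "all_results" would come from a perfbase query; field names are
    # placeholders.
    all_results = []    # e.g., rows pulled out of perfbase

    def used_tcp(result):
        return "btl tcp" in result.get("mpirun_cmd", "")

    tcp_failures = [r for r in all_results
                    if used_tcp(r) and r.get("result") == "fail"]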
 
> You're right in that it doesn't have much/anything to do with the
> split submission issue.

Ok.
 
> > If this is all possible, then -- at least in my mind -- I don't see
> > a reason why multiple submits vs. a single submit is *required*.
> > Obviously, multiple submits will take more bandwidth than a single
> > submit -- but I see that as an optimization that we can [and should]
> > work out later. Specifically: reducing the bandwidth of submits
> > doesn't need to be in the initial version since our primary,
> > immediate goal is to get this functional, running nightly tests, and
> > sending out test results in the morning as long as the current,
> > unoptimized bandwidth requirements are not too onerous on milliways.
>
> Well, the direction I was going with the server side was that a test
> suite run would be in one HTTP POST. As far as I'm concerned it's a
> matter of writing code to do it differently, and how soon you want
> this to work.

Ah -- you weren't clued in on what happened in MA last week. :-)

Because MTT has been identified as a critical deficiency for the project,
Sun is volunteering 2 engineers to work on MTT, starting in the immediate
future, to get it finished and working for the whole group. Hence, MTT is
finally no longer a "spare time" project.

I'm doing some final cleanup of some things that I was aware of and
ramping them up in the code base so that the Sun engineers can take over,
take MTT to completion, and we can all start using it. We got SSH accounts
for
Ethan and Anya on milliways today so that they can access perfbase,
write web pages, etc.

I was given 2 weeks by my management (this week and next week) for this
task, which is why you've seen commits from me and why I've been asking
questions about the server side of MTT. :-)

Specifically, I have finished most of the client-side work that I wanted
to get done. The next critical tasks that need to get done are (in
order):

- audit the fields being sent between the client and the server for all
the relevant phases and ensure they match (Brian found some problems in
this area)
- implement whatever client-side/server-side support is necessary/missing
(e.g., compression, fragmenting/reassembly, etc.) to deal with max apache
upload size issues (see the sketch after this list)
- ensure that all submissions are getting all the way from running in
the MTT client to being stored in perfbase
- write a simple query to show failures from the last 24 hours and send
an e-mail with the results
- complete the "trim" phase code
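
For the compression/fragmenting item above, I'm picturing something along
these lines on the client side (a sketch only -- the endpoint, form fields,
and reassembly scheme are placeholders, not an agreed-upon design):

    # Sketch only: compress the payload and split it into fragments small
    # enough to stay under Apache's max upload size; the server would
    # reassemble by (submit_id, fragment number).  Everything here --
    # endpoint, field names, limits -- is a placeholder.
    import base64
    import json
    import urllib.parse
    import urllib.request
    import uuid
    import zlib

    SUBMIT_URL = "http://example.org/mtt/submit/"   # placeholder
    MAX_FRAGMENT = 512 * 1024                       # stay under the limit

    def submit(results):
        compressed = zlib.compress(json.dumps(results).encode())
        blob = base64.b64encode(compressed).decode()
        pieces = [blob[i:i + MAX_FRAGMENT]
                  for i in range(0, len(blob), MAX_FRAGMENT)]
        submit_id = str(uuid.uuid4())
        for n, piece in enumerate(pieces):
            form = urllib.parse.urlencode({
                "submit_id": submit_id,
                "fragment":  n,
                "total":     len(pieces),
                "data":      piece,
            }).encode()
            urllib.request.urlopen(SUBMIT_URL, form)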

Sun is on a company-wide annual shutdown all of next week, which works
perfectly with my timescale. When they return, Ethan and Anya will pick
up where I left off. Once the above list is complete, they'll continue
on and do all the other stuff in MTT (e.g., writing query web pages for
more interesting reports than just showing the failures from the last 24
hours, adding more test suite modules, adding more client features,
etc.).

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems