On Mon, Nov/13/2006 10:56:06AM, Josh Hursey wrote:
> On Nov 13, 2006, at 10:27 AM, Ethan Mallove wrote:
> >I can infer that you have an MPI Install section labeled
> >"odin 64 bit gcc". A few questions:
> >* What is the mpi_get for that section (or how does that
> > parameter get filled in by your automated scripts)?
> I attached the generated INI file for you to look at.
> It is the same value for all parallel runs of GCC+64bit (same value
> for all branches)
> >* Do you start with a fresh scratch tree every run?
> Yep. Every run, and all of the parallel runs.
> >* Could you email me your scratch/installs/mpi_installs.xml
> > files?
> <mpi_get simple_section_name="ompi-nightly-trunk">
> <mpi_version version="1.3a1r12559">
> <mpi_install simple_section_name="odin 64 bit gcc"
> configure_arguments="FCFLAGS=-m64 FFLAGS=-m64 CFLAGS=-m64 CXXFLAGS=-m64 --with-wrapper-cflags=-m64 --with-wrapper-cxxflags=-m64 --with-wrapper-fflags=-m64 --with-wrapper-fcflags=-m64"
> full_section_name="mpi install: odin 64 bit gcc"
> mpi_details="Open MPI"
> mpi_get_full_section_name="mpi get: ompi-nightly-trunk"
> vpath_mode="none" />
> The attached mpi_installs.xml is from the trunk+gcc+64bit parallel
> scratch directory.
> >I checked on how widespread this issue is, and found that
> >18,700 out of 474,000 Test Run rows in the past month have a
> >mpi_version/command (v1.2-trunk) mismatch. It occurs in both
> >directions (version=1.2, command=trunk and vice versa).
> >They occur on these clusters:
> > Cisco MPI development cluster
> > IU Odin
> > IU - Thor - TESTING
> >There *is* that race condition in which one MTT submission
> >could overwrite another's index. Do you have "trunk" and
> >"1.2" runs submitting to the database at the same time?
> Yes we do. :(
> The parallel blocks as we call them are separate scratch directories
> in which MTT is running concurrently. Meaning that we have N parallel
> block scratch directories each running one instance of MTT. So it is
> possible (and highly likely) that when the reporter phase fires, all
> of the N parallel blocks are firing it at about the same time.
Likely because the MTT runs start at the same time? Or because you
do the [Reporter:IU database] section for trunk/v1.2 at the same time?
> Without knowing how the reporter is doing the inserts into the
> database I don't think I can help much more than that on debugging.
> When the reporter fires for the DB:
> - Does it start a transaction for the connection, do the inserts,
> then commit?
> - Does it ship the inserts to the server then allow it to run them,
> or does the client do all of the individual inserts?
lib/MTT/Reporter/MTTDatabase.pm HTTP POSTs the results to
server/php/submit/index.php. index.php iterates over all of
the results and INSERTs them one at a time, but for each
result it checks to see if that MPI Install (hardware, os,
mpi_version, ...) is already in the database. If it is, it
reuses that existing row; otherwise it creates a new row
(and the problem is that the SELECT/INSERT is not atomic).
I'm having a tough time tracking down the race condition in
the postgres logs, so I'm going to change that index to a
serial type now, and see if the problem goes away.
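
For illustration only, here is a minimal sketch of the check-then-insert
pattern described above and one way to make it atomic. It uses Python's
sqlite3 module rather than the PHP/PostgreSQL code in index.php, and the
table and column names (mpi_install, platform, mpi_version) are
assumptions, not the real MTT schema. The idea is to let the database
assign the id and enforce uniqueness, so two submitters cannot both
create a row for the same install.

```python
import sqlite3

def open_db(path=":memory:"):
    """Create a toy table; the names are hypothetical, not the MTT schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS mpi_install (
               id INTEGER PRIMARY KEY AUTOINCREMENT,  -- assigned by the database
               platform TEXT NOT NULL,
               mpi_version TEXT NOT NULL,
               UNIQUE (platform, mpi_version)
           )"""
    )
    return conn

def racy_get_or_create(conn, platform, mpi_version):
    """SELECT, then INSERT: another client can insert between the two steps."""
    row = conn.execute(
        "SELECT id FROM mpi_install WHERE platform = ? AND mpi_version = ?",
        (platform, mpi_version),
    ).fetchone()
    if row:
        return row[0]
    cur = conn.execute(
        "INSERT INTO mpi_install (platform, mpi_version) VALUES (?, ?)",
        (platform, mpi_version),
    )
    conn.commit()
    return cur.lastrowid

def atomic_get_or_create(conn, platform, mpi_version):
    """Insert first, let the UNIQUE constraint absorb duplicates, then read back."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO mpi_install (platform, mpi_version) VALUES (?, ?)",
            (platform, mpi_version),
        )
        row = conn.execute(
            "SELECT id FROM mpi_install WHERE platform = ? AND mpi_version = ?",
            (platform, mpi_version),
        ).fetchone()
    return row[0]
```

Calling atomic_get_or_create() twice with the same arguments returns the
same id, and a different version gets a different id. Whether the real
fix looks like this depends on how index.php builds its index, which
this sketch does not try to reproduce.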
> -- Josh
> >On Sun, Nov/12/2006 06:04:17PM, Jeff Squyres (jsquyres) wrote:
> >> I feel somewhat better now. Ethan - can you fix?
> >> -----Original Message-----
> >> From: Tim Mattox [mailto:timattox_at_[hidden]]
> >> Sent: Sunday, November 12, 2006 05:34 PM Eastern Standard Time
> >> To: General user list for the MPI Testing Tool
> >> Subject: [MTT users] Corrupted MTT database or incorrect query
> >> Hello,
> >> I just noticed that the MTT summary page is presenting
> >> incorrect information for our recent runs at IU. It is
> >> showing failures for the 1.2b1 that actually came from
> >> the trunk! See the first entry in this table:
> >> http://www.open-mpi.org/mtt/reporter.php?
> >> 6-11-12%2019:12:02%20through%202006-11-12%
> >> est_case&go=Table&maf_agg_timestamp=-
> >> platform_id=All&agg_platform_id=off&1-
> >> ks
> >> Click on the [i] in the upper right (the first entry)
> >> to get the popup window which shows the mpirun cmd as:
> >> mpirun -mca btl tcp,sm,self -np 6 --prefix
> >> /san/homedirs/mpiteam/mtt-runs/odin/20061112-Testing-NOCLN/
> >> ock-3/installs/ompi-nightly-trunk/odin_64_bit_gcc/1.3a1r12559/
> >> dynamic/spawn
> >> Note the path has "1.3a1r12559" in the
> >> name... it's a run from the trunk, yet the table showed
> >> this as a 1.2b1 run. There are several of these
> >> misattributed errors. This would explain why Jeff saw
> >> some ddt errors on the 1.2 branch yesterday, but was
> >> unable to reproduce them. They were from the trunk!
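
A mismatch like this can be checked mechanically. The following is a
rough sketch, not part of MTT: it assumes the install prefix follows the
layout visible in the command above
(.../installs/<mpi_get name>/<mpi_install name>/<version>/...), the
function names are made up for illustration, and the example path is a
hypothetical one modeled on that command.

```python
import re

# Assumed layout: .../installs/<mpi_get name>/<mpi_install name>/<version>/...
_PREFIX_RE = re.compile(r"/installs/([^/\s]+)/([^/\s]+)/([^/\s]+)")

def version_from_command(command):
    """Return the version directory embedded in the install path, or None."""
    match = _PREFIX_RE.search(command)
    return match.group(3) if match else None

def is_misattributed(reported_version, command):
    """True when the command's install path names a different version."""
    found = version_from_command(command)
    return found is not None and found != reported_version

# Hypothetical command; the "/scratch" prefix and trailing "install" are invented.
cmd = ("mpirun -mca btl tcp,sm,self -np 6 --prefix "
       "/scratch/installs/ompi-nightly-trunk/odin_64_bit_gcc/1.3a1r12559/install "
       "dynamic/spawn")
print(is_misattributed("1.2b1", cmd))        # True
print(is_misattributed("1.3a1r12559", cmd))  # False
```

A check along these lines only flags rows whose command carries an
install path; rows without one are left alone rather than guessed at.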
> >> --
> >> Tim Mattox - http://homepage.mac.com/tmattox/
> >> tmattox_at_[hidden] || timattox_at_[hidden]
> >> I'm a bright... http://www.the-brights.net/
> >> _______________________________________________
> >> mtt-users mailing list
> >> mtt-users_at_[hidden]
> >> http://www.open-mpi.org/mailman/listinfo.cgi/mtt-users
> Josh Hursey