From: Ethan Mallove (ethan.mallove_at_[hidden])
Date: 2006-11-13 13:19:40


On Mon, Nov/13/2006 10:56:06AM, Josh Hursey wrote:
>
> On Nov 13, 2006, at 10:27 AM, Ethan Mallove wrote:
>
> >I can infer that you have an MPI Install section labeled
> >"odin 64 bit gcc". A few questions:
> >
> >* What is the mpi_get for that section (or how does that
> > parameter get filled in by your automated scripts)?
>
> I attached the generated INI file for you to look at.

>
> It is the same value for all parallel runs of GCC+64bit (the same
> value for all branches).
>
>
> >* Do you start with a fresh scratch tree every run?
>
> Yep. Every run, and all of the parallel runs.
>
> >* Could you email me your scratch/installs/mpi_installs.xml
> > files?
>

> <mpi_installs>
>   <mpi_get simple_section_name="ompi-nightly-trunk">
>     <mpi_version version="1.3a1r12559">
>       <mpi_install simple_section_name="odin 64 bit gcc"
>                    append_path=""
>                    bindir="/san/homedirs/mpiteam/mtt-runs/odin/20061112-Nightly/parallel-block-1/installs/ompi-nightly-trunk/odin_64_bit_gcc/1.3a1r12559/install/bin"
>                    c_bindings="1"
>                    compiler_name="gnu"
>                    compiler_version="3.4.6"
>                    configure_arguments="FCFLAGS=-m64 FFLAGS=-m64 CFLAGS=-m64 CXXFLAGS=-m64 --with-wrapper-cflags=-m64 --with-wrapper-cxxflags=-m64 --with-wrapper-fflags=-m64 --with-wrapper-fcflags=-m64"
>                    cxx_bindings="1"
>                    f77_bindings="1"
>                    f90_bindings="1"
>                    full_section_name="mpi install: odin 64 bit gcc"
>                    installdir="/san/homedirs/mpiteam/mtt-runs/odin/20061112-Nightly/parallel-block-1/installs/ompi-nightly-trunk/odin_64_bit_gcc/1.3a1r12559/install"
>                    libdir="/san/homedirs/mpiteam/mtt-runs/odin/20061112-Nightly/parallel-block-1/installs/ompi-nightly-trunk/odin_64_bit_gcc/1.3a1r12559/install/lib"
>                    merge_stdout_stderr="1"
>                    mpi_details="Open MPI"
>                    mpi_get_full_section_name="mpi get: ompi-nightly-trunk"
>                    mpi_get_simple_section_name="ompi-nightly-trunk"
>                    mpi_version="1.3a1r12559"
>                    prepend_path=""
>                    result_message="Success"
>                    setenv=""
>                    success="1"
>                    test_status="installed"
>                    timestamp="1163316821"
>                    unsetenv=""
>                    vpath_mode="none" />
>     </mpi_version>
>   </mpi_get>
> </mpi_installs>

> The attached mpi_installs.xml is from the trunk+gcc+64bit parallel
> scratch directory.
>
> >
> >I checked on how widespread this issue is, and found that
> >18,700 out of 474,000 Test Run rows in the past month have an
> >mpi_version/command (v1.2 vs. trunk) mismatch, occurring in both
> >directions (version=1.2, command=trunk and vice versa).
> >They occur on these clusters:
> >
> > Cisco MPI development cluster
> > IU Odin
> > IU - Thor - TESTING
> >
>
> Interesting...
>
> >There *is* that race condition in which one mtt submission
> >could overwrite another's index. Do you have "trunk" and
> >"1.2" runs submitting to the database at the same time?
>
> Yes we do. :(
>
> The parallel blocks, as we call them, are separate scratch
> directories in which MTT runs concurrently. That is, we have N
> parallel-block scratch directories, each running one instance of
> MTT. So it is possible (and highly likely) that all N parallel
> blocks fire the reporter phase at about the same time.
>

Likely, because the mtt runs start at the same time? Or because you
do the [Reporter:IU database] section for trunk/v1.2 at the same time?

> Without knowing how the reporter is doing the inserts into the
> database, I don't think I can help much more than that with debugging.
> When the reporter fires for the DB:
> - Does it start a transaction for the connection, do the inserts,
> then commit?
> - Does it ship the inserts to the server then allow it to run them,
> or does the client do all of the individual inserts?
>

lib/MTT/Reporter/MTTDatabase.pm HTTP POSTs the results to
server/php/submit/index.php. index.php iterates over all of
the results and INSERTs them one at a time, but for each
result it first checks whether that MPI Install (hardware, os,
mpi_version, ...) is already in the database. If it is, it
reuses the existing row; otherwise it creates a new row (and
the problem is that the SELECT/INSERT pair is not atomic on
that index).
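
In SQL terms, the pattern is roughly the sketch below. To be
clear, the table and column names here are hypothetical
stand-ins, not the actual MTT schema:

    -- Hypothetical sketch of the non-atomic check-then-insert.
    -- Two concurrent submits can both run the SELECT, both find
    -- no row, and both INSERT under the same client-chosen index.
    SELECT mpi_install_id
      FROM mpi_install
     WHERE platform_id = 'IU Odin'
       AND mpi_version = '1.3a1r12559';

    -- No row came back, so the submit code picks the next index
    -- itself, based on what it just read:
    INSERT INTO mpi_install (mpi_install_id, platform_id, mpi_version)
         VALUES (42, 'IU Odin', '1.3a1r12559');

If two parallel blocks interleave between the SELECT and the
INSERT, a trunk result can land under a 1.2 install's index (and
vice versa), which would match the mismatches above.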

I'm having a tough time tracking down the race condition in
the postgres logs, so I'm going to change that index to a
serial type now and see if the problem goes away.
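
For the archives, the change amounts to something like this
(again against the hypothetical table above):

    -- Let postgres hand out the index atomically from a sequence
    -- instead of deriving it from a prior SELECT.
    CREATE SEQUENCE mpi_install_id_seq;
    ALTER TABLE mpi_install
      ALTER COLUMN mpi_install_id
      SET DEFAULT nextval('mpi_install_id_seq');

With that, concurrent submits can no longer race each other to
the same index value.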

> -- Josh
>
> >
> >
> >On Sun, Nov/12/2006 06:04:17PM, Jeff Squyres (jsquyres) wrote:
> >>
> >> I feel somewhat better now. Ethan - can you fix?
> >> -----Original Message-----
> >> From: Tim Mattox [[1]mailto:timattox_at_[hidden]]
> >> Sent: Sunday, November 12, 2006 05:34 PM Eastern Standard Time
> >> To: General user list for the MPI Testing Tool
> >> Subject: [MTT users] Corrupted MTT database or
> >> incorrect query
> >> Hello,
> >> I just noticed that the MTT summary page is presenting
> >> incorrect information for our recent runs at IU. It is
> >> showing failures for the 1.2b1 that actually came from
> >> the trunk! See the first entry in this table:
> >> http://www.open-mpi.org/mtt/reporter.php?&maf_start_test_timestamp=2006-11-12%2019:12:02%20through%202006-11-12%2022:12:02&ft_platform_id=contains&tf_platform_id=IU&maf_phase=runs&maf_success=fail&by_atom=*by_test_case&go=Table&maf_agg_timestamp=-&mef_mpi_name=All&mef_mpi_version=All&mef_os_name=All&mef_os_version=All&mef_platform_hardware=All&mef_platform_id=All&agg_platform_id=off&1-page=off&no_bookmarks&no_bookmarks
> >> Click on the [i] in the upper right (the first entry)
> >> to get the popup window which shows the MPIRun cmd as:
> >>
> >> mpirun -mca btl tcp,sm,self -np 6 --prefix /san/homedirs/mpiteam/mtt-runs/odin/20061112-Testing-NOCLN/parallel-block-3/installs/ompi-nightly-trunk/odin_64_bit_gcc/1.3a1r12559/install dynamic/spawn
> >>
> >> Note the path has "1.3a1r12559" in the name... it's a
> >> run from the trunk, yet the table showed this as a
> >> 1.2b1 run. There are several of these misattributed
> >> errors. This would explain why Jeff saw some ddt errors
> >> on the 1.2 branch yesterday, but was unable to
> >> reproduce them. They were from the trunk!
> >> --
> >> Tim Mattox - [2]http://homepage.mac.com/tmattox/
> >> tmattox_at_[hidden] || timattox_at_[hidden]
> >> I'm a bright... [3]http://www.the-brights.net/
> >> _______________________________________________
> >> mtt-users mailing list
> >> mtt-users_at_[hidden]
> >> [4]http://www.open-mpi.org/mailman/listinfo.cgi/mtt-users
> >>
> >>References
> >>
> >> 1. mailto:timattox_at_[hidden]
> >> 2. http://homepage.mac.com/tmattox/
> >> 3. http://www.the-brights.net/
> >> 4. http://www.open-mpi.org/mailman/listinfo.cgi/mtt-users
> >
> >
> >
> >--
> >-Ethan
>
> ----
> Josh Hursey
> jjhursey_at_[hidden]
> http://www.open-mpi.org/
>


-- 
-Ethan