Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] This is why we test
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-01-16 12:39:55


We fixed the openib segv, but I forgot to follow up about the timeouts
that I mentioned in my original mail.

The timeouts were from poorly-configured spawn tests. That is, I had
8 cores in the job and ran the spawn test on all 8 cores (all
aggressively polling). The spawn test then spawned N more MPI
processes, each of which also polls [or attempts to poll] heavily.
With more polling processes than cores, the result is obvious
thrashing, and the test doesn't complete before the timeout.

These are obviously poorly configured tests on my part and not a real
problem (I confirmed by re-running the tests with <8 original MPI
procs). So as I mentioned in my prior mail, thumbs up for the v1.3
release from my perspective.
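
For anyone who wants to sanity-check their own spawn test configuration,
the oversubscription described above can be sketched roughly as follows.
This is only back-of-the-envelope arithmetic, not anything MTT or Open MPI
does itself: the function names are made up for illustration, and the
spawn count used in the example (4) is an assumption, since N is not
stated above.

```python
# Rough sketch: how many busy-polling MPI processes compete for each core.
# Function names are hypothetical; the spawn count N is an assumed value.

def polling_procs_per_core(cores: int, parents: int, spawned: int) -> float:
    """Average number of busy-polling processes per core."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return (parents + spawned) / cores


def is_oversubscribed(cores: int, parents: int, spawned: int) -> bool:
    """True when more processes are polling than there are cores."""
    return parents + spawned > cores


if __name__ == "__main__":
    cores = 8
    # (parents, spawned) pairs; spawned=4 is an illustrative N only.
    for parents, spawned in [(8, 4), (4, 4)]:
        ratio = polling_procs_per_core(cores, parents, spawned)
        flag = is_oversubscribed(cores, parents, spawned)
        print(f"parents={parents} spawned={spawned} "
              f"procs/core={ratio:.2f} oversubscribed={flag}")
```

With all 8 cores already occupied by parent procs, any N > 0 puts more
pollers on the node than there are cores; starting with <8 parents leaves
room for the spawned procs.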

On Jan 15, 2009, at 9:05 AM, Jeff Squyres wrote:

> Unfortunately, I have to throw the flag in the v1.3 release. :-(
>
> I ran ~16k tests via MTT yesterday on the rc5 and rc6 tarballs. I
> found the following:
>
> Found test runs: 15962
> Passed: 15785 (98.89%)
> Failed: 83 (0.52%)
> --> Openib failures: 80 (0.50%)
> Skipped: 46 (0.29%)
> Timedout: 48 (0.30%)
>
> The 80 openib failures are all seemingly random segv's. I repeated
> a much smaller run this morning (about 700 runs) and still found a
> non-zero percentage of fails of the same flavor.
>
> The timeouts are a little worrisome as well.
>
> This unfortunately requires investigation. :-(
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Jeff Squyres
Cisco Systems