
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] SM init failures
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2009-03-27 01:39:12


Ralph Castain wrote:

> You are correct - the Sun errors are in a version prior to the
> insertion of the SM changes. We didn't relabel the version to 1.3.2
> until -after- those changes went in, so you have to look for anything
> with an r number >= 20839.
>
> The sif errors are all in that group - I would suggest starting there.
>
> I suspect Josh or someone at IU could tell you the compiler. I would
> be very surprised if it wasn't gcc, but I don't know what version. I
> suspect they could even find a way to run some debugging on for you,
> if that would help.

Okay, right now I'm not worried about the compiler.

My attorneys advised me not to speak to the public, but I'll share this
prepared statement with you. :^)

I don't claim my code is clean. Honestly, there was sm BTL code that worried
me, and I can't claim to have "done no worse" in the changes I made. But this
spate of test failures doesn't indict me. (Geez, sorry for being so defensive.
I guess I just worry myself!)

Let's start with the recent test results you indicated. Say,
http://www.open-mpi.org/mtt/index.php?do_redir=973, which shows these
failures:

  143 on IU_Sif
   28 on Sun/Linux (row #6 at that URL, I guess, but you said 34?)
    3 on Sun/SunOS (row #7)

But I guess we agreed that the Sun/Linux and Sun/SunOS failures are with
1.3.1 and are therefore not attributable to the single-queue changes.

So now we look at recent history for IU_Sif. E.g.,
http://www.open-mpi.org/mtt/index.php?do_redir=975
Here is what I see:

    #  MPI name            MPI version     MPI install  Test build   Test run       pass:fail
                                            Pass  Fail   Pass  Fail   Pass    Fail   ratio
    1  ompi-nightly-trunk  1.4a1r20771        6     0     24     0   10585     11     962
    2  ompi-nightly-trunk  1.4a1r20777        6     0     24     0   11880     20     594
    3  ompi-nightly-trunk  1.4a1r20781       12     0     48     0   23759     95     250
    4  ompi-nightly-trunk  1.4a1r20793       12     0     48     0   23822     61     390
    5  ompi-nightly-trunk  1.4a1r20828        8     0     28     8   22893     51     448
    6  ompi-nightly-trunk  1.4a1r20834        6     0     20     4   11442     55     208
    7  ompi-nightly-trunk  1.4a1r20837       18     0     72     0   34084    157     217
    8  ompi-nightly-trunk  1.4a1r20859        2     0     12     0   11900     30     396
    9  ompi-nightly-trunk  1.4a1r20884        6     0     24     0   11843     59     200
   10  ompi-nightly-v1.3   1.3.1rc5r20730    20     0     71     0   25108    252      99
   11  ompi-nightly-v1.3   1.3.1rc5r20794     5     0     18     0    7332    112      65
   12  ompi-nightly-v1.3   1.3.1rc5r20810     5     0     18     0    6813     75      90
   13  ompi-nightly-v1.3   1.3.1rc5r20826    26     0     96     0   37205   3108      11
   14  ompi-nightly-v1.3   1.3.2a1r20855      1     0      6     0     296    107       2
   15  ompi-nightly-v1.3   1.3.2a1r20880      5     0     18     0    5825    143      40

I added that last "pass:fail ratio" column. The run you indicate (row #15)
indeed has a dramatically low pass:fail ratio, but not *THAT* low. E.g., the
first 1.3.1 run we see (row #10) is certainly of the same order of magnitude.
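In case it isn't obvious how I got that extra column: it's just test-run
passes divided by test-run failures, truncated to an integer. A throwaway
sketch (not part of any MTT tooling; the rows are simply copied from the
table above):

  /* Quick throwaway sketch of the "pass:fail ratio" column: test-run
   * passes divided by test-run failures, with integer truncation.
   * The sample rows are 1, 10, and 15 from the table above. */
  #include <stdio.h>

  int main(void)
  {
      int run_pass[] = { 10585, 25108, 5825 };
      int run_fail[] = {    11,   252,  143 };
      int n = sizeof(run_pass) / sizeof(run_pass[0]);

      for (int i = 0; i < n; i++) {
          printf("pass=%5d  fail=%4d  pass:fail=%d\n",
                 run_pass[i], run_fail[i], run_pass[i] / run_fail[i]);
      }
      return 0;
  }

That reproduces the 962, 99, and 40 figures shown above.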

We can compare those two revs in greater detail. I see this:

   r20730:
    #  Suite     np  Pass  Fail
    1  ibm       16     0    32
    2  intel     16     0   123
    3  iu_ft_cr  16     0     3
    4  onesided  10     0    16
    5  onesided  12     0    32
    6  onesided  14     0    24
    7  onesided  16     0    22

   r20880:
    #  Suite     np  Pass  Fail
    1  ibm       16     0    27
    2  intel     16     0    38
    3  iu_ft_cr  16     0     2
    4  onesided   2     0    10
    5  onesided   4     0     9
    6  onesided   6     0     9
    7  onesided   8     0     9
    8  onesided  10     0     9
    9  onesided  12     0    10
   10  onesided  14     0    10
   11  onesided  16     0    10

To me, r20880 doesn't particularly look worse than r20730.

We can deep-dive into some of these results. I looked at the "ibm np=16" and
"onesided np=16" results a lot. Indeed, r20880 shows lots of seg faults in
mca_btl_sm.so. On the other hand, they don't (so far as I can tell) arise in
the add_procs stack. Indeed, many aren't in MPI_Init at all. Some have to do
with librdmacm. In any case, I seem to find very much the same stack traces
for r20730.

I'm still worried that my single-queue code either left a race condition in
the sm BTL start-up or perhaps even made it worse. The recent MTT failures,
however, don't seem to point to that. They seem to point to problems other
than the intermittent segv's that Jeff and Terry were seeing, and the data
does not seem to me to indicate an increased frequency with 1.3.2.
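To be concrete about the kind of start-up race I mean, here is a minimal
illustration. This is NOT the actual sm BTL code; the field names are made
up and just stand in for the real FIFO bookkeeping:

  /* Minimal illustration (NOT the actual sm BTL code) of a shared-memory
   * start-up race: a peer spins on an "initialized" flag, and if the owner
   * publishes that flag before the rest of the segment is written -- or
   * without a write barrier -- the peer can observe a half-initialized
   * segment. */
  #include <stdio.h>
  #include <sys/mman.h>
  #include <sys/wait.h>
  #include <unistd.h>

  typedef struct {
      volatile int initialized;   /* "segment is ready" flag             */
      volatile int fifo_head;     /* stand-ins for the real FIFO fields  */
      volatile int fifo_tail;
  } shared_seg_t;

  int main(void)
  {
      /* An anonymous shared mapping plays the role of the sm backing file. */
      shared_seg_t *seg = mmap(NULL, sizeof(*seg), PROT_READ | PROT_WRITE,
                               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
      if (seg == MAP_FAILED) { perror("mmap"); return 1; }

      if (fork() == 0) {
          /* Peer: wait until the owner declares the segment ready, then use
           * it.  If the flag were set too early, head/tail could be garbage. */
          while (!seg->initialized)
              ;                                 /* busy-wait during start-up */
          printf("peer sees head=%d tail=%d\n", seg->fifo_head, seg->fifo_tail);
          _exit(0);
      }

      /* Owner: the buggy pattern is flipping the flag before the FIFO fields
       * are written.  The safe ordering is: populate everything, issue a
       * write barrier (opal_atomic_wmb() in OMPI terms), then set the flag. */
      seg->fifo_head = 1;
      seg->fifo_tail = 1;
      __sync_synchronize();                     /* write barrier, for this sketch */
      seg->initialized = 1;

      wait(NULL);
      return 0;
  }

As written, the sketch shows the safe ordering; move the flag assignment up
(or drop the barrier) and the failure becomes intermittent, which is exactly
what makes this sort of thing hard to catch in MTT.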

Other opinions welcomed.