Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] SM init failures
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-03-27 09:01:56


FWIW, when I was looking into this before, the problem was definitely
during MPI_INIT. I ran out of time before being able to track it down
further, but it was definitely something during the sm startup --
during add_procs, IIRC.

It *looked* like there was some kind of bogus value in the bootstrap
shared memory segment, but I was having great difficulty tracking it
down because the corefiles that are left when the job segv's do
*not* include the shared memory segment. The problem occurred on a
line in the source code where we were accessing values in the
bootstrap shared memory segment, and that's not in the corefile. So
you can't tell exactly what went wrong. :-\

On Mar 27, 2009, at 5:05 AM, Ralph Castain wrote:

> Hmmm...Eugene, you need to be a tad less sensitive. Nobody was
> attempting to indict you or in any way attack you or your code.
>
> What I was attempting to point out is that there are a number of sm
> failures during sm init. I didn't single you out. I posted it to the
> community because (a) it is a persistent problem, as you yourself
> note, that involves code from a number of people; (b) it is something
> users have reported; and (c) it is clearly a race condition, which
> means it will be very difficult to chase down.
>
> So please stop the defensive rhetoric - we are not about assigning
> blame, but rather about getting the code to work right.
>
> Since we clearly see problems on sif, and Josh has indicated a
> willingness to help with debugging, this might be a place to start the
> investigation. If asked nicely, they might even be willing to grant
> access to the machine, if that would help.
>
> Whether or not we fix this for 1.3.2 is a community decision. At some
> point, though, we are going to have to resolve this problem.
>
> Thanks
> Ralph
>
> On Mar 26, 2009, at 11:39 PM, Eugene Loh wrote:
>
> > Ralph Castain wrote:
> >
> >> You are correct - the Sun errors are in a version prior to the
> >> insertion of the SM changes. We didn't relabel the version to
> >> 1.3.2 until -after- those changes went in, so you have to look for
> >> anything with an r number >= 20839.
> >>
> >> The sif errors are all in that group - I would suggest starting
> >> there.
> >>
> >> I suspect Josh or someone at IU could tell you the compiler. I
> >> would be very surprised if it wasn't gcc, but I don't know what
> >> version. I suspect they could even find a way to run some
> >> debugging on it for you, if that would help.
> >
> > Okay, right now I'm not worried about compiler.
> >
> > My attorneys advised me not to speak to the public, but I share with
> > you this prepared statement. :^)
> >
> > I don't claim my code is clean. Honestly, there was sm BTL code that
> > worried me, and I can't claim to have "done no worse" in the changes
> > I made. But this spate of test failures doesn't indict me. (Geez,
> > sorry for being so defensive. I guess I just worry myself!)
> >
> > Let's start with the recent test results you indicated. Say,
> > http://www.open-mpi.org/mtt/index.php?do_redir=973 which shows these
> > failures:
> >
> > 143 on IU_Sif
> > 28 on Sun/Linux (row #6 at that URL, I guess, but you said 34?)
> > 3 on Sun/SunOS (row #7)
> >
> > But, I guess we agreed that the Sun/Linux and Sun/SunOS failures are
> > with 1.3.1, and therefore are not attributable to single-queue changes.
> >
> > So now we look at recent history for IU_Sif. E.g.,
> > http://www.open-mpi.org/mtt/index.php?do_redir=975
> > Here is what I see:
> >
> >  #  MPI name            MPI version      MPI install   Test build   Test run        pass:fail
> >                                           Pass   Fail   Pass  Fail   Pass     Fail   ratio
> >  1  ompi-nightly-trunk  1.4a1r20771          6      0     24     0   10585      11     962
> >  2  ompi-nightly-trunk  1.4a1r20777          6      0     24     0   11880      20     594
> >  3  ompi-nightly-trunk  1.4a1r20781         12      0     48     0   23759      95     250
> >  4  ompi-nightly-trunk  1.4a1r20793         12      0     48     0   23822      61     390
> >  5  ompi-nightly-trunk  1.4a1r20828          8      0     28     8   22893      51     448
> >  6  ompi-nightly-trunk  1.4a1r20834          6      0     20     4   11442      55     208
> >  7  ompi-nightly-trunk  1.4a1r20837         18      0     72     0   34084     157     217
> >  8  ompi-nightly-trunk  1.4a1r20859          2      0     12     0   11900      30     396
> >  9  ompi-nightly-trunk  1.4a1r20884          6      0     24     0   11843      59     200
> > 10  ompi-nightly-v1.3   1.3.1rc5r20730      20      0     71     0   25108     252      99
> > 11  ompi-nightly-v1.3   1.3.1rc5r20794       5      0     18     0    7332     112      65
> > 12  ompi-nightly-v1.3   1.3.1rc5r20810       5      0     18     0    6813      75      90
> > 13  ompi-nightly-v1.3   1.3.1rc5r20826      26      0     96     0   37205    3108      11
> > 14  ompi-nightly-v1.3   1.3.2a1r20855        1      0      6     0     296     107       2
> > 15  ompi-nightly-v1.3   1.3.2a1r20880        5      0     18     0    5825     143      40
> >
> > I added that last "pass:fail ratio" column. The run you indicate (row
> > #15) indeed has a dramatically low pass:fail ratio, but not *THAT*
> > low. E.g., the first 1.3.1 run we see (row #10) is certainly of the
> > same order of magnitude.
> >
> > We can compare those two revs in greater detail. I see this:
> >
> >  #  Suite     np  Pass  Fail    (r20730)
> >  1  ibm       16     0    32
> >  2  intel     16     0   123
> >  3  iu_ft_cr  16     0     3
> >  4  onesided  10     0    16
> >  5  onesided  12     0    32
> >  6  onesided  14     0    24
> >  7  onesided  16     0    22
> >
> >  #  Suite     np  Pass  Fail    (r20880)
> >  1  ibm       16     0    27
> >  2  intel     16     0    38
> >  3  iu_ft_cr  16     0     2
> >  4  onesided   2     0    10
> >  5  onesided   4     0     9
> >  6  onesided   6     0     9
> >  7  onesided   8     0     9
> >  8  onesided  10     0     9
> >  9  onesided  12     0    10
> > 10  onesided  14     0    10
> > 11  onesided  16     0    10
> >
> > To me, r20880 doesn't particularly look worse than r20730.
> >
> > We can deep dive on some of these results. I looked at the "ibm
> > np=16" and "onesided np=16" results a lot. Indeed, r20880 shows lots
> > of seg faults in mca_btl_sm.so. On the other hand, they don't (so far
> > as I can tell) arise in the add_procs stack. Indeed, many aren't in
> > MPI_Init at all. Some have to do with librdmacm. In any case, I seem
> > to find very much the same stack traces for r20730.
> >
> > I'm still worried that my single-queue code either left a race
> > condition in the sm BTL start-up or perhaps even made it worse. The
> > recent MTT failures, however, don't seem to point to that. They seem
> > to point to problems other than the intermittent segv's that Jeff and
> > Terry were seeing, and the data does not seem to me to indicate an
> > increased frequency with 1.3.2.
> >
> > Other opinions welcomed.
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Jeff Squyres
Cisco Systems