
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-03-11 11:38:19


As Terry stated, I think this bugger is quite rare. I'm having a
helluva time trying to reproduce it manually (over 5k runs this
morning and still no segv). Ugh.

Looking through the sm startup code, I can't see exactly what the
problem would be. :-(

On Mar 11, 2009, at 11:34 AM, Ralph Castain wrote:

> I'll run some tests with 1.3.1 on one of our systems and see if it
> shows up there. If it is truly rare and was in 1.3.0, then I
> personally don't have a problem with it. Got bigger problems with
> hanging collectives, frankly - and we don't know how the sm changes
> will affect this problem, if at all.
>
>
> On Mar 11, 2009, at 7:50 AM, Terry Dontje wrote:
>
> > Jeff Squyres wrote:
> >> So -- Brad/George -- this technically isn't a regression against
> >> v1.3.0 (do we know if this can happen in 1.2? I don't recall
> >> seeing it there, but if it's so elusive... I haven't been MTT
> >> testing the 1.2 series in a long time). But it is a nonzero
> >> problem.
> >>
> > I have not seen 1.2 fail with this problem but I honestly don't know
> > if that is a fluke or not.
> >
> > --td
> >
> >> Should we release 1.3.1 without a fix?
> >>
> >> On Mar 11, 2009, at 7:30 AM, Ralph Castain wrote:
> >>
> >>> I actually wasn't implying that Eugene's changes -caused- the
> >>> problem,
> >>> but rather that I thought they might have -fixed- the problem.
> >>>
> >>> :-)
> >>>
> >>>
> >>> On Mar 11, 2009, at 4:34 AM, Terry Dontje wrote:
> >>>
> >>> > I forgot to mention that since I ran into this issue so long
> >>> > ago, I really doubt that Eugene's SM changes have caused this
> >>> > issue.
> >>> >
> >>> > --td
> >>> >
> >>> > Terry Dontje wrote:
> >>> >> Hey!!! I ran into this problem many months ago but it's been so
> >>> >> elusive that I haven't nailed it down. The first time we saw
> >>> >> this was last October. I did some MTT gleaning and could not
> >>> >> find anyone but Solaris having this issue under MTT. What's
> >>> >> interesting is I gleaned Sun's MTT results and could not find
> >>> >> any of these failures as far back as last October.
> >>> >> What it looked like to me was that the shared memory segment
> >>> >> might not have been initialized with 0's, thus allowing one of
> >>> >> the processes to start accessing addresses that did not contain
> >>> >> an appropriate value. However, when I was looking at this I was
> >>> >> told the mmap file was created with ftruncate, which essentially
> >>> >> should 0-fill the memory. So I was at a loss as to why this was
> >>> >> happening.
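> >>> >>
> >>> >> (To spell out the pattern I mean -- a standalone sketch with
> >>> >> made-up file name, not the actual OMPI code. POSIX says the
> >>> >> bytes added by ftruncate() read back as zeros, so the mapping
> >>> >> should start out 0-filled:)
> >>> >>
> >>> >>     #include <fcntl.h>
> >>> >>     #include <stdio.h>
> >>> >>     #include <sys/mman.h>
> >>> >>     #include <unistd.h>
> >>> >>
> >>> >>     int main(void)
> >>> >>     {
> >>> >>         const size_t len = 4096;
> >>> >>         int fd = open("/tmp/sm_sketch", O_RDWR | O_CREAT, 0600);
> >>> >>         if (fd < 0 || ftruncate(fd, len) != 0)
> >>> >>             return 1;
> >>> >>         char *seg = mmap(NULL, len, PROT_READ | PROT_WRITE,
> >>> >>                          MAP_SHARED, fd, 0);
> >>> >>         if (seg == MAP_FAILED)
> >>> >>             return 1;
> >>> >>         /* Every byte should be 0 here; if a peer ever sees
> >>> >>          * nonzero garbage, something other than ftruncate is
> >>> >>          * at fault. */
> >>> >>         printf("first byte = %d\n", seg[0]);
> >>> >>         munmap(seg, len);
> >>> >>         close(fd);
> >>> >>         return 0;
> >>> >>     }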
> >>> >>
> >>> >> I was able to reproduce this for a little while by manually
> >>> >> setting up a script that ran a small np=2 program over and over
> >>> >> for somewhere under 3-4 days. But around November I was unable
> >>> >> to reproduce the issue after 4 days of runs, and I threw up my
> >>> >> hands until I could find more failures under MTT, which for Sun
> >>> >> I haven't.
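> >>> >>
> >>> >> (The script was nothing fancy -- roughly equivalent to this C
> >>> >> driver; "./ring" is just a stand-in for whatever small MPI test
> >>> >> you have handy:)
> >>> >>
> >>> >>     #include <stdio.h>
> >>> >>     #include <stdlib.h>
> >>> >>
> >>> >>     int main(void)
> >>> >>     {
> >>> >>         unsigned long runs = 0;
> >>> >>         /* Hammer a tiny np=2 job until it fails. */
> >>> >>         for (;;) {
> >>> >>             int rc = system("mpirun -np 2 ./ring");
> >>> >>             ++runs;
> >>> >>             if (rc != 0) {
> >>> >>                 fprintf(stderr, "failed after %lu runs (rc=%d)\n",
> >>> >>                         runs, rc);
> >>> >>                 return 1;
> >>> >>             }
> >>> >>         }
> >>> >>     }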
> >>> >>
> >>> >> Note that I was able to reproduce this issue on both SPARC-
> >>> >> and Intel-based platforms.
> >>> >>
> >>> >> --td
> >>> >>
> >>> >> Ralph Castain wrote:
> >>> >>> Hey Jeff
> >>> >>>
> >>> >>> I seem to recall seeing the identical problem reported on the
> >>> >>> user list not long ago... or it may have been the devel list.
> >>> >>> Anyway, it was during btl_sm_add_procs, and the code was
> >>> >>> segv'ing.
> >>> >>>
> >>> >>> I don't have the archives handy here, but perhaps you might
> >>> >>> search them and see if there is a common theme here. IIRC, some of
> >>> >>> Eugene's fixes impacted this problem.
> >>> >>>
> >>> >>> Ralph
> >>> >>>
> >>> >>>
> >>> >>> On Mar 10, 2009, at 7:49 PM, Jeff Squyres wrote:
> >>> >>>
> >>> >>>> On Mar 10, 2009, at 9:13 PM, Jeff Squyres (jsquyres) wrote:
> >>> >>>>
> >>> >>>>> Doh -- I'm seeing intermittent sm btl failures on Cisco
> >>> >>>>> 1.3.1 MTT. :-( I can't reproduce them manually, but they
> >>> >>>>> seem to only happen in a very small fraction of overall MTT
> >>> >>>>> runs. I'm seeing at least 3 classes of errors:
> >>> >>>>>
> >>> >>>>> 1. btl_sm_add_procs.c:529, which is this:
> >>> >>>>>
> >>> >>>>>     if (mca_btl_sm_component.fifo[j][my_smp_rank].head_lock != NULL) {
> >>> >>>>>
> >>> >>>>> j = 3, my_smp_rank = 1. mca_btl_sm_component.fifo[j][my_smp_rank]
> >>> >>>>> appears to have a valid value in it (i.e., .fifo[3][0] = x,
> >>> >>>>> .fifo[3][1] = x+offset, .fifo[3][2] = x+2*offset,
> >>> >>>>> .fifo[3][3] = x+3*offset). But gdb says:
> >>> >>>>>
> >>> >>>>> (gdb) print mca_btl_sm_component.fifo[j][my_smp_rank]
> >>> >>>>> Cannot access memory at address 0x2a96b73050
> >>> >>>>>
> >>> >>>>
> >>> >>>>
> >>> >>>> Bah -- this is a red herring; this memory is in the shared
> >>> >>>> memory segment, and that memory is not saved in the corefile.
> >>> >>>> So of course gdb can't access it (I just did a short
> >>> >>>> controlled test and proved this to myself).
> >>> >>>>
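> >>> >>>> (If you want to try it yourself, my test was along these
> >>> >>>> lines -- a minimal sketch, not exactly what I ran: map a file
> >>> >>>> MAP_SHARED, crash, then try "print *p" on the core in gdb.)
> >>> >>>>
> >>> >>>>     #include <fcntl.h>
> >>> >>>>     #include <sys/mman.h>
> >>> >>>>     #include <unistd.h>
> >>> >>>>
> >>> >>>>     int main(void)
> >>> >>>>     {
> >>> >>>>         int fd = open("/tmp/core_sketch", O_RDWR | O_CREAT, 0600);
> >>> >>>>         ftruncate(fd, 4096);
> >>> >>>>         volatile int *p = mmap(NULL, 4096,
> >>> >>>>                                PROT_READ | PROT_WRITE,
> >>> >>>>                                MAP_SHARED, fd, 0);
> >>> >>>>         *p = 42;  /* mapping is live and readable here */
> >>> >>>>         /* File-backed shared mappings are typically not
> >>> >>>>          * written to the corefile, so gdb on the core says
> >>> >>>>          * "Cannot access memory" for p. */
> >>> >>>>         *(volatile int *)0 = 0;  /* force a segv/core */
> >>> >>>>         return 0;
> >>> >>>>     }
> >>> >>>>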
> >>> >>>> But I don't understand why I would have a bunch of tests that
> >>> >>>> all segv at btl_sm_add_procs.c:529. :-(
> >>> >>>>
> >>> >>>> --
> >>> >>>> Jeff Squyres
> >>> >>>> Cisco Systems
> >>> >>>>
> >>> >>>
> >>> >>
> >>> >>
> >>> >
> >>>
> >>>
> >>
> >>
> >
>

-- 
Jeff Squyres
Cisco Systems