Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-03-11 08:58:18


So -- Brad/George -- this technically isn't a regression against
v1.3.0 (do we know if this can happen in 1.2? I don't recall seeing
it there, but if it's so elusive... I haven't been MTT testing the
1.2 series in a long time). But it is a nonzero problem.

Should we release 1.3.1 without a fix?

On Mar 11, 2009, at 7:30 AM, Ralph Castain wrote:

> I actually wasn't implying that Eugene's changes -caused- the problem,
> but rather that I thought they might have -fixed- the problem.
>
> :-)
>
>
> On Mar 11, 2009, at 4:34 AM, Terry Dontje wrote:
>
> > I forgot to mention that since I ran into this issue so long ago I
> > really doubt that Eugene's SM changes have caused this issue.
> >
> > --td
> >
> > Terry Dontje wrote:
> >> Hey!!! I ran into this problem many months ago but it's been so
> >> elusive that I haven't nailed it down. First time we saw this
> >> was last October. I did some MTT gleaning and could not find
> >> anyone but Solaris having this issue under MTT. What's interesting
> >> is I gleaned Sun's MTT results and could not find any of these
> >> failures as far back as last October.
> >> What it looked like to me was that the shared memory segment might
> >> not have been initialized with 0's, thus allowing one of the
> >> processes to start accessing locations that did not hold a valid
> >> address. However, when I was looking at this I was told the mmap
> >> file was created with ftruncate, which essentially should
> >> zero-fill the memory. So I was at a loss as to why this was
> >> happening.
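
The zero-fill expectation described above can be checked in isolation. The following is a minimal sketch, not Open MPI's sm code: it assumes a POSIX system with a writable /tmp, and the function name and temp-file template are hypothetical. It sizes a fresh file with ftruncate(), maps it MAP_SHARED, and counts any nonzero bytes in the new mapping.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of Open MPI): create a temporary backing
 * file, grow it with ftruncate(), map it MAP_SHARED, and count the bytes
 * in the fresh mapping that are not zero.
 * Returns the count, or -1 if any system call fails. */
long count_nonzero_in_fresh_segment(size_t size)
{
    char path[] = "/tmp/sm_zero_check_XXXXXX";  /* assumed location */
    int fd = mkstemp(path);
    if (fd < 0) {
        return -1;
    }
    unlink(path);  /* the file lives on until fd and the mapping are gone */

    /* Growing an empty file: POSIX says the new area reads as zeros. */
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return -1;
    }

    unsigned char *seg = mmap(NULL, size, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (seg == MAP_FAILED) {
        return -1;
    }

    long nonzero = 0;
    for (size_t i = 0; i < size; i++) {
        if (seg[i] != 0) {
            nonzero++;
        }
    }

    munmap(seg, size);
    return nonzero;
}
```

On a conforming system, count_nonzero_in_fresh_segment(4096) is expected to return 0. This only exercises the ftruncate/mmap guarantee for a newly created file; it says nothing about how the segment is populated or synchronized once several processes attach to it.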
> >>
> >> I was able to reproduce this for a little while by manually setting
> >> up a script that ran a small np=2 program over and over for some
> >> time under 3-4 days. But around November I was unable to reproduce
> >> the issue after 4 days of runs and threw up my hands until I was
> >> able to find more failures under MTT, which for Sun I haven't.
> >>
> >> Note that I was able to reproduce this issue with both SPARC and
> >> Intel based platforms.
> >>
> >> --td
> >>
> >> Ralph Castain wrote:
> >>> Hey Jeff
> >>>
> >>> I seem to recall seeing the identical problem reported on the user
> >>> list not long ago...or may have been the devel list. Anyway, it
> >>> was during btl_sm_add_procs, and the code was segv'ing.
> >>>
> >>> I don't have the archives handy here, but perhaps you might search
> >>> them and see if there is a common theme here. IIRC, some of
> >>> Eugene's fixes impacted this problem.
> >>>
> >>> Ralph
> >>>
> >>>
> >>> On Mar 10, 2009, at 7:49 PM, Jeff Squyres wrote:
> >>>
> >>>> On Mar 10, 2009, at 9:13 PM, Jeff Squyres (jsquyres) wrote:
> >>>>
> >>>>> Doh -- I'm seeing intermittent sm btl failures on Cisco 1.3.1
> >>>>> MTT. :-( I can't reproduce them manually, but they seem to only
> >>>>> happen in a very small fraction of overall MTT runs. I'm seeing
> >>>>> at least 3 classes of errors:
> >>>>>
> >>>>> 1. btl_sm_add_procs.c:529 which is this:
> >>>>>
> >>>>> if(mca_btl_sm_component.fifo[j][my_smp_rank].head_lock != NULL) {
> >>>>>
> >>>>> j = 3, my_smp_rank = 1. mca_btl_sm_component.fifo[j][my_smp_rank]
> >>>>> appears to have a valid value in it (i.e., .fifo[3][0] = x,
> >>>>> .fifo[3][1] = x+offset, .fifo[3][2] = x+2*offset,
> >>>>> .fifo[3][3] = x+3*offset). But gdb says:
> >>>>> But gdb says:
> >>>>>
> >>>>> (gdb) print mca_btl_sm_component.fifo[j][my_smp_rank]
> >>>>> Cannot access memory at address 0x2a96b73050
> >>>>>
> >>>>
> >>>>
> >>>> Bah -- this is a red herring; this memory is in the shared memory
> >>>> segment, and that memory is not saved in the corefile. So of
> >>>> course gdb can't access it (I just did a short controlled test
> >>>> and proved this to myself).
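
One way to see why shared memory can be missing from a corefile, assuming a Linux kernel that provides /proc/<pid>/coredump_filter (documented in core(5); older kernels do not have it): that file holds a hexadecimal bit mask selecting which mapping types are written to a core, and bit 3 covers file-backed shared mappings. The sketch below reads the mask for the current process and tests that bit. The function names are hypothetical, and the thread does not say which kernel or mechanism was involved in the test described above.

```c
#include <assert.h>
#include <stdio.h>

/* Bit 3 of the coredump_filter mask selects file-backed shared mappings
 * (per core(5)); the macro and function names here are hypothetical. */
#define FILE_BACKED_SHARED_BIT 3

/* Returns 1 if the given mask asks the kernel to dump file-backed shared
 * mappings, 0 otherwise. */
int dumps_file_backed_shared(unsigned long mask)
{
    return (int)((mask >> FILE_BACKED_SHARED_BIT) & 1UL);
}

/* Reads the current process's coredump_filter mask.
 * Returns the mask, or -1 if the file is missing or unreadable
 * (for example on a kernel without this feature). */
long read_coredump_filter(void)
{
    FILE *f = fopen("/proc/self/coredump_filter", "r");
    if (f == NULL) {
        return -1;
    }

    unsigned long mask = 0;
    int matched = fscanf(f, "%lx", &mask);  /* the file holds a hex value */
    fclose(f);

    return (matched == 1) ? (long)mask : -1;
}
```

With the commonly documented default mask of 0x33, dumps_file_backed_shared(0x33) is 0, so a segment mapped from a file with MAP_SHARED would be left out of the core, and gdb would then report that it cannot access those addresses.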
> >>>>
> >>>> But I don't understand why I would have a bunch of tests that all
> >>>> segv at btl_sm_add_procs.c:529. :-(
> >>>>
> >>>> --
> >>>> Jeff Squyres
> >>>> Cisco Systems
> >>>>
> >>>> _______________________________________________
> >>>> devel mailing list
> >>>> devel_at_[hidden]
> >>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Jeff Squyres
Cisco Systems