Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-03-11 12:20:20


If it is that hard to replicate outside of MTT, then by all means
let's just release it - users will probably never see it.

On Mar 11, 2009, at 10:07 AM, Terry Dontje wrote:

> Ralph Castain wrote:
>> You know, this isn't the first time we have encountered errors that
>> -only- appear when running under MTT. As per my other note, we are
>> not seeing these failures here, even though almost all our users
>> run under batch/scripts.
>>
>> This has been the case with at least some of these other MTT-only
>> errors as well. It can't help but make one wonder if there isn't
>> something about MTT that is causing these failures to occur. It
>> just seems too bizarre that a true code problem would -only- show
>> itself when executing under MTT. You would think that it would have
>> to appear in a script and/or batch environment as well.
>>
>> Just something to consider.
> Ok, I actually have reproduced this error outside of MTT. But it
> took a script running the same program for over a couple of days. So
> in this particular instance I don't believe MTT is adding any
> badness other than possibly adding a load to the system.
>
> --td
>>
>>
>> On Mar 11, 2009, at 9:38 AM, Jeff Squyres wrote:
>>
>>> As Terry stated, I think this bugger is quite rare. I'm having a
>>> helluva time trying to reproduce it manually (over 5k runs this
>>> morning and still no segv). Ugh.
>>>
>>> Looking through the sm startup code, I can't see exactly what the
>>> problem would be. :-(
>>>
>>>
>>> On Mar 11, 2009, at 11:34 AM, Ralph Castain wrote:
>>>
>>>> I'll run some tests with 1.3.1 on one of our systems and see if it
>>>> shows up there. If it is truly rare and was in 1.3.0, then I
>>>> personally don't have a problem with it. Got bigger problems with
>>>> hanging collectives, frankly - and we don't know how the sm changes
>>>> will affect this problem, if at all.
>>>>
>>>>
>>>> On Mar 11, 2009, at 7:50 AM, Terry Dontje wrote:
>>>>
>>>> > Jeff Squyres wrote:
>>>> >> So -- Brad/George -- this technically isn't a regression against
>>>> >> v1.3.0 (do we know if this can happen in 1.2? I don't recall
>>>> >> seeing it there, but if it's so elusive... I haven't been MTT
>>>> >> testing the 1.2 series in a long time). But it is a nonzero
>>>> >> problem.
>>>> >>
>>>> > I have not seen 1.2 fail with this problem but I honestly don't
>>>> > know if that is a fluke or not.
>>>> >
>>>> > --td
>>>> >
>>>> >> Should we release 1.3.1 without a fix?
>>>> >>
>>>> >
>>>> >>
>>>> >> On Mar 11, 2009, at 7:30 AM, Ralph Castain wrote:
>>>> >>
>>>> >>> I actually wasn't implying that Eugene's changes -caused- the
>>>> >>> problem,
>>>> >>> but rather that I thought they might have -fixed- the problem.
>>>> >>>
>>>> >>> :-)
>>>> >>>
>>>> >>>
>>>> >>> On Mar 11, 2009, at 4:34 AM, Terry Dontje wrote:
>>>> >>>
>>>> >>> > I forgot to mention that since I ran into this issue so long
>>>> >>> > ago, I really doubt that Eugene's SM changes have caused this issue.
>>>> >>> >
>>>> >>> > --td
>>>> >>> >
>>>> >>> > Terry Dontje wrote:
>>>> >>> >> Hey!!! I ran into this problem many months ago but it's been
>>>> >>> >> so elusive that I haven't nailed it down. First time we saw
>>>> >>> >> this was last October. I did some MTT gleaning and could not
>>>> >>> >> find anyone but Solaris having this issue under MTT. What's
>>>> >>> >> interesting is I gleaned Sun's MTT results and could not find
>>>> >>> >> any of these failures as far back as last October.
>>>> >>> >> What it looked like to me was that the shared memory segment
>>>> >>> >> might not have been initialized with 0's, thus allowing one of
>>>> >>> >> the processes to start accessing addresses that did not
>>>> >>> >> contain an appropriate address. However, when I was looking at
>>>> >>> >> this I was told the mmap file was created with ftruncate,
>>>> >>> >> which essentially should 0-fill the memory. So I was at a loss
>>>> >>> >> as to why this was happening.
>>>> >>> >>
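For reference, a minimal sketch of the ftruncate-then-mmap pattern Terry
refers to above -- not the actual Open MPI sm setup code. The backing-file
path and segment size are made up for illustration; the point is that POSIX
specifies that bytes added by ftruncate() read back as zero, so a fresh
shared mapping of the extended file should come up zero-filled.

    /* Minimal sketch (not the actual Open MPI sm code) of the pattern
     * described above: extend a backing file with ftruncate(), then mmap()
     * it shared.  POSIX says the bytes added by ftruncate() read as zero,
     * so every page of the mapping should start out zero-filled. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t seg_size = 1 << 20;          /* 1 MB segment (illustrative) */
        int fd = open("/tmp/sm_backing_file",     /* hypothetical path */
                      O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("open"); return 1; }

        if (ftruncate(fd, seg_size) < 0) {        /* new bytes read as zero */
            perror("ftruncate"); return 1;
        }

        char *seg = mmap(NULL, seg_size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (seg == MAP_FAILED) { perror("mmap"); return 1; }

        /* If the zero-fill guarantee holds, this prints 0. */
        printf("first byte = %d\n", seg[0]);

        munmap(seg, seg_size);
        close(fd);
        return 0;
    }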
>>>> >>> >> I was able to reproduce this for a little while by manually
>>>> >>> >> setting up a script that ran a small np=2 program over and
>>>> >>> >> over for somewhere under 3-4 days. But around November I was
>>>> >>> >> unable to reproduce the issue after 4 days of runs and threw
>>>> >>> >> up my hands until I was able to find more failures under MTT,
>>>> >>> >> which for Sun I haven't.
>>>> >>> >>
>>>> >>> >> Note that I was able to reproduce this issue on both SPARC-
>>>> >>> >> and Intel-based platforms.
>>>> >>> >>
>>>> >>> >> --td
>>>> >>> >>
>>>> >>> >> Ralph Castain wrote:
>>>> >>> >>> Hey Jeff
>>>> >>> >>>
>>>> >>> >>> I seem to recall seeing the identical problem reported on
>>>> >>> >>> the user list not long ago... or may have been the devel
>>>> >>> >>> list. Anyway, it was during btl_sm_add_procs, and the code
>>>> >>> >>> was segv'ing.
>>>> >>> >>>
>>>> >>> >>> I don't have the archives handy here, but perhaps you might
>>>> >>> >>> search them and see if there is a common theme here. IIRC,
>>>> >>> >>> some of Eugene's fixes impacted this problem.
>>>> >>> >>>
>>>> >>> >>> Ralph
>>>> >>> >>>
>>>> >>> >>>
>>>> >>> >>> On Mar 10, 2009, at 7:49 PM, Jeff Squyres wrote:
>>>> >>> >>>
>>>> >>> >>>> On Mar 10, 2009, at 9:13 PM, Jeff Squyres (jsquyres) wrote:
>>>> >>> >>>>
>>>> >>> >>>>> Doh -- I'm seeing intermittent sm btl failures on Cisco
>>>> >>> >>>>> 1.3.1 MTT. :-( I can't reproduce them manually, but they
>>>> >>> >>>>> seem to only happen in a very small fraction of overall
>>>> >>> >>>>> MTT runs. I'm seeing at least 3 classes of errors:
>>>> >>> >>>>>
>>>> >>> >>>>> 1. btl_sm_add_procs.c:529, which is this:
>>>> >>> >>>>>
>>>> >>> >>>>>     if(mca_btl_sm_component.fifo[j][my_smp_rank].head_lock != NULL) {
>>>> >>> >>>>>
>>>> >>> >>>>> j = 3, my_smp_rank = 1. mca_btl_sm_component.fifo[j][my_smp_rank]
>>>> >>> >>>>> appears to have a valid value in it (i.e., .fifo[3][0] = x,
>>>> >>> >>>>> .fifo[3][1] = x+offset, .fifo[3][2] = x+2*offset,
>>>> >>> >>>>> .fifo[3][3] = x+3*offset). But gdb says:
>>>> >>> >>>>>
>>>> >>> >>>>> (gdb) print mca_btl_sm_component.fifo[j][my_smp_rank]
>>>> >>> >>>>> Cannot access memory at address 0x2a96b73050
>>>> >>> >>>>>
>>>> >>> >>>>
>>>> >>> >>>>
>>>> >>> >>>> Bah -- this is a red herring; this memory is in the shared
>>>> >>> >>>> memory segment, and that memory is not saved in the
>>>> >>> >>>> corefile. So of course gdb can't access it (I just did a
>>>> >>> >>>> short controlled test and proved this to myself).
>>>> >>> >>>>
>>>> >>> >>>> But I don't understand why I would have a bunch of tests
>>>> >>> >>>> that all segv at btl_sm_add_procs.c:529. :-(
>>>> >>> >>>>
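For reference, a minimal Linux-specific sketch of one way to get that shared
memory segment into the core file so gdb can inspect it: writing to
/proc/self/coredump_filter to enable dumping of file-backed shared mappings
(bit 3). This is not from the thread; whether it applies depends on the
kernel in use, and the 0x3b mask shown is simply the usual default bits plus
that one.

    /* Minimal Linux-specific sketch (not part of the thread's code): ask the
     * kernel to include file-backed shared mappings -- such as an mmap'd sm
     * segment -- in any core file this process dumps.  Bit 3 (0x8) of
     * /proc/self/coredump_filter controls file-backed shared memory; the
     * other bits set here keep the usual anonymous mappings and ELF headers. */
    #include <stdio.h>

    static int enable_shared_mapping_coredumps(void)
    {
        FILE *f = fopen("/proc/self/coredump_filter", "w");
        if (f == NULL) {
            return -1;               /* kernel without coredump_filter support */
        }
        /* 0x3b = anon private | anon shared | file-backed shared
                | ELF headers | hugetlb private */
        fprintf(f, "0x3b\n");
        fclose(f);
        return 0;
    }

    int main(void)
    {
        if (enable_shared_mapping_coredumps() != 0) {
            fprintf(stderr, "could not adjust coredump_filter\n");
        }
        /* ... run the code under test; a subsequent core dump should now
         * contain the mmap'd shared memory segment so gdb can inspect it. */
        return 0;
    }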
>>>> >>> >>>> --
>>>> >>> >>>> Jeff Squyres
>>>> >>> >>>> Cisco Systems
>>>
>>> --
>>> Jeff Squyres
>>> Cisco Systems
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel