Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco
From: Terry Dontje (Terry.Dontje_at_[hidden])
Date: 2009-03-11 12:07:57


Ralph Castain wrote:
> You know, this isn't the first time we have encountered errors that
> -only- appear when running under MTT. As per my other note, we are not
> seeing these failures here, even though almost all our users run under
> batch/scripts.
>
> This has been the case with at least some of these other MTT-only
> errors as well. It can't help but make one wonder if there isn't
> something about MTT that is causing these failures to occur. It just
> seems too bizarre that a true code problem would -only- show itself
> when executing under MTT. You would think that it would have to appear
> in a script and/or batch environment as well.
>
> Just something to consider.
Ok, I actually have reproduced this error outside of MTT, but it took a
script running the same program for over a couple of days. So in this
particular instance I don't believe MTT is adding any badness other than
possibly putting extra load on the system.

--td
>
>
> On Mar 11, 2009, at 9:38 AM, Jeff Squyres wrote:
>
>> As Terry stated, I think this bugger is quite rare. I'm having a
>> helluva time trying to reproduce it manually (over 5k runs this
>> morning and still no segv). Ugh.
>>
>> Looking through the sm startup code, I can't see exactly what the
>> problem would be. :-(
>>
>>
>> On Mar 11, 2009, at 11:34 AM, Ralph Castain wrote:
>>
>>> I'll run some tests with 1.3.1 on one of our systems and see if it
>>> shows up there. If it is truly rare and was in 1.3.0, then I
>>> personally don't have a problem with it. Got bigger problems with
>>> hanging collectives, frankly - and we don't know how the sm changes
>>> will affect this problem, if at all.
>>>
>>>
>>> On Mar 11, 2009, at 7:50 AM, Terry Dontje wrote:
>>>
>>> > Jeff Squyres wrote:
>>> >> So -- Brad/George -- this technically isn't a regression against
>>> >> v1.3.0 (do we know if this can happen in 1.2? I don't recall
>>> >> seeing it there, but if it's so elusive... I haven't been MTT
>>> >> testing the 1.2 series in a long time). But it is a nonzero problem.
>>> >>
>>> > I have not seen 1.2 fail with this problem but I honestly don't know
>>> > if that is a fluke or not.
>>> >
>>> > --td
>>> >
>>> >> Should we release 1.3.1 without a fix?
>>> >>
>>> >
>>> >>
>>> >> On Mar 11, 2009, at 7:30 AM, Ralph Castain wrote:
>>> >>
>>> >>> I actually wasn't implying that Eugene's changes -caused- the
>>> >>> problem,
>>> >>> but rather that I thought they might have -fixed- the problem.
>>> >>>
>>> >>> :-)
>>> >>>
>>> >>>
>>> >>> On Mar 11, 2009, at 4:34 AM, Terry Dontje wrote:
>>> >>>
>>> >>> > I forgot to mention that since I ran into this issue so long ago,
>>> >>> > I really doubt that Eugene's SM changes have caused this issue.
>>> >>> >
>>> >>> > --td
>>> >>> >
>>> >>> > Terry Dontje wrote:
>>> >>> >> Hey!!! I ran into this problem many months ago, but it's been so
>>> >>> >> elusive that I haven't nailed it down. The first time we saw this
>>> >>> >> was last October. I did some MTT gleaning and could not find
>>> >>> >> anyone but Solaris having this issue under MTT. What's interesting
>>> >>> >> is that I gleaned Sun's MTT results and could not find any of these
>>> >>> >> failures as far back as last October.
>>> >>> >> What it looked like to me was that the shared memory segment might
>>> >>> >> not have been initialized with 0's, thus allowing one of the
>>> >>> >> processes to start accessing addresses that did not contain
>>> >>> >> appropriate values. However, when I was looking at this I was
>>> >>> >> told the mmap file was created with ftruncate, which essentially
>>> >>> >> should 0-fill the memory. So I was at a loss as to why this was
>>> >>> >> happening.
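
As a point of reference, the ftruncate()/mmap() zero-fill behavior Terry
describes can be checked in isolation with a small standalone program along
the lines of the sketch below; this is not the Open MPI sm setup code, and
the backing-file path is made up:

    #include <assert.h>
    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 4096;
        /* Hypothetical backing-file path, standing in for the real sm mmap file. */
        const char *path = "/tmp/sm_zero_fill_test";

        int fd = open(path, O_CREAT | O_RDWR, 0600);
        assert(fd >= 0);

        /* POSIX: bytes added by extending a file with ftruncate() read back as 0. */
        assert(ftruncate(fd, (off_t)len) == 0);

        char *seg = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        assert(seg != MAP_FAILED);

        /* If the zero-fill guarantee holds, every byte of the new mapping is 0. */
        for (size_t i = 0; i < len; i++) {
            assert(seg[i] == 0);
        }

        munmap(seg, len);
        close(fd);
        unlink(path);
        return 0;
    }

If that assertion ever fired on a given platform or filesystem, it would
point toward the uninitialized-segment scenario described above.
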
>>> >>> >>
>>> >>> >> I was able to reproduce this for a little while by manually setting
>>> >>> >> up a script that ran a small np=2 program over and over for
>>> >>> >> somewhere under 3-4 days. But around November I was unable to
>>> >>> >> reproduce the issue after 4 days of runs and threw up my hands
>>> >>> >> until I was able to find more failures under MTT, which for Sun I
>>> >>> >> haven't.
>>> >>> >>
>>> >>> >> Note that I was able to reproduce this issue on both SPARC- and
>>> >>> >> Intel-based platforms.
>>> >>> >>
>>> >>> >> --td
>>> >>> >>
>>> >>> >> Ralph Castain wrote:
>>> >>> >>> Hey Jeff
>>> >>> >>>
>>> >>> >>> I seem to recall seeing the identical problem reported on the user
>>> >>> >>> list not long ago... or may have been the devel list. Anyway, it
>>> >>> >>> was during btl_sm_add_procs, and the code was segv'ing.
>>> >>> >>>
>>> >>> >>> I don't have the archives handy here, but perhaps you might search
>>> >>> >>> them and see if there is a common theme here. IIRC, some of
>>> >>> >>> Eugene's fixes impacted this problem.
>>> >>> >>>
>>> >>> >>> Ralph
>>> >>> >>>
>>> >>> >>>
>>> >>> >>> On Mar 10, 2009, at 7:49 PM, Jeff Squyres wrote:
>>> >>> >>>
>>> >>> >>>> On Mar 10, 2009, at 9:13 PM, Jeff Squyres (jsquyres) wrote:
>>> >>> >>>>
>>> >>> >>>>> Doh -- I'm seeing intermittent sm btl failures on Cisco 1.3.1
>>> >>> >>>>> MTT. :-( I can't reproduce them manually, but they seem to only
>>> >>> >>>>> happen in a very small fraction of overall MTT runs. I'm seeing
>>> >>> >>>>> at least 3 classes of errors:
>>> >>> >>>>>
>>> >>> >>>>> 1. btl_sm_add_procs.c:529 which is this:
>>> >>> >>>>>
>>> >>> >>>>> if(mca_btl_sm_component.fifo[j][my_smp_rank].head_lock != NULL) {
>>> >>> >>>>>
>>> >>> >>>>> j = 3, my_smp_rank = 1. mca_btl_sm_component.fifo[j][my_smp_rank]
>>> >>> >>>>> appears to have a valid value in it (i.e., .fifo[3][0] = x,
>>> >>> >>>>> .fifo[3][1] = x+offset, .fifo[3][2] = x+2*offset,
>>> >>> >>>>> .fifo[3][3] = x+3*offset).
>>> >>> >>>>> But gdb says:
>>> >>> >>>>>
>>> >>> >>>>> (gdb) print mca_btl_sm_component.fifo[j][my_smp_rank]
>>> >>> >>>>> Cannot access memory at address 0x2a96b73050
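
As an aside, the x / x+offset / x+2*offset / x+3*offset spacing above is
what one would expect if fifo[3] is simply a pointer to an array of per-peer
FIFO structures inside the shared segment: consecutive elements sit exactly
sizeof(struct) apart. A purely illustrative sketch of that address
arithmetic (struct example_fifo is made up; the real sm BTL definitions
differ):

    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative stand-in only, not the real sm BTL FIFO structure. */
    struct example_fifo {
        void *head_lock;   /* NULL until the owning rank initializes its entry */
        void *tail_lock;
        void *queue;
    };

    int main(void)
    {
        /* Pretend this row lives in the shared segment, one entry per local rank. */
        struct example_fifo *row = calloc(4, sizeof(*row));
        if (row == NULL) {
            return 1;
        }
        for (int i = 0; i < 4; i++) {
            /* Consecutive entries differ by exactly sizeof(struct example_fifo),
             * i.e. the x, x+offset, x+2*offset, x+3*offset pattern seen in gdb. */
            printf("&row[%d] = %p\n", i, (void *)&row[i]);
        }
        free(row);
        return 0;
    }

That spacing alone does not explain the segv, but it is consistent with
Jeff's observation that the row pointer itself "appears to have a valid
value in it."
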
>>> >>> >>>>>
>>> >>> >>>>
>>> >>> >>>>
>>> >>> >>>> Bah -- this is a red herring; this memory is in the shared memory
>>> >>> >>>> segment, and that memory is not saved in the corefile. So of
>>> >>> >>>> course gdb can't access it (I just did a short controlled test
>>> >>> >>>> and proved this to myself).
>>> >>> >>>>
>>> >>> >>>> But I don't understand why I would have a bunch of tests that all
>>> >>> >>>> segv at btl_sm_add_procs.c:529. :-(
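
Related to the corefile point: on Linux, file-backed shared mappings can be
opted into core dumps via /proc/<pid>/coredump_filter (bit 3), which would
make an mmap'ed shared-memory segment visible to gdb in a core. A hedged,
Linux-specific sketch of a helper a test could call at startup (the helper
name is hypothetical, not an existing function):

    #include <stdio.h>

    /* Hypothetical Linux-only helper: ask the kernel to include file-backed
     * shared mappings (bit 3 of coredump_filter) in this process's core dumps,
     * so an mmap'ed shared-memory segment shows up in gdb afterwards. */
    static int include_shared_mappings_in_core(void)
    {
        FILE *f = fopen("/proc/self/coredump_filter", "w");
        if (f == NULL) {
            return -1;            /* not Linux, or the kernel is too old */
        }
        /* 0x33 is the usual default; OR-ing in 0x08 adds file-backed shared maps. */
        fprintf(f, "0x3b");
        fclose(f);
        return 0;
    }

    int main(void)
    {
        return include_shared_mappings_in_core() == 0 ? 0 : 1;
    }

Per the kernel's coredump_filter documentation the setting is inherited by
child processes, so setting it in a wrapper before launching the test would
also work.
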
>>> >>> >>>>
>>> >>> >>>> --
>>> >>> >>>> Jeff Squyres
>>> >>> >>>> Cisco Systems
>>> >>> >>>>
>> --
>> Jeff Squyres
>> Cisco Systems