Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-03-11 11:41:03


FWIW, we have people running dozens of jobs every day with 1.3.0 built
with Intel 10.0.23 and PGI 7.2-5 compilers, using -mca btl
sm,openib,self...and have not received a single report of this failure.

This is all on Linux machines (various kernels), under both slurm and
torque environments.

Could be nobody is saying anything... but I would be surprised if
-nobody- barked at a segfault during startup.

On Mar 11, 2009, at 9:34 AM, Ralph Castain wrote:

> I'll run some tests with 1.3.1 on one of our systems and see if it
> shows up there. If it is truly rare and was in 1.3.0, then I
> personally don't have a problem with it. Got bigger problems with
> hanging collectives, frankly - and we don't know how the sm changes
> will affect this problem, if at all.
>
>
> On Mar 11, 2009, at 7:50 AM, Terry Dontje wrote:
>
>> Jeff Squyres wrote:
>>> So -- Brad/George -- this technically isn't a regression against
>>> v1.3.0 (do we know if this can happen in 1.2? I don't recall
>>> seeing it there, but if it's so elusive... I haven't been MTT
>>> testing the 1.2 series in a long time). But it is a nonzero
>>> problem.
>>>
>> I have not seen 1.2 fail with this problem but I honestly don't
>> know if that is a fluke or not.
>>
>> --td
>>
>>> Should we release 1.3.1 without a fix?
>>>
>>
>>>
>>> On Mar 11, 2009, at 7:30 AM, Ralph Castain wrote:
>>>
>>>> I actually wasn't implying that Eugene's changes -caused- the
>>>> problem,
>>>> but rather that I thought they might have -fixed- the problem.
>>>>
>>>> :-)
>>>>
>>>>
>>>> On Mar 11, 2009, at 4:34 AM, Terry Dontje wrote:
>>>>
>>>> > I forgot to mention that since I ran into this issue so long
>>>> > ago, I really doubt that Eugene's SM changes have caused this
>>>> > issue.
>>>> >
>>>> > --td
>>>> >
>>>> > Terry Dontje wrote:
>>>> >> Hey!!! I ran into this problem many months ago, but it's been
>>>> >> so elusive that I haven't nailed it down. The first time we saw
>>>> >> it was last October. I did some MTT gleaning and could not find
>>>> >> anyone but Solaris having this issue under MTT. What's
>>>> >> interesting is that when I gleaned Sun's MTT results, I could
>>>> >> not find any of these failures as far back as last October.
>>>> >> What it looked like to me was that the shared memory segment
>>>> >> might not have been initialized with 0's, thus allowing one of
>>>> >> the processes to start accessing addresses that did not yet
>>>> >> contain appropriate values. However, when I was looking at this
>>>> >> I was told the mmap file was created with ftruncate, which
>>>> >> essentially should 0-fill the memory. So I was at a loss as to
>>>> >> why this was happening.
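
As a reference point for that ftruncate() behavior, here is a minimal
standalone sketch (not the actual Open MPI mmap setup code; the file
name and size are arbitrary). A file extended with ftruncate() is
specified to zero-fill the new bytes, so a fresh MAP_SHARED mapping of
it should read back as all zeros:

/* Minimal sketch: extend a file with ftruncate(), mmap it MAP_SHARED,
 * and check that every byte reads back as 0. File name and size are
 * arbitrary; error handling is reduced to asserts. */
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 4096;
    int fd = open("/tmp/sm_zero_test", O_CREAT | O_RDWR, 0600);
    assert(fd >= 0);

    /* ftruncate() is specified to zero-fill the bytes it adds. */
    assert(ftruncate(fd, (off_t)len) == 0);

    char *seg = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    assert(seg != MAP_FAILED);

    /* A freshly created segment should therefore be all zeros. */
    for (size_t i = 0; i < len; i++)
        assert(seg[i] == 0);

    munmap(seg, len);
    close(fd);
    unlink("/tmp/sm_zero_test");
    return 0;
}
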
>>>> >>
>>>> >> I was able to reproduce this for a little while by manually
>>>> >> setting up a script that ran a small np=2 program over and over
>>>> >> for some time, under 3-4 days. But around November I was unable
>>>> >> to reproduce the issue after 4 days of runs, and threw up my
>>>> >> hands until I was able to find more failures under MTT -- which,
>>>> >> for Sun, I haven't.
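
As a rough illustration of that kind of reproducer (my sketch, not
Terry's actual test), here is a trivial np=2 ping-pong that a wrapper
script could rerun in a loop, e.g. something like
"while mpirun -np 2 -mca btl sm,self ./pingpong; do :; done", until the
sm startup segfault shows up:

/* Hypothetical np=2 ping-pong used only as a loopable smoke test; the
 * segfault discussed in this thread would occur during MPI_Init
 * (btl_sm_add_procs), before the Send/Recv pair even runs. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
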
>>>> >>
>>>> >> Note that I was able to reproduce this issue on both SPARC- and
>>>> >> Intel-based platforms.
>>>> >>
>>>> >> --td
>>>> >>
>>>> >> Ralph Castain wrote:
>>>> >>> Hey Jeff
>>>> >>>
>>>> >>> I seem to recall seeing the identical problem reported on the
>>>> >>> user list not long ago... or it may have been the devel list.
>>>> >>> Anyway, it was during btl_sm_add_procs, and the code was
>>>> >>> segv'ing.
>>>> >>>
>>>> >>> I don't have the archives handy here, but perhaps you might
>>>> >>> search them and see if there is a common theme. IIRC, some of
>>>> >>> Eugene's fixes impacted this problem.
>>>> >>>
>>>> >>> Ralph
>>>> >>>
>>>> >>>
>>>> >>> On Mar 10, 2009, at 7:49 PM, Jeff Squyres wrote:
>>>> >>>
>>>> >>>> On Mar 10, 2009, at 9:13 PM, Jeff Squyres (jsquyres) wrote:
>>>> >>>>
>>>> >>>>> Doh -- I'm seeing intermittent sm btl failures on Cisco
>>>> >>>>> 1.3.1 MTT. :-( I can't reproduce them manually, but they
>>>> >>>>> seem to only happen in a very small fraction of overall MTT
>>>> >>>>> runs. I'm seeing at least 3 classes of errors:
>>>> >>>>>
>>>> >>>>> 1. btl_sm_add_procs.c:529, which is this:
>>>> >>>>>
>>>> >>>>>     if(mca_btl_sm_component.fifo[j][my_smp_rank].head_lock != NULL) {
>>>> >>>>>
>>>> >>>>> j = 3, my_smp_rank = 1. mca_btl_sm_component.fifo[j][my_smp_rank]
>>>> >>>>> appears to have a valid value in it (i.e., .fifo[3][0] = x,
>>>> >>>>> .fifo[3][1] = x+offset, .fifo[3][2] = x+2*offset,
>>>> >>>>> .fifo[3][3] = x+3*offset). But gdb says:
>>>> >>>>>
>>>> >>>>> (gdb) print mca_btl_sm_component.fifo[j][my_smp_rank]
>>>> >>>>> Cannot access memory at address 0x2a96b73050
>>>> >>>>>
>>>> >>>>
>>>> >>>>
>>>> >>>> Bah -- this is a red herring; this memory is in the shared
>>>> >>>> memory segment, and that memory is not saved in the corefile.
>>>> >>>> So of course gdb can't access it (I just did a short
>>>> >>>> controlled test and proved this to myself).
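
For anyone who wants to repeat that sort of controlled test, here is a
sketch of one way it could look (my guess at its shape, not Jeff's
exact test; the file name is made up). On Linux, file-backed MAP_SHARED
mappings such as the sm segment are excluded from corefiles by the
default /proc/<pid>/coredump_filter setting, which is why gdb reports
"Cannot access memory" for addresses inside them:

/* Sketch: crash while holding a file-backed MAP_SHARED segment, then
 * load the core in gdb and try "print seg[0]". Because the default
 * coredump_filter does not dump file-backed shared mappings, gdb
 * typically answers "Cannot access memory at address ...". */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 4096;
    int fd = open("/tmp/fake_sm_segment", O_CREAT | O_RDWR, 0600);

    ftruncate(fd, (off_t)len);
    char *seg = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    seg[0] = 42;   /* the segment holds a perfectly valid value here... */
    abort();       /* ...but it will not be written into the corefile   */
}
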
>>>> >>>>
>>>> >>>> But I don't understand why I would have a bunch of tests that
>>>> >>>> all segv at btl_sm_add_procs.c:529. :-(
>>>> >>>>
>>>> >>>> --
>>>> >>>> Jeff Squyres
>>>> >>>> Cisco Systems