Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco
From: Terry Dontje (Terry.Dontje_at_[hidden])
Date: 2009-03-11 06:34:58


I forgot to mention that since I ran into this issue so long ago, I
really doubt that Eugene's SM changes caused it.

--td

Terry Dontje wrote:
> Hey!!! I ran into this problem many months ago, but it's been so
> elusive that I haven't nailed it down. The first time we saw this was
> last October. I did some MTT gleaning and could not find anyone but
> Solaris having this issue under MTT. What's interesting is that I
> gleaned Sun's MTT results and could not find any of these failures as
> far back as last October.
> What it looked like to me was that the shared memory segment might not
> have been initialized with 0's, thus allowing one of the processes to
> start accessing locations that did not hold a valid address. However,
> when I was looking at this I was told the mmap file was created with
> ftruncate, which essentially should 0-fill the memory (see the sketch
> below). So I was at a loss as to why this was happening.
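
A minimal sketch (not from the original thread; the file path and names are
made up) of why a file extended with ftruncate() and then mmap'ed should read
back as all zeros, which is the behavior Terry describes being told about:

    /* sm_zero_demo.c -- illustrative only, not Open MPI code */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 4096;
        int fd = open("/tmp/sm_zero_demo", O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd < 0) { perror("open"); return 1; }

        /* POSIX: bytes added by extending a file with ftruncate() read as 0 */
        if (ftruncate(fd, (off_t)len) != 0) { perror("ftruncate"); return 1; }

        char *seg = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (seg == MAP_FAILED) { perror("mmap"); return 1; }

        /* Every byte of the new mapping should be zero at this point */
        for (size_t i = 0; i < len; i++) {
            if (seg[i] != 0) { fprintf(stderr, "non-zero at %zu\n", i); return 1; }
        }
        printf("mapping is zero-filled as expected\n");

        munmap(seg, len);
        close(fd);
        unlink("/tmp/sm_zero_demo");
        return 0;
    }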
>
> I was able to reproduce this for a little while by manually setting up
> a script that ran a small np=2 program over and over for somewhere
> under 3-4 days. But around November I was unable to reproduce the
> issue after 4 days of runs and threw up my hands until I could find
> more failures under MTT, which for Sun I haven't.
>
> Note that I was able to reproduce this issue with both SPARC and Intel
> based platforms.
>
> --td
>
> Ralph Castain wrote:
>> Hey Jeff
>>
>> I seem to recall seeing the identical problem reported on the user
>> list not long ago... or it may have been the devel list. Anyway, it
>> was during btl_sm_add_procs, and the code was segv'ing.
>>
>> I don't have the archives handy here, but perhaps you might search
>> them and see if there is a common theme here. IIRC, some of Eugene's
>> fixes impacted this problem.
>>
>> Ralph
>>
>>
>> On Mar 10, 2009, at 7:49 PM, Jeff Squyres wrote:
>>
>>> On Mar 10, 2009, at 9:13 PM, Jeff Squyres (jsquyres) wrote:
>>>
>>>> Doh -- I'm seeing intermittent sm btl failures on Cisco 1.3.1
>>>> MTT. :-( I can't reproduce them manually, but they seem to only
>>>> happen in a very small fraction of overall MTT runs. I'm seeing at
>>>> least 3 classes of errors:
>>>>
>>>> 1. btl_sm_add_procs.c:529 which is this:
>>>>
>>>>     if(mca_btl_sm_component.fifo[j][my_smp_rank].head_lock != NULL) {
>>>>
>>>> j = 3, my_smp_rank = 1. mca_btl_sm_component.fifo[j][my_smp_rank]
>>>> appears to have a valid value in it (i.e., .fifo[3][0] = x,
>>>> .fifo[3][1] = x+offset, .fifo[3][2] = x+2*offset,
>>>> .fifo[3][3] = x+3*offset). But gdb says:
>>>>
>>>> (gdb) print mca_btl_sm_component.fifo[j][my_smp_rank]
>>>> Cannot access memory at address 0x2a96b73050
>>>>
>>>
>>>
>>> Bah -- this is a red herring; this memory is in the shared memory
>>> segment, and that memory is not saved in the corefile. So of course
>>> gdb can't access it (I just did a short controlled test and proved
>>> this to myself).
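
A minimal sketch, assuming Linux and the default /proc/<pid>/coredump_filter
settings (this demo is not from the original thread), of why the shared-memory
segment does not show up in the core file: file-backed MAP_SHARED pages are
normally excluded from core dumps, so gdb reports "Cannot access memory" for
those addresses even though they were valid while the process was running.

    /* core_filter_demo.c -- illustrative only */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 4096;
        int fd = open("/tmp/core_filter_demo", O_RDWR | O_CREAT, 0600);
        ftruncate(fd, (off_t)len);

        /* File-backed shared mapping, like the sm BTL's mmap'ed segment */
        volatile char *seg = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
        seg[0] = 42;   /* valid access while the process is alive */

        /* Dump core; with the default coredump_filter this mapping is not
         * written out, so "print seg[0]" on the core fails in gdb. */
        abort();
    }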
>>>
>>> But I don't understand why I would have a bunch of tests that all
>>> segv at btl_sm_add_procs.c:529. :-(
>>>
>>> --
>>> Jeff Squyres
>>> Cisco Systems
>>>
>
>