
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco
From: Terry Dontje (Terry.Dontje_at_[hidden])
Date: 2009-03-11 06:29:47


Hey!!! I ran into this problem many months ago, but it's been so elusive
that I haven't nailed it down. The first time we saw this was last
October. I did some MTT gleaning and could not find anyone but Solaris
having this issue under MTT. What's interesting is that when I gleaned
Sun's MTT results, I could not find any of these failures going back as
far as last October.

What it looked like to me was that the shared memory segment might not
have been initialized with 0's, allowing one of the processes to start
accessing entries that did not yet contain a valid address. However,
when I was looking at this I was told the mmap'ed file was created with
ftruncate, which essentially should zero-fill the memory. So I was at a
loss as to why this was happening.

For a little while I was able to reproduce this by manually setting up
a script that ran a small np=2 program over and over for somewhere
under 3-4 days. But around November I could no longer reproduce the
issue even after 4 days of runs, so I threw up my hands until I could
find more failures under MTT, which for Sun I haven't.

Note that I was able to reproduce this issue on both SPARC-based and
Intel-based platforms.

--td

Ralph Castain wrote:
> Hey Jeff
>
> I seem to recall seeing the identical problem reported on the user
> list not long ago...or it may have been the devel list. Anyway, it was
> during btl_sm_add_procs, and the code was segv'ing.
>
> I don't have the archives handy here, but perhaps you might search
> them and see if there is a common theme here. IIRC, some of Eugene's
> fixes impacted this problem.
>
> Ralph
>
>
> On Mar 10, 2009, at 7:49 PM, Jeff Squyres wrote:
>
>> On Mar 10, 2009, at 9:13 PM, Jeff Squyres (jsquyres) wrote:
>>
>>> Doh -- I'm seeing intermittent sm btl failures on Cisco 1.3.1 MTT. :-(
>>> I can't reproduce them manually, but they seem to only happen in a
>>> very small fraction of overall MTT runs. I'm seeing at least 3
>>> classes of errors:
>>>
>>> 1. btl_sm_add_procs.c:529, which is this:
>>>
>>>     if (mca_btl_sm_component.fifo[j][my_smp_rank].head_lock != NULL) {
>>>
>>> j = 3, my_smp_rank = 1. mca_btl_sm_component.fifo[j][my_smp_rank]
>>> appears to have a valid value in it (i.e., .fifo[3][0] = x,
>>> .fifo[3][1] = x+offset, .fifo[3][2] = x+2*offset,
>>> .fifo[3][3] = x+3*offset).
>>> But gdb says:
>>>
>>> (gdb) print mca_btl_sm_component.fifo[j][my_smp_rank]
>>> Cannot access memory at address 0x2a96b73050
>>>
>>
>>
>> Bah -- this is a red herring; this memory is in the shared memory
>> segment, and that memory is not saved in the corefile. So of course
>> gdb can't access it (I just did a short controlled test and proved
>> this to myself).
>>
>> But I don't understand why I would have a bunch of tests that all
>> segv at btl_sm_add_procs.c:529. :-(
>>
>> --
>> Jeff Squyres
>> Cisco Systems