Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-03-10 22:23:57


Hey Jeff

I seem to recall seeing the identical problem reported on the user
list not long ago... or it may have been the devel list. Anyway, it was
during btl_sm_add_procs, and the code was segv'ing.

I don't have the archives handy here, but perhaps you might search
them and see if there is a common theme. IIRC, some of Eugene's
fixes impacted this problem.

Ralph

On Mar 10, 2009, at 7:49 PM, Jeff Squyres wrote:

> On Mar 10, 2009, at 9:13 PM, Jeff Squyres (jsquyres) wrote:
>
>> Doh -- I'm seeing intermittent sm btl failures on Cisco 1.3.1
>> MTT. :-(  I can't reproduce them manually, but they seem to only happen in a
>> very small fraction of overall MTT runs. I'm seeing at least 3
>> classes of errors:
>>
>> 1. btl_sm_add_procs.c:529 which is this:
>>
>> if(mca_btl_sm_component.fifo[j][my_smp_rank].head_lock != NULL) {
>>
>> j = 3, my_smp_rank = 1. mca_btl_sm_component.fifo[j][my_smp_rank]
>> appears to have a valid value in it (i.e., .fifo[3][0] = x,
>> .fifo[3][1] = x+offset, .fifo[3][2] = x+2*offset, .fifo[3][3] = x+3*offset).
>> But gdb says:
>>
>> (gdb) print mca_btl_sm_component.fifo[j][my_smp_rank]
>> Cannot access memory at address 0x2a96b73050
>>
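For context, here is a minimal sketch of the kind of peer-readiness check
that line 529 appears to be doing, and of where a dereference into the shared
segment could fault. This is not the actual Open MPI source; the type and
function names below (shm_fifo_t, peer_fifo_ready) are made up for
illustration.

/* Sketch only -- hypothetical names, not the Open MPI code. */
#include <stddef.h>
#include <stdbool.h>

typedef struct {
    volatile void *head_lock;  /* set non-NULL by the owning peer once its
                                  FIFO is initialized */
    /* ... head/tail indices, locks, etc. ... */
} shm_fifo_t;

/* fifo[j] points at peer j's table of FIFOs inside the mmap'd shared
 * segment (one FIFO per sending rank), laid out at regular offsets --
 * the x, x+offset, x+2*offset pattern above.  None of this memory is
 * in the process's private heap. */
static bool peer_fifo_ready(shm_fifo_t **fifo, int j, int my_smp_rank)
{
    /* This load touches fifo[j][my_smp_rank] inside the shared segment.
     * If fifo[j] is stale, or the segment is not mapped (or not yet
     * grown to that size) in this process, the read itself is where a
     * SIGSEGV would show up -- i.e., at a line like 529. */
    return fifo[j][my_smp_rank].head_lock != NULL;
}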
>
>
> Bah -- this is a red herring; this memory is in the shared memory
> segment, and that memory is not saved in the corefile. So of course
> gdb can't access it (I just did a short controlled test and proved
> this to myself).
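>
> For anyone who wants to see the same thing, a sketch of that sort of
> controlled test (not necessarily the exact test I ran; assumes Linux,
> core dumps enabled with "ulimit -c unlimited", and an arbitrary file
> name) would be:
>
> #include <fcntl.h>
> #include <stdlib.h>
> #include <sys/mman.h>
> #include <unistd.h>
>
> int main(void)
> {
>     int fd = open("/tmp/shm-core-test", O_RDWR | O_CREAT, 0600);
>     ftruncate(fd, 4096);
>
>     /* File-backed shared mapping, like the sm BTL's mmap'd backing file. */
>     int *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
>                        MAP_SHARED, fd, 0);
>     *shared = 0x1234;
>
>     abort();  /* dump core; then "gdb ./a.out core" and "print *shared"
>                  reports "Cannot access memory at address ...", because
>                  file-backed shared mappings are not written to the core
>                  by default */
>     return 0;
> }
>
> If memory serves, Linux's /proc/<pid>/coredump_filter controls this
> (bit 3 covers file-backed shared mappings), so raising that before the
> run would be one way to get the sm segment into the corefile.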
>
> But I don't understand why I would have a bunch of tests that all
> segv at btl_sm_add_procs.c:529. :-(
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel