On Mar 10, 2009, at 9:13 PM, Jeff Squyres (jsquyres) wrote:
> Doh -- I'm seeing intermittent sm btl failures on Cisco 1.3.1 MTT. :-(
> I can't reproduce them manually, but they seem to only happen in a
> very small fraction of overall MTT runs. I'm seeing at least 3
> classes of errors:
>
> 1. btl_sm_add_procs.c:529 which is this:
>
> if(mca_btl_sm_component.fifo[j][my_smp_rank].head_lock !=
> NULL) {
>
> j = 3, my_smp_rank = 1. mca_btl_sm_component.fifo[j][my_smp_rank]
> appears to have a valid value in it (i.e., .fifo[3][0] = x,
> .fifo[3][1] = x+offset, .fifo[3][2] = x+2*offset, .fifo[3][3] =
> x+3*offset). But gdb says:
>
> (gdb) print mca_btl_sm_component.fifo[j][my_smp_rank]
> Cannot access memory at address 0x2a96b73050
>
Bah -- this is a red herring; this memory is in the shared memory
segment, and that memory is not saved in the corefile. So of course
gdb can't access it (I just did a short controlled test and proved
this to myself).
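
For anyone who wants to check this on their own machine, here is a rough sketch and not anything from the Open MPI tree. It assumes a Linux kernel new enough to expose /proc/self/coredump_filter (see core(5)); older kernels, possibly including the ones on these MTT nodes, do not have that file. The helper names are made up for illustration. In that mask, bit 3 controls whether file-backed shared mappings are written to the corefile, and I am assuming the sm backing file counts as that kind of mapping.

```c
#include <stdio.h>

/* Bit 3 of the Linux coredump_filter mask selects file-backed shared
 * mappings (per core(5)). */
#define COREDUMP_FILE_SHARED_BIT 3

/* Hypothetical helper: returns 1 if the given mask would include
 * file-backed shared mappings in a core dump, 0 otherwise. */
int dumps_file_backed_shared(unsigned long mask)
{
    return (int)((mask >> COREDUMP_FILE_SHARED_BIT) & 1UL);
}

/* Hypothetical helper: reads the calling process's coredump_filter mask.
 * Returns 0 on success, -1 if the file is missing or unparsable
 * (e.g., on a kernel that predates coredump_filter). */
int read_coredump_filter(unsigned long *mask)
{
    FILE *fp = fopen("/proc/self/coredump_filter", "r");
    if (fp == NULL) {
        return -1;
    }
    /* The kernel reports the mask as hexadecimal without a 0x prefix. */
    int rc = (fscanf(fp, "%lx", mask) == 1) ? 0 : -1;
    fclose(fp);
    return rc;
}
```

If read_coredump_filter() succeeds and dumps_file_backed_shared() returns 0 for the mask, a corefile from that process would leave out such mappings, which would match what gdb shows above.
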
But I don't understand why I would have a bunch of tests that all segv
at btl_sm_add_procs.c:529. :-(
--
Jeff Squyres
Cisco Systems