Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-03-10 22:23:57


Hey Jeff

I seem to recall seeing the identical problem reported on the user
list not long ago... or it may have been the devel list. Anyway, it was
during btl_sm_add_procs, and the code was segv'ing.

I don't have the archives handy here, but perhaps you might search
them and see if there is a common theme here. IIRC, some of Eugene's
fixes impacted this problem.

Ralph

On Mar 10, 2009, at 7:49 PM, Jeff Squyres wrote:

> On Mar 10, 2009, at 9:13 PM, Jeff Squyres (jsquyres) wrote:
>
>> Doh -- I'm seeing intermittent sm btl failures on Cisco 1.3.1 MTT. :-(
>> I can't reproduce them manually, but they seem to only happen in a
>> very small fraction of overall MTT runs. I'm seeing at least 3
>> classes of errors:
>>
>> 1. btl_sm_add_procs.c:529 which is this:
>>
>> if(mca_btl_sm_component.fifo[j][my_smp_rank].head_lock != NULL) {
>>
>> j = 3, my_smp_rank = 1. mca_btl_sm_component.fifo[j][my_smp_rank]
>> appears to have a valid value in it (i.e., .fifo[3][0] = x,
>> .fifo[3][1] = x+offset, .fifo[3][2] = x+2*offset,
>> .fifo[3][3] = x+3*offset). But gdb says:
>>
>> (gdb) print mca_btl_sm_component.fifo[j][my_smp_rank]
>> Cannot access memory at address 0x2a96b73050
>>
>
>
> Bah -- this is a red herring; this memory is in the shared memory
> segment, and that memory is not saved in the corefile. So of course
> gdb can't access it (I just did a short controlled test and proved
> this to myself).
>
> But I don't understand why I would have a bunch of tests that all
> segv at btl_sm_add_procs.c:529. :-(
>
> --
> Jeff Squyres
> Cisco Systems
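
The red herring Jeff describes (gdb reporting "Cannot access memory" for an address inside the shared memory segment) can sometimes be avoided on Linux by asking the kernel to include shared mappings in the core file. The following is a minimal sketch, not something from the thread. It assumes a Linux kernel recent enough to provide /proc/self/coredump_filter (documented in core(5)) and assumes the sm segment is a file-backed shared mapping, which corresponds to bit 3 of the filter. The helper names are hypothetical.

```c
#include <stdio.h>

/* Bit 3 of /proc/<pid>/coredump_filter selects file-backed shared
 * mappings (see core(5)). Assumption: the sm segment is such a mapping. */
#define COREDUMP_FILE_SHARED_BIT 0x08u

/* Hypothetical helper: return the filter mask with bit 3 added. */
unsigned int with_file_backed_shared(unsigned int mask)
{
    return mask | COREDUMP_FILE_SHARED_BIT;
}

/* Hypothetical helper: read this process's current filter mask.
 * The kernel prints the mask as hexadecimal. Returns 0 on success. */
int read_coredump_filter(unsigned int *mask)
{
    FILE *f = fopen("/proc/self/coredump_filter", "r");
    if (f == NULL) {
        return -1;
    }
    int ok = (fscanf(f, "%x", mask) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Hypothetical helper: ask the kernel to also dump file-backed shared
 * mappings for this process. The setting is inherited by children.
 * Returns 0 on success, -1 if the file is missing or not writable. */
int enable_shared_mappings_in_core(void)
{
    unsigned int mask;
    if (read_coredump_filter(&mask) != 0) {
        return -1;
    }
    FILE *f = fopen("/proc/self/coredump_filter", "w");
    if (f == NULL) {
        return -1;
    }
    int ok = (fprintf(f, "0x%x\n", with_file_backed_shared(mask)) > 0);
    if (fclose(f) != 0) {
        ok = 0;
    }
    return ok ? 0 : -1;
}
```

Calling enable_shared_mappings_in_core() early in a test program (or writing the mask from the launching shell, since the setting is inherited across fork and exec) should make the segment contents visible to gdb in later core files, at the cost of larger cores. With the documented default mask of 0x33, the resulting mask would be 0x3b. Whether this applies to the systems in this thread is not known.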