Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] MTT tests: segv's with sm on large messages
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2009-05-05 18:01:36


Different from what?

You and Terry saw something that occurred about 0.01% of the time
during MPI_Init, in add_procs. That does not seem to be what we are
seeing here.

But we have seen failures in 1.3.1 and 1.3.2 that look like the one
here. They show up more like 1% of the time, either during MPI_Init
*OR* later during a collective call. What we're looking at here seems
to be related. E.g., see
http://www.open-mpi.org/community/lists/devel/2009/03/5768.php
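
For reference, the stack below starts in a root-to-all MPI_Sendrecv
model test with large messages and dies inside a barrier. A
stripped-down loop along those lines (the message size, iteration
count, and overall structure here are my guesses, not the actual MTT
test source) is roughly what I would start from for a standalone
reproducer:

/* Hypothetical standalone reproducer sketch -- not the actual MTT test.
 * Rank 0 exchanges a large message with every other rank via
 * MPI_Sendrecv, then everyone enters a barrier; repeat many times to
 * try to catch an intermittent sm BTL failure. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_BYTES (4 * 1024 * 1024)   /* "large message" size is a guess */
#define ITERS     1000                /* repeat to tickle a ~1% failure  */

int main(int argc, char **argv)
{
    int rank, size, i, peer;
    char *sendbuf, *recvbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendbuf = malloc(MSG_BYTES);      /* contents don't matter here */
    recvbuf = malloc(MSG_BYTES);

    for (i = 0; i < ITERS; i++) {
        if (rank == 0) {
            for (peer = 1; peer < size; peer++) {
                MPI_Sendrecv(sendbuf, MSG_BYTES, MPI_CHAR, peer, 0,
                             recvbuf, MSG_BYTES, MPI_CHAR, peer, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
        } else {
            MPI_Sendrecv(sendbuf, MSG_BYTES, MPI_CHAR, 0, 0,
                         recvbuf, MSG_BYTES, MPI_CHAR, 0, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        MPI_Barrier(MPI_COMM_WORLD);  /* the frame where the segv shows up */
    }

    free(sendbuf);
    free(recvbuf);
    if (rank == 0) printf("done\n");
    MPI_Finalize();
    return 0;
}

Running that repeatedly with the sm BTL forced, e.g.
"mpirun -np 4 --mca btl sm,self ./sendrecv_rtoa", might tell us whether
these failures are really MTT-specific or just rare.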

Jeff Squyres wrote:

> Hmm -- this looks like a different error to me.
>
> The sm error we were seeing at a <1% rate was in MPI_INIT. This
> looks like it is beyond MPI_INIT and in the sending path...?
>
> On May 4, 2009, at 11:00 AM, Eugene Loh wrote:
>
>> Ralph Castain wrote:
>>
>> > In reviewing last night's MTT tests for the 1.3 branch, I am seeing
>> > several segfault failures in the shared memory BTL when using large
>> > messages. This occurred on both IU's sif machine and on Sun's tests.
>> >
>> > Here is a typical stack from MTT:
>> >
>> > MPITEST info (0): Starting MPI_Sendrecv: Root to all model test
>> > [burl-ct-v20z-13:14699] *** Process received signal ***
>> > [burl-ct-v20z-13:14699] Signal: Segmentation fault (11)
>> > [burl-ct-v20z-13:14699] Signal code: (128)
>> > [burl-ct-v20z-13:14699] Failing at address: (nil)
>> > [burl-ct-v20z-13:14699] [ 0] /lib64/tls/libpthread.so.0 [0x2a960bc720]
>> > [burl-ct-v20z-13:14699] [ 1] /workspace/.../lib/lib64/openmpi/mca_btl_sm.so(mca_btl_sm_send+0x7b) [0x2a9786a7d3]
>> > [burl-ct-v20z-13:14699] [ 2] /workspace/.../lib/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x5b2) [0x2a97453942]
>> > [burl-ct-v20z-13:14699] [ 3] /workspace/.../lib/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_isend+0x4f2) [0x2a9744b446]
>> > [burl-ct-v20z-13:14699] [ 4] /workspace/.../lib/lib64/openmpi/mca_coll_tuned.so(ompi_coll_tuned_sendrecv_actual+0x7e) [0x2a98120bca]
>> > [burl-ct-v20z-13:14699] [ 5] /workspace/.../lib/lib64/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_recursivedoubling+0x119) [0x2a9812b111]
>> > [burl-ct-v20z-13:14699] [ 6] /workspace/.../lib/lib64/libmpi.so.0(PMPI_Barrier+0x8e) [0x2a9584ca42]
>> > [burl-ct-v20z-13:14699] [ 7] src/MPI_Sendrecv_rtoa_c [0x403009]
>> > [burl-ct-v20z-13:14699] [ 8] /lib64/tls/libc.so.6(__libc_start_main+0xea) [0x2a961e0aaa]
>> > [burl-ct-v20z-13:14699] [ 9] src/MPI_Sendrecv_rtoa_c(strtok+0x66) [0x4019f2]
>> > [burl-ct-v20z-13:14699] *** End of error message ***
>> >
>> > --------------------------------------------------------------------------
>> >
>> > Seems like this is something we need to address before release - yes?
>>
>> I don't know if this needs to be addressed before release, but it was
>> my impression that we've been living with these errors for a long
>> time. They're intermittent (a ~1% incidence rate?) and the stacks come
>> through coll_tuned or coll_hierarch or something and end up in the sm
>> BTL. We discussed them not too long ago on this list. They predate
>> 1.3.2. I think Terry said they seem hard to reproduce outside of MTT.
>> (Terry is out this week.)
>>
>> Anyhow, my impression was that these were not new with this release.
>> It would be nice to get them off the books in any case. We need to
>> figure out how to improve reproducibility and then dive into the
>> coll/sm code.
>
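
FWIW, for anyone trying to follow the coll_tuned frame in that stack:
recursive doubling runs the barrier as log2(P) rounds of pairwise
exchanges, and each exchange goes down the same isend/sm send path
shown in frames 1-4. A rough illustration in plain point-to-point MPI
(just a sketch of the exchange pattern, not Open MPI's
ompi_coll_tuned_barrier_intra_recursivedoubling, and it assumes a
power-of-two number of ranks) would be:

/* Sketch of a recursive-doubling barrier -- illustration only, not the
 * Open MPI implementation.  Assumes the communicator size is a power
 * of two. */
#include <mpi.h>

static void barrier_recursive_doubling(MPI_Comm comm)
{
    int rank, size, mask, partner;
    char sendtok = 0, recvtok = 0;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* In round k, exchange a tiny token with rank XOR 2^k.  After
     * log2(size) rounds every rank has (transitively) heard from every
     * other rank, so nobody leaves before everyone has arrived. */
    for (mask = 1; mask < size; mask <<= 1) {
        partner = rank ^ mask;
        MPI_Sendrecv(&sendtok, 1, MPI_CHAR, partner, 0,
                     &recvtok, 1, MPI_CHAR, partner, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}

If I'm reading the stack right, the send that actually faults is one of
these small barrier-round messages (frame 1, mca_btl_sm_send), not the
large payload itself.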