Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] MTT tests: segv's with sm on large messages
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2009-05-05 18:01:36


Different from what?

You and Terry saw something that occurred about 0.01% of the time
during MPI_Init, in add_procs. That does not seem to be what we are
seeing here.

But we have seen failures in 1.3.1 and 1.3.2 that look like the one
here. They occur more like 1% of the time, either during MPI_Init *OR*
later during a collective call. What we're looking at here seems to be
related. E.g., see
http://www.open-mpi.org/community/lists/devel/2009/03/5768.php
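
If anyone wants to poke at this outside of MTT, below is a rough sketch
of the kind of pattern the failing test exercises: root-to-all
MPI_Sendrecv with a large message, followed by an MPI_Barrier, looped so
an intermittent failure has a chance to appear. It is a hypothetical
stand-in, not the actual MPITEST source; the message size, iteration
count, and run line are guesses.

/*
 * Hypothetical stress reproducer: NOT the actual MPITEST source.
 * Root exchanges a large message with every other rank via MPI_Sendrecv,
 * then everyone hits MPI_Barrier, repeated many times so a ~1% failure
 * has a chance to show up. Message size, iteration count, and the run
 * line below are assumptions.
 *
 *   mpicc sendrecv_stress.c -o sendrecv_stress
 *   mpirun -np 8 --mca btl self,sm ./sendrecv_stress
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MSG_BYTES (4 * 1024 * 1024)   /* "large" message (assumed size) */
#define ITERS     1000                /* enough repeats to catch ~1%    */

int main(int argc, char **argv)
{
    int rank, size, i, peer;
    char *sendbuf = NULL, *recvbuf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendbuf = malloc(MSG_BYTES);
    recvbuf = malloc(MSG_BYTES);
    memset(sendbuf, rank, MSG_BYTES);

    for (i = 0; i < ITERS; i++) {
        if (rank == 0) {
            /* root-to-all: exchange the large message with each rank */
            for (peer = 1; peer < size; peer++) {
                MPI_Sendrecv(sendbuf, MSG_BYTES, MPI_CHAR, peer, 0,
                             recvbuf, MSG_BYTES, MPI_CHAR, peer, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
        } else {
            MPI_Sendrecv(sendbuf, MSG_BYTES, MPI_CHAR, 0, 0,
                         recvbuf, MSG_BYTES, MPI_CHAR, 0, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        /* the quoted stack dies under the barrier's internal sendrecv */
        MPI_Barrier(MPI_COMM_WORLD);
    }

    if (rank == 0) {
        printf("completed %d iterations without a segfault\n", ITERS);
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}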

Jeff Squyres wrote:

> Hmm -- this looks like a different error to me.
>
> The sm error we were seeing at a <1% rate was in MPI_INIT. This
> looks like it is beyond MPI_INIT and in the sending path...?
>
> On May 4, 2009, at 11:00 AM, Eugene Loh wrote:
>
>> Ralph Castain wrote:
>>
>> > In reviewing last night's MTT tests for the 1.3 branch, I am seeing
>> > several segfault failures in the shared memory BTL when using large
>> > messages. This occurred on both IU's sif machine and on Sun's tests.
>> >
>> > Here is a typical stack from MTT:
>> >
>> > MPITEST info (0): Starting MPI_Sendrecv: Root to all model test
>> > [burl-ct-v20z-13:14699] *** Process received signal ***
>> > [burl-ct-v20z-13:14699] Signal: Segmentation fault (11)
>> > [burl-ct-v20z-13:14699] Signal code: (128)
>> > [burl-ct-v20z-13:14699] Failing at address: (nil)
>> > [burl-ct-v20z-13:14699] [ 0] /lib64/tls/libpthread.so.0 [0x2a960bc720]
>> > [burl-ct-v20z-13:14699] [ 1] /workspace/.../lib/lib64/openmpi/mca_btl_sm.so(mca_btl_sm_send+0x7b) [0x2a9786a7d3]
>> > [burl-ct-v20z-13:14699] [ 2] /workspace/.../lib/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x5b2) [0x2a97453942]
>> > [burl-ct-v20z-13:14699] [ 3] /workspace/.../lib/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_isend+0x4f2) [0x2a9744b446]
>> > [burl-ct-v20z-13:14699] [ 4] /workspace/.../lib/lib64/openmpi/mca_coll_tuned.so(ompi_coll_tuned_sendrecv_actual+0x7e) [0x2a98120bca]
>> > [burl-ct-v20z-13:14699] [ 5] /workspace/.../lib/lib64/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_recursivedoubling+0x119) [0x2a9812b111]
>> > [burl-ct-v20z-13:14699] [ 6] /workspace/.../lib/lib64/libmpi.so.0(PMPI_Barrier+0x8e) [0x2a9584ca42]
>> > [burl-ct-v20z-13:14699] [ 7] src/MPI_Sendrecv_rtoa_c [0x403009]
>> > [burl-ct-v20z-13:14699] [ 8] /lib64/tls/libc.so.6(__libc_start_main+0xea) [0x2a961e0aaa]
>> > [burl-ct-v20z-13:14699] [ 9] src/MPI_Sendrecv_rtoa_c(strtok+0x66) [0x4019f2]
>> > [burl-ct-v20z-13:14699] *** End of error message ***
>> >
>> > --------------------------------------------------------------------------
>> >
>> > Seems like this is something we need to address before release - yes?
>>
>> I don't know if this needs to be addressed before release, but my
>> impression is that we've been living with these errors for a long time.
>> They're intermittent (roughly a 1% incidence rate?), and the stacks come
>> through coll_tuned or coll_hierarch or something and end up in the sm
>> BTL. We discussed them not too long ago on this list. They predate
>> 1.3.2. I think Terry said they seem hard to reproduce outside of MTT.
>> (Terry is out this week.)
>>
>> Anyhow, my impression was that these were not new with this release.
>> It would be nice to get them off the books in any case. We need to
>> figure out how to improve reproducibility and then dive into the
>> coll/sm code.
>