Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] MTT tests: segv's with sm on large messages
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-05-05 17:39:42


Hmm -- this looks like a different error to me.

The sm error we were seeing at a <1% rate was in MPI_INIT. This one looks like it is beyond MPI_INIT, in the sending path...?
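
For reference, here is a rough sketch (not the actual MTT test source; the buffer size and exchange pattern are assumptions based on the test name in the trace) of the MPI-level sequence that stack implies: a large-message root-to-all MPI_Sendrecv over sm, followed by an MPI_Barrier whose tuned implementation does its own internal sendrecv -- which is where mca_btl_sm_send faults.

/* Rough sketch only -- not the MTT test itself; the message size and the
 * root-to-all pattern are assumptions based on the test name in the trace. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    const int count = 4 * 1024 * 1024;      /* "large message": 4M ints (16 MB), assumed */
    int *sendbuf = malloc(count * sizeof(int));
    int *recvbuf = malloc(count * sizeof(int));

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Root exchanges a large message with every other rank ("root to all"). */
    if (rank == 0) {
        for (int peer = 1; peer < size; ++peer) {
            MPI_Sendrecv(sendbuf, count, MPI_INT, peer, 0,
                         recvbuf, count, MPI_INT, peer, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    } else {
        MPI_Sendrecv(sendbuf, count, MPI_INT, 0, 0,
                     recvbuf, count, MPI_INT, 0, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* The failing frames in the trace sit under this barrier: tuned
     * recursive-doubling barrier -> sendrecv_actual -> ob1 isend ->
     * mca_btl_sm_send. */
    MPI_Barrier(MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}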

On May 4, 2009, at 11:00 AM, Eugene Loh wrote:

> Ralph Castain wrote:
>
> > In reviewing last night's MTT tests for the 1.3 branch, I am seeing
> > several segfault failures in the shared memory BTL when using large
> > messages. This occurred on both IU's sif machine and on Sun's tests.
> >
> > Here is a typical stack from MTT:
> >
> > MPITEST info (0): Starting MPI_Sendrecv: Root to all model test
> > [burl-ct-v20z-13:14699] *** Process received signal ***
> > [burl-ct-v20z-13:14699] Signal: Segmentation fault (11)
> > [burl-ct-v20z-13:14699] Signal code: (128)
> > [burl-ct-v20z-13:14699] Failing at address: (nil)
> > [burl-ct-v20z-13:14699] [ 0] /lib64/tls/libpthread.so.0 [0x2a960bc720]
> > [burl-ct-v20z-13:14699] [ 1] /workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/ZCcL/install/lib/lib64/openmpi/mca_btl_sm.so(mca_btl_sm_send+0x7b) [0x2a9786a7d3]
> > [burl-ct-v20z-13:14699] [ 2] /workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/ZCcL/install/lib/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x5b2) [0x2a97453942]
> > [burl-ct-v20z-13:14699] [ 3] /workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/ZCcL/install/lib/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_isend+0x4f2) [0x2a9744b446]
> > [burl-ct-v20z-13:14699] [ 4] /workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/ZCcL/install/lib/lib64/openmpi/mca_coll_tuned.so(ompi_coll_tuned_sendrecv_actual+0x7e) [0x2a98120bca]
> > [burl-ct-v20z-13:14699] [ 5] /workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/ZCcL/install/lib/lib64/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_recursivedoubling+0x119) [0x2a9812b111]
> > [burl-ct-v20z-13:14699] [ 6] /workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/ZCcL/install/lib/lib64/libmpi.so.0(PMPI_Barrier+0x8e) [0x2a9584ca42]
> > [burl-ct-v20z-13:14699] [ 7] src/MPI_Sendrecv_rtoa_c [0x403009]
> > [burl-ct-v20z-13:14699] [ 8] /lib64/tls/libc.so.6(__libc_start_main+0xea) [0x2a961e0aaa]
> > [burl-ct-v20z-13:14699] [ 9] src/MPI_Sendrecv_rtoa_c(strtok+0x66) [0x4019f2]
> > [burl-ct-v20z-13:14699] *** End of error message ***
> > [burl-ct-v20z-12][[13280,1],0][btl_tcp_endpoint.c:456:mca_btl_tcp_endpoint_recv_blocking] recv(13) failed: Connection reset by peer (104)
> >
> > --------------------------------------------------------------------------
> > mpirun noticed that process rank 2 with PID 14699 on node burl-ct-v20z-13 exited on signal 11 (Segmentation fault).
> > --------------------------------------------------------------------------
> >
> >
> > Seems like this is something we need to address before release - yes?
>
> I don't know if this needs to be addressed before release, but it was my
> impression that we've been living with these errors for a long time.
> They're intermittent (1% incidence rate????) and the stacks come through
> coll_tuned or coll_hierarch or something and end up in the sm BTL. We
> discussed them not too long ago on this list. They predate 1.3.2. I
> think Terry said they seem hard to reproduce outside of MTT. (Terry is
> out this week.)
>
> Anyhow, my impression was that these were not new with this release.
> It would be nice to get them off the books in any case. We need to figure
> out how to improve reproducibility and then dive into the coll/sm stuff.
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
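
On the reproducibility point in Eugene's last paragraph: one possible approach (a sketch only -- the iteration count, message sizes, and ring pattern below are assumptions, not a verified reproducer) is to loop a large-message sendrecv plus barrier many times in a single run with all ranks on one node, so that a ~1% intermittent failure has a decent chance of firing outside of MTT:

/* Reproducibility sketch only -- the iteration count, message sizes, and
 * ring exchange pattern are assumptions, not the MTT test itself.  The idea
 * is to hammer the same sm send path (large-message sendrecv followed by a
 * barrier) many times in one run. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    const int max_count = 4 * 1024 * 1024;   /* up to 4M ints (16 MB) per message, assumed */
    int *sendbuf = malloc(max_count * sizeof(int));
    int *recvbuf = malloc(max_count * sizeof(int));

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;
    int left  = (rank + size - 1) % size;

    for (int iter = 0; iter < 1000; ++iter) {
        /* Alternate message sizes so both small (eager) and large
         * (rendezvous) transfers go through the sm BTL. */
        int count = (iter % 2) ? max_count : 1;
        MPI_Sendrecv(sendbuf, count, MPI_INT, right, iter,
                     recvbuf, count, MPI_INT, left,  iter,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* The failing frame in the trace sits under the barrier's
         * internal sendrecv. */
        MPI_Barrier(MPI_COMM_WORLD);
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

Running several ranks on a single node should keep all of the traffic on the sm BTL.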

-- 
Jeff Squyres
Cisco Systems