
Subject: Re: [OMPI devel] MTT tests: segv's with sm on large messages
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-05-05 17:39:42


Hmm -- this looks like a different error to me.

The sm error with the <1% failure rate that we were seeing was in
MPI_INIT. This one looks like it is beyond MPI_INIT, in the sending
path...?
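
For reference, a minimal reproducer along the lines of what that MPITEST
"Root to all" test seems to exercise might look like the sketch below.
This is hypothetical code, not the actual MPITEST source; the program
name, message size, and iteration count are just assumptions. The idea:
the root exchanges a large message with every other rank via
MPI_Sendrecv, everyone hits a barrier, and the loop repeats enough times
that a ~1% intermittent failure has a chance to show up. Running it on a
single node with something like "mpirun -np 4 --mca btl sm,self
./sendrecv_rtoa" should keep the traffic on the sm BTL.

/* Hypothetical reproducer sketch -- not the actual MPITEST source. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

#define MSG_BYTES (4 * 1024 * 1024)   /* "large" message size (assumption) */
#define ITERS     200                 /* enough iterations to catch a ~1% failure */

int main(int argc, char **argv)
{
    int rank, size, i, peer;
    char *sbuf, *rbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sbuf = malloc(MSG_BYTES);
    rbuf = malloc(MSG_BYTES);
    memset(sbuf, rank, MSG_BYTES);

    for (i = 0; i < ITERS; i++) {
        if (rank == 0) {
            /* Root exchanges a large message with every other rank. */
            for (peer = 1; peer < size; peer++) {
                MPI_Sendrecv(sbuf, MSG_BYTES, MPI_CHAR, peer, 0,
                             rbuf, MSG_BYTES, MPI_CHAR, peer, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
        } else {
            /* Everyone else exchanges with the root. */
            MPI_Sendrecv(sbuf, MSG_BYTES, MPI_CHAR, 0, 0,
                         rbuf, MSG_BYTES, MPI_CHAR, 0, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        /* The MTT stack shows the segv under a barrier following the
         * sendrecv phase, so keep the barrier inside the loop. */
        MPI_Barrier(MPI_COMM_WORLD);
    }

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}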

On May 4, 2009, at 11:00 AM, Eugene Loh wrote:

> Ralph Castain wrote:
>
> > In reviewing last night's MTT tests for the 1.3 branch, I am seeing
> > several segfault failures in the shared memory BTL when using large
> > messages. These occurred on both IU's sif machine and in Sun's tests.
> >
> > Here is a typical stack from MTT:
> >
> > MPITEST info (0): Starting MPI_Sendrecv: Root to all model test
> > [burl-ct-v20z-13:14699] *** Process received signal ***
> > [burl-ct-v20z-13:14699] Signal: Segmentation fault (11)
> > [burl-ct-v20z-13:14699] Signal code: (128)
> > [burl-ct-v20z-13:14699] Failing at address: (nil)
> > [burl-ct-v20z-13:14699] [ 0] /lib64/tls/libpthread.so.0 [0x2a960bc720]
> > [burl-ct-v20z-13:14699] [ 1] /workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/ZCcL/install/lib/lib64/openmpi/mca_btl_sm.so(mca_btl_sm_send+0x7b) [0x2a9786a7d3]
> > [burl-ct-v20z-13:14699] [ 2] /workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/ZCcL/install/lib/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x5b2) [0x2a97453942]
> > [burl-ct-v20z-13:14699] [ 3] /workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/ZCcL/install/lib/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_isend+0x4f2) [0x2a9744b446]
> > [burl-ct-v20z-13:14699] [ 4] /workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/ZCcL/install/lib/lib64/openmpi/mca_coll_tuned.so(ompi_coll_tuned_sendrecv_actual+0x7e) [0x2a98120bca]
> > [burl-ct-v20z-13:14699] [ 5] /workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/ZCcL/install/lib/lib64/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_recursivedoubling+0x119) [0x2a9812b111]
> > [burl-ct-v20z-13:14699] [ 6] /workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/ZCcL/install/lib/lib64/libmpi.so.0(PMPI_Barrier+0x8e) [0x2a9584ca42]
> > [burl-ct-v20z-13:14699] [ 7] src/MPI_Sendrecv_rtoa_c [0x403009]
> > [burl-ct-v20z-13:14699] [ 8] /lib64/tls/libc.so.6(__libc_start_main+0xea) [0x2a961e0aaa]
> > [burl-ct-v20z-13:14699] [ 9] src/MPI_Sendrecv_rtoa_c(strtok+0x66) [0x4019f2]
> > [burl-ct-v20z-13:14699] *** End of error message ***
> > [burl-ct-v20z-12][[13280,1],0][btl_tcp_endpoint.c:456:mca_btl_tcp_endpoint_recv_blocking] recv(13) failed: Connection reset by peer (104)
> > --------------------------------------------------------------------------
> > mpirun noticed that process rank 2 with PID 14699 on node burl-ct-v20z-13 exited on signal 11 (Segmentation fault).
> > --------------------------------------------------------------------------
> >
> > Seems like this is something we need to address before release - yes?
>
> I don't know if this needs to be addressed before release, but it was
> my impression that we've been living with these errors for a long time.
> They're intermittent (roughly a 1% incidence rate?) and the stacks come
> through coll_tuned or coll_hierarch or something and end up in the sm
> BTL. We discussed them not too long ago on this list. They predate
> 1.3.2. I think Terry said they seem hard to reproduce outside of MTT.
> (Terry is out this week.)
>
> Anyhow, my impression was that these were not new with this release.
> It would be nice to get them off the books in any case. We need to
> figure out how to improve reproducibility and then dive into the
> coll/sm stuff.
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
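
As an aside on the stack above: ompi_coll_tuned_barrier_intra_recursivedoubling
pairs ranks at power-of-two distances each round, which is why PMPI_Barrier
ends up in ompi_coll_tuned_sendrecv_actual, the ob1 PML, and then the sm BTL
for on-node peers. A rough sketch of that exchange pattern (simplified; this
is not the actual coll_tuned code, which also handles non-power-of-two
communicator sizes) looks like this:

/* Simplified sketch of a recursive-doubling barrier over MPI_Sendrecv;
 * assumes a power-of-two number of ranks. */
#include <mpi.h>

static void barrier_recursive_doubling(MPI_Comm comm)
{
    int rank, size, mask;
    char sdummy = 0, rdummy = 0;   /* zero-byte payloads */

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (mask = 1; mask < size; mask <<= 1) {
        int peer = rank ^ mask;    /* partner at distance 2^k this round */
        /* Each round's exchange goes through the PML and, for on-node
         * peers, the sm BTL -- matching the stack frames above. */
        MPI_Sendrecv(&sdummy, 0, MPI_BYTE, peer, 0,
                     &rdummy, 0, MPI_BYTE, peer, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}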

-- 
Jeff Squyres
Cisco Systems