
Open MPI Development Mailing List Archives


Subject: [OMPI devel] MTT tests: segv's with sm on large messages
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-05-04 07:33:05


Hi folks

Reviewing last night's MTT tests for the 1.3 branch, I am seeing
several segfault failures in the shared memory BTL when using large
messages. This occurred both on IU's sif machine and in Sun's tests.

Here is a typical stack from MTT:

MPITEST info (0): Starting MPI_Sendrecv: Root to all model test
[burl-ct-v20z-13:14699] *** Process received signal ***
[burl-ct-v20z-13:14699] Signal: Segmentation fault (11)
[burl-ct-v20z-13:14699] Signal code: (128)
[burl-ct-v20z-13:14699] Failing at address: (nil)
[burl-ct-v20z-13:14699] [ 0] /lib64/tls/libpthread.so.0 [0x2a960bc720]
[burl-ct-v20z-13:14699] [ 1] /workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/ZCcL/install/lib/lib64/openmpi/mca_btl_sm.so(mca_btl_sm_send+0x7b) [0x2a9786a7d3]
[burl-ct-v20z-13:14699] [ 2] /workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/ZCcL/install/lib/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x5b2) [0x2a97453942]
[burl-ct-v20z-13:14699] [ 3] /workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/ZCcL/install/lib/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_isend+0x4f2) [0x2a9744b446]
[burl-ct-v20z-13:14699] [ 4] /workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/ZCcL/install/lib/lib64/openmpi/mca_coll_tuned.so(ompi_coll_tuned_sendrecv_actual+0x7e) [0x2a98120bca]
[burl-ct-v20z-13:14699] [ 5] /workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/ZCcL/install/lib/lib64/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_recursivedoubling+0x119) [0x2a9812b111]
[burl-ct-v20z-13:14699] [ 6] /workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball-testing/installs/ZCcL/install/lib/lib64/libmpi.so.0(PMPI_Barrier+0x8e) [0x2a9584ca42]
[burl-ct-v20z-13:14699] [ 7] src/MPI_Sendrecv_rtoa_c [0x403009]
[burl-ct-v20z-13:14699] [ 8] /lib64/tls/libc.so.6(__libc_start_main+0xea) [0x2a961e0aaa]
[burl-ct-v20z-13:14699] [ 9] src/MPI_Sendrecv_rtoa_c(strtok+0x66) [0x4019f2]
[burl-ct-v20z-13:14699] *** End of error message ***
[burl-ct-v20z-12][[13280,1],0][btl_tcp_endpoint.c:456:mca_btl_tcp_endpoint_recv_blocking] recv(13) failed: Connection reset by peer (104)
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 14699 on node burl-ct-v20z-13 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
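Judging from the trace, the crash is in mca_btl_sm_send underneath the recursive-doubling barrier that follows the test's root-to-all exchange. Something along the lines of the sketch below should exercise the same path; note the message size, tag, and process count here are my guesses, not values taken from the MPITEST suite:

```c
/* Hypothetical reproducer sketch for the sm BTL segfault on large
 * messages. The 1 MiB message size and tag 0 are guesses, not values
 * from the MPITEST MPI_Sendrecv root-to-all test. Build with mpicc
 * and run with, e.g.: mpirun -np 4 --mca btl sm,self ./sendrecv_rtoa */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    const int count = 1 << 20;          /* 1 MiB per message */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *sendbuf = malloc(count);
    char *recvbuf = malloc(count);
    memset(sendbuf, rank, count);

    /* Root-to-all pattern: rank 0 exchanges a large message with every
     * other rank; each non-root rank exchanges with rank 0. */
    if (rank == 0) {
        for (i = 1; i < size; i++)
            MPI_Sendrecv(sendbuf, count, MPI_CHAR, i, 0,
                         recvbuf, count, MPI_CHAR, i, 0,
                         MPI_COMM_WORLD, &status);
    } else {
        MPI_Sendrecv(sendbuf, count, MPI_CHAR, 0, 0,
                     recvbuf, count, MPI_CHAR, 0, 0,
                     MPI_COMM_WORLD, &status);
    }

    /* The trace shows the segfault firing inside PMPI_Barrier's
     * recursive-doubling exchange, after the large sendrecvs. */
    MPI_Barrier(MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

If this does reproduce it, narrowing `count` down should tell us whether the failure tracks the sm BTL's eager/rendezvous cutover.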

Seems like this is something we need to address before release - yes?

Ralph