On Sat, Jun 9, 2012 at 3:35 PM, Eugene Loh <eugene.loh@oracle.com> wrote:
On 6/9/2012 12:06 PM, Eugene Loh wrote:
With r26565:
Enable orte progress threads and libevent thread support by default
Oracle MTT testing started showing new spawn_multiple failures.
Sorry. I meant loop_spawn.
(And then, starting, I think, in r26582, the problem is masked
behind another issue, "oob:ud:qp_init could not create queue
pair", which is causing widespread problems for Cisco, IU,
and Oracle MTT testing. I suppose that's the subject of a
different e-mail thread.)
I've only seen this in 64-bit. Here are two segfaults,
both from Linux/x86 systems running over TCP:
This one with GNU compilers:
[...]
parent: MPI_Comm_spawn #300 return : 0
[burl-ct-v20z-26:28518] *** Process received signal ***
[burl-ct-v20z-26:28518] Signal: Segmentation fault (11)
[burl-ct-v20z-26:28518] Signal code: Address not mapped (1)
[burl-ct-v20z-26:28518] Failing at address: (nil)
[burl-ct-v20z-26:28518] [ 0] /lib64/libpthread.so.0 [0x3a21c0e7c0]
[burl-ct-v20z-26:28518] [ 1] /lib64/libc.so.6(memcpy+0x35) [0x3a2107bde5]
[burl-ct-v20z-26:28518] [ 2] /workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_copy+0x58)
[burl-ct-v20z-26:28518] [ 3] /workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/openmpi/mca_oob_tcp.so
[burl-ct-v20z-26:28518] [ 4] /workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/openmpi/mca_oob_tcp.so(mca_oob_tcp_recv_nb+0x314)
[burl-ct-v20z-26:28518] [ 5] /workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/openmpi/mca_rml_oob.so(orte_rml_oob_recv_buffer_nb+0xff)
[burl-ct-v20z-26:28518] [ 6] /workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/openmpi/mca_dpm_orte.so
[burl-ct-v20z-26:28518] [ 7] /workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/libmpi.so.0(PMPI_Comm_spawn+0x2ee)
[burl-ct-v20z-26:28518] [ 8] dynamic/loop_spawn [0x40120b]
[burl-ct-v20z-26:28518] [ 9] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3a2101d994]
[burl-ct-v20z-26:28518] [10] dynamic/loop_spawn [0x400dd9]
[burl-ct-v20z-26:28518] *** End of error message ***
This one with Oracle Studio compilers:
parent: MPI_Comm_spawn #0 return : 0
parent: MPI_Comm_spawn #20 return : 0
[burl-ct-x2200-12:02348] *** Process received signal ***
[burl-ct-x2200-12:02348] Signal: Segmentation fault (11)
[burl-ct-x2200-12:02348] Signal code: Address not mapped (1)
[burl-ct-x2200-12:02348] Failing at address: 0x10
[burl-ct-x2200-12:02348] [ 0] /lib64/libpthread.so.0 [0x318ac0de80]
[burl-ct-x2200-12:02348] [ 1] /workspace/tdontje/hpc/mtt-scratch/burl-ct-x2200-12/ompi-tarball-testing/installs/Q7wT/install/lib/lib64/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_recv_handler+0xe3)
[burl-ct-x2200-12:02348] [ 2] /workspace/tdontje/hpc/mtt-scratch/burl-ct-x2200-12/ompi-tarball-testing/installs/Q7wT/install/lib/lib64/openmpi/mca_oob_tcp.so
[burl-ct-x2200-12:02348] [ 3] /workspace/tdontje/hpc/mtt-scratch/burl-ct-x2200-12/ompi-tarball-testing/installs/Q7wT/install/lib/lib64/libmpi.so.0
[burl-ct-x2200-12:02348] [ 4] /workspace/tdontje/hpc/mtt-scratch/burl-ct-x2200-12/ompi-tarball-testing/installs/Q7wT/install/lib/lib64/libmpi.so.0(opal_libevent2019_event_base_loop+0x7c7)
[burl-ct-x2200-12:02348] [ 5] /workspace/tdontje/hpc/mtt-scratch/burl-ct-x2200-12/ompi-tarball-testing/installs/Q7wT/install/lib/lib64/libmpi.so.0
[burl-ct-x2200-12:02348] [ 6] /lib64/libpthread.so.0 [0x318ac06307]
[burl-ct-x2200-12:02348] [ 7] /lib64/libc.so.6(clone+0x6d) [0x318a0d1ded]
[burl-ct-x2200-12:02348] *** End of error message ***
Sometimes, I see a hang rather than a segfault.