Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openib segfaults with Torque
From: Fischer, Greg A. (fischega_at_[hidden])
Date: 2014-06-05 17:16:20


Here's the command I'm invoking and the terminal output. (Some of this information doesn't appear to be captured in the backtrace.)

[binf316:fischega] $ mpirun -np 1 -mca btl openib,self ring_c
ring_c: ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734: udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
[binf316:04549] *** Process received signal ***
[binf316:04549] Signal: Aborted (6)
[binf316:04549] Signal code: (-6)
[binf316:04549] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7f7f5955e7c0]
[binf316:04549] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7f7f5920ab55]
[binf316:04549] [ 2] /lib64/libc.so.6(abort+0x181)[0x7f7f5920c131]
[binf316:04549] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x7f7f59203a10]
[binf316:04549] [ 4] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7f7f548a484b]
[binf316:04549] [ 5] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7f7f548a3474]
[binf316:04549] [ 6] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7f7f5489c316]
[binf316:04549] [ 7] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x18817)[0x7f7f54885817]
[binf316:04549] [ 8] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_btl_base_select+0x1b2)[0x7f7f5982da5e]
[binf316:04549] [ 9] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x7f7f54ac7d42]
[binf316:04549] [10] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_bml_base_init+0xd6)[0x7f7f5982cd1b]
[binf316:04549] [11] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_pml_ob1.so(+0x7739)[0x7f7f539ed739]
[binf316:04549] [12] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_pml_base_select+0x26e)[0x7f7f598539b2]
[binf316:04549] [13] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7f7f597c033c]
[binf316:04549] [14] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7f7f597f5386]
[binf316:04549] [15] ring_c[0x40096f]
[binf316:04549] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f7f591f6c36]
[binf316:04549] [17] ring_c[0x400889]
[binf316:04549] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 4549 on node xxxx316 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
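Several frames in the signal-handler output above show only a module path and an offset (for example `(+0x3784b)`), with no symbol name. As a rough sketch, the script below parses lines of this shape and prints an `addr2line` command for each unresolved frame. The helper names `parse_frame` and `addr2line_commands` are made up for illustration, and the script assumes the offset in parentheses is relative to the start of the shared object, which is usually but not always the case.

```python
import re

# Matches signal-handler frames such as:
#   [binf316:04549] [ 4] /path/to/mca_btl_openib.so(+0x3784b)[0x7f7f548a484b]
#   [binf316:04549] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7f7f5920ab55]
#   [binf316:04549] [15] ring_c[0x40096f]
FRAME_RE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+"      # [host:pid]
    r"\[\s*(?P<index>\d+)\]\s+"                   # [ N] frame index
    r"(?P<module>[^(\[\s]+)"                      # module path or binary name
    r"(?:\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-fA-F]+)\))?"  # optional (sym+off)
    r"\[(?P<address>0x[0-9a-fA-F]+)\]"            # absolute address
)


def parse_frame(line):
    """Return a dict describing one frame, or None if the line is not a frame."""
    m = FRAME_RE.search(line)
    if m is None:
        return None
    return {
        "index": int(m.group("index")),
        "module": m.group("module"),
        "symbol": m.group("symbol") or None,  # empty string means "unresolved"
        "offset": m.group("offset"),
        "address": m.group("address"),
    }


def addr2line_commands(lines):
    """Build one addr2line command per frame that has an offset but no symbol."""
    commands = []
    for line in lines:
        frame = parse_frame(line)
        if frame and frame["offset"] and frame["symbol"] is None:
            commands.append(
                f"addr2line -f -e {frame['module']} {frame['offset']}"
            )
    return commands


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python frames.py < mpirun_output.txt
    for command in addr2line_commands(sys.stdin):
        print(command)
```

Because this is a debug build, the printed commands should map the anonymous offsets back to function names and source lines, which can then be cross-checked against the gdb backtrace.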

From: Fischer, Greg A.
Sent: Thursday, June 05, 2014 5:10 PM
To: users_at_[hidden]
Cc: Fischer, Greg A.
Subject: openib segfaults with Torque

OpenMPI Users,

After encountering difficulty with the Intel compilers (see the "intermittent segfaults with openib on ring_c.c" thread), I installed GCC 4.8.3 and recompiled Open MPI. I ran the simple examples (ring, etc.) with the openib BTL in a typical Bash environment. Everything appeared to work fine, so I went on to compile the rest of my dependencies.

After getting my dependencies and applications compiled, I began observing segfaults when submitting the applications through Torque. I recompiled Open MPI with debug options, ran "ring_c" over the openib BTL in an interactive Torque session ("qsub -I"), and got the backtrace below. All other system settings described in the previous thread are the same. Any thoughts on how to resolve this issue?
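Since the same binary behaves differently in a plain shell and under Torque, one thing worth comparing is the process environment in each. The sketch below prints the locked-memory limit and two Torque job variables, and is meant to be run once in the ordinary shell and once inside the "qsub -I" session. Treating the locked-memory limit as relevant is an assumption: it is a commonly reported difference for InfiniBand jobs started by a resource manager, but nothing in this report confirms it as the cause here. The function names are illustrative only.

```python
import os
import resource


def describe_limit(value):
    """Render an rlimit value the way `ulimit` would, in bytes."""
    return "unlimited" if value == resource.RLIM_INFINITY else f"{value} bytes"


def memlock_limits():
    """Return the (soft, hard) locked-memory limits as readable strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return describe_limit(soft), describe_limit(hard)


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print(f"max locked memory: soft={soft}, hard={hard}")

    # These variables are set by Torque inside a job; outside a job they
    # are normally absent, which makes the two runs easy to tell apart.
    for name in ("PBS_JOBID", "PBS_ENVIRONMENT"):
        print(f"{name}={os.environ.get(name, '<not set>')}")
```

If the two runs report different limits, that narrows the search to how the Torque daemons are started rather than to the Open MPI build itself.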

Core was generated by `ring_c'.
Program terminated with signal 6, Aborted.
#0 0x00007f7f5920ab55 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x00007f7f5920ab55 in raise () from /lib64/libc.so.6
#1 0x00007f7f5920c0c5 in abort () from /lib64/libc.so.6
#2 0x00007f7f59203a10 in __assert_fail () from /lib64/libc.so.6
#3 0x00007f7f548a484b in udcm_module_finalize (btl=0x716680, cpc=0x718c40) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734
#4 0x00007f7f548a3474 in udcm_component_query (btl=0x716680, cpc=0x717be8) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:476
#5 0x00007f7f5489c316 in ompi_btl_openib_connect_base_select_for_local_port (btl=0x716680) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_base.c:273
#6 0x00007f7f54885817 in btl_openib_component_init (num_btl_modules=0x7fff906aa420, enable_progress_threads=false, enable_mpi_threads=false)
    at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:2703
#7 0x00007f7f5982da5e in mca_btl_base_select (enable_progress_threads=false, enable_mpi_threads=false) at ../../../../openmpi-1.8.1/ompi/mca/btl/base/btl_base_select.c:108
#8 0x00007f7f54ac7d42 in mca_bml_r2_component_init (priority=0x7fff906aa4f4, enable_progress_threads=false, enable_mpi_threads=false) at ../../../../../openmpi-1.8.1/ompi/mca/bml/r2/bml_r2_component.c:88
#9 0x00007f7f5982cd1b in mca_bml_base_init (enable_progress_threads=false, enable_mpi_threads=false) at ../../../../openmpi-1.8.1/ompi/mca/bml/base/bml_base_init.c:69
#10 0x00007f7f539ed739 in mca_pml_ob1_component_init (priority=0x7fff906aa630, enable_progress_threads=false, enable_mpi_threads=false)
    at ../../../../../openmpi-1.8.1/ompi/mca/pml/ob1/pml_ob1_component.c:271
#11 0x00007f7f598539b2 in mca_pml_base_select (enable_progress_threads=false, enable_mpi_threads=false) at ../../../../openmpi-1.8.1/ompi/mca/pml/base/pml_base_select.c:128
#12 0x00007f7f597c033c in ompi_mpi_init (argc=1, argv=0x7fff906aa928, requested=0, provided=0x7fff906aa7d8) at ../../openmpi-1.8.1/ompi/runtime/ompi_mpi_init.c:604
#13 0x00007f7f597f5386 in PMPI_Init (argc=0x7fff906aa82c, argv=0x7fff906aa820) at pinit.c:84
#14 0x000000000040096f in main (argc=1, argv=0x7fff906aa928) at ring_c.c:19

Greg