Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openib segfaults with Torque
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2014-06-09 20:10:28


I seem to recall that you have an IB-based cluster, right?

From a *very quick* glance at the code, it looks like this might be a simple incorrect-finalization issue. That is:

- you run the job on a single server
- openib disqualifies itself because you're running on a single server
- openib then goes to finalize/close itself
- but openib didn't fully initialize itself (because it disqualified itself early in the initialization process), and something in the finalization process didn't take that into account
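
To make that pattern concrete, here is a minimal C sketch of a finalize routine that tolerates a partially initialized module. Everything in it is hypothetical (demo_module_t, demo_module_init, demo_module_finalize, and the single_server flag are illustration only); it is not the actual udcm code, and I have not confirmed that this is the real cause.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a connection module; NOT Open MPI's real type. */
typedef struct {
    bool initialized;  /* set only after every init step has succeeded */
    int *resource;     /* stands in for a queue pair, listener thread, etc. */
} demo_module_t;

/* Init that may bail out early, like a component disqualifying itself.
 * Returns 0 on success, -1 if the module is not usable. */
int demo_module_init(demo_module_t *m, bool single_server)
{
    assert(NULL != m);
    m->initialized = false;
    m->resource = NULL;

    if (single_server) {
        return -1;  /* disqualified before any resource was created */
    }

    m->resource = malloc(sizeof *m->resource);
    if (NULL == m->resource) {
        return -1;
    }
    *m->resource = 0;
    m->initialized = true;
    return 0;
}

/* Finalize that does not assume init ran to completion: it releases only
 * what actually exists. Returns how many resources were released.
 * A finalize that instead asserted "fully initialized" would abort when
 * called on the early-exit path above. */
int demo_module_finalize(demo_module_t *m)
{
    int released = 0;

    assert(NULL != m);
    if (NULL != m->resource) {
        free(m->resource);
        m->resource = NULL;
        released = 1;
    }
    m->initialized = false;
    return released;
}
```

With this shape, calling demo_module_finalize() right after a failed demo_module_init() is a harmless no-op, and calling it twice is safe as well.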

Nathan -- is that anywhere close to correct?

On Jun 5, 2014, at 5:10 PM, "Fischer, Greg A." <fischega_at_[hidden]> wrote:

> OpenMPI Users,
>
> After encountering difficulty with the Intel compilers (see the “intermittent segfaults with openib on ring_c.c” thread), I installed GCC-4.8.3 and recompiled OpenMPI. I ran the simple examples (ring, etc.) with the openib BTL in a typical BASH environment. Everything appeared to work fine, so I went on my merry way compiling the rest of my dependencies.
>
> After getting my dependencies and applications compiled, I began observing segfaults when submitting the applications through Torque. I recompiled OpenMPI with debug options, ran “ring_c” over the openib BTL in an interactive Torque session (“qsub -I”), and got the backtrace below. All other system settings described in the previous thread are the same. Any thoughts on how to resolve this issue?
>
> Core was generated by `ring_c'.
> Program terminated with signal 6, Aborted.
> #0 0x00007f7f5920ab55 in raise () from /lib64/libc.so.6
> (gdb) bt
> #0 0x00007f7f5920ab55 in raise () from /lib64/libc.so.6
> #1 0x00007f7f5920c0c5 in abort () from /lib64/libc.so.6
> #2 0x00007f7f59203a10 in __assert_fail () from /lib64/libc.so.6
> #3 0x00007f7f548a484b in udcm_module_finalize (btl=0x716680, cpc=0x718c40) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734
> #4 0x00007f7f548a3474 in udcm_component_query (btl=0x716680, cpc=0x717be8) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:476
> #5 0x00007f7f5489c316 in ompi_btl_openib_connect_base_select_for_local_port (btl=0x716680) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_base.c:273
> #6 0x00007f7f54885817 in btl_openib_component_init (num_btl_modules=0x7fff906aa420, enable_progress_threads=false, enable_mpi_threads=false) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:2703
> #7 0x00007f7f5982da5e in mca_btl_base_select (enable_progress_threads=false, enable_mpi_threads=false) at ../../../../openmpi-1.8.1/ompi/mca/btl/base/btl_base_select.c:108
> #8 0x00007f7f54ac7d42 in mca_bml_r2_component_init (priority=0x7fff906aa4f4, enable_progress_threads=false, enable_mpi_threads=false) at ../../../../../openmpi-1.8.1/ompi/mca/bml/r2/bml_r2_component.c:88
> #9 0x00007f7f5982cd1b in mca_bml_base_init (enable_progress_threads=false, enable_mpi_threads=false) at ../../../../openmpi-1.8.1/ompi/mca/bml/base/bml_base_init.c:69
> #10 0x00007f7f539ed739 in mca_pml_ob1_component_init (priority=0x7fff906aa630, enable_progress_threads=false, enable_mpi_threads=false) at ../../../../../openmpi-1.8.1/ompi/mca/pml/ob1/pml_ob1_component.c:271
> #11 0x00007f7f598539b2 in mca_pml_base_select (enable_progress_threads=false, enable_mpi_threads=false) at ../../../../openmpi-1.8.1/ompi/mca/pml/base/pml_base_select.c:128
> #12 0x00007f7f597c033c in ompi_mpi_init (argc=1, argv=0x7fff906aa928, requested=0, provided=0x7fff906aa7d8) at ../../openmpi-1.8.1/ompi/runtime/ompi_mpi_init.c:604
> #13 0x00007f7f597f5386 in PMPI_Init (argc=0x7fff906aa82c, argv=0x7fff906aa820) at pinit.c:84
> #14 0x000000000040096f in main (argc=1, argv=0x7fff906aa928) at ring_c.c:19
>
> Greg

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/