
Open MPI User's Mailing List Archives



Subject: Re: [OMPI users] openib segfaults with Torque
From: Fischer, Greg A. (fischega_at_[hidden])
Date: 2014-06-10 14:53:58


Yes, this fails on all nodes on the system, except for the head node.

The uptime of the system isn't significant. Maybe 1 week, and it's received basically no use.

-----Original Message-----
From: Nathan Hjelm [mailto:hjelmn_at_[hidden]]
Sent: Tuesday, June 10, 2014 2:49 PM
To: Fischer, Greg A.
Cc: Open MPI Users
Subject: Re: [OMPI users] openib segfaults with Torque

Well, that's interesting. The output shows that ibv_create_cq is failing. That's strange, since an identical call had just succeeded (udcm creates two completion queues). A couple of questions that might help pinpoint where the failure is:

Does this fail on any other node in your system?

How long has the node been up?

-Nathan Hjelm
Application Readiness, HPC-5, LANL
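One frequent culprit in exactly this situation, not confirmed for this system but cheap to rule out, is the locked-memory limit: processes launched under Torque inherit pbs_mom's `ulimit -l`, which is often far lower than what an interactive SSH login gets, and verbs calls such as ibv_create_cq can fail once registered memory runs out. A quick check from inside the allocation:

```shell
# Run this inside the "qsub -I" session on the compute node where ring_c
# fails, then compare with the same command over a plain SSH login to the
# same node. Verbs resource creation wants this large or "unlimited".
ulimit -l

# If the value under Torque is small (e.g. 64) while the SSH login reports
# "unlimited", raise the limit where pbs_mom is started -- typically an
# "ulimit -l unlimited" in the pbs_mom init script -- and restart pbs_mom
# on the compute nodes.
```

If the two values differ, that points at the Torque daemon environment rather than at Open MPI itself.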

On Tue, Jun 10, 2014 at 02:06:54PM -0400, Fischer, Greg A. wrote:
> Jeff/Nathan,
>
> I ran the following with my debug build of Open MPI 1.8.1 - after opening a terminal on a compute node with "qsub -l nodes=2 -I":
>
> mpirun -mca btl openib,self -mca btl_base_verbose 100 -np 2 ring_c &> output.txt
>
> Output and backtrace are attached. Let me know if I can provide anything else.
>
> Thanks for looking into this,
> Greg
>
> -----Original Message-----
> From: users [mailto:users-bounces_at_[hidden]] On Behalf Of Jeff
> Squyres (jsquyres)
> Sent: Tuesday, June 10, 2014 10:31 AM
> To: Nathan Hjelm
> Cc: Open MPI Users
> Subject: Re: [OMPI users] openib segfaults with Torque
>
> Greg:
>
> Can you run with "--mca btl_base_verbose 100" on your debug build so that we can get some additional output to see why UDCM is failing to setup properly?
>
>
>
> On Jun 10, 2014, at 10:25 AM, Nathan Hjelm <hjelmn_at_[hidden]> wrote:
>
> > On Tue, Jun 10, 2014 at 12:10:28AM +0000, Jeff Squyres (jsquyres) wrote:
> >> I seem to recall that you have an IB-based cluster, right?
> >>
> >> From a *very quick* glance at the code, it looks like this might be a simple incorrect-finalization issue. That is:
> >>
> >> - you run the job on a single server
> >> - openib disqualifies itself because you're running on a single
> >> server
> >> - openib then goes to finalize/close itself
> >> - but openib didn't fully initialize itself (because it
> >> disqualified itself early in the initialization process), and
> >> something in the finalization process didn't take that into account
> >>
> >> Nathan -- is that anywhere close to correct?
> >
> > Nope. udcm_module_finalize is being called because there was an
> > error setting up the udcm state. See btl_openib_connect_udcm.c:476.
> > The opal_list_t destructor is getting an assert failure. Probably
> > because the constructor wasn't called. I can rearrange the
> > constructors to be called first but there appears to be a deeper
> > issue with the user's
> > system: udcm_module_init should not be failing! It creates a couple
> > of CQs, allocates a small number of registered buffers, and starts
> > monitoring the fd for the completion channel. All these things are
> > also done in the setup of the openib btl itself. Keep in mind that
> > the openib btl will not disqualify itself when running on a single server.
> > Openib may be used to communicate on-node and is needed for the dynamic process (spawn) case.
> >
> > The user might try adding -mca btl_base_verbose 100 to shed some
> > light on what the real issue is.
> >
> > BTW, I no longer monitor the user mailing list. If something needs
> > my attention forward it to me directly.
> >
> > -Nathan
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>

> Core was generated by `ring_c'.
> Program terminated with signal 6, Aborted.
> #0  0x00007f8b6ae1cb55 in raise () from /lib64/libc.so.6
> #1  0x00007f8b6ae1e0c5 in abort () from /lib64/libc.so.6
> #2  0x00007f8b6ae15a10 in __assert_fail () from /lib64/libc.so.6
> #3  0x00007f8b664b684b in udcm_module_finalize (btl=0x717060, cpc=0x7190c0)
>     at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734
> #4  0x00007f8b664b5474 in udcm_component_query (btl=0x717060, cpc=0x718a48)
>     at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:476
> #5  0x00007f8b664ae316 in ompi_btl_openib_connect_base_select_for_local_port (btl=0x717060)
>     at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_base.c:273
> #6  0x00007f8b66497817 in btl_openib_component_init (num_btl_modules=0x7fffe34cebe0, enable_progress_threads=false, enable_mpi_threads=false)
>     at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:2703
> #7  0x00007f8b6b43fa5e in mca_btl_base_select (enable_progress_threads=false, enable_mpi_threads=false)
>     at ../../../../openmpi-1.8.1/ompi/mca/btl/base/btl_base_select.c:108
> #8  0x00007f8b666d9d42 in mca_bml_r2_component_init (priority=0x7fffe34cecb4, enable_progress_threads=false, enable_mpi_threads=false)
>     at ../../../../../openmpi-1.8.1/ompi/mca/bml/r2/bml_r2_component.c:88
> #9  0x00007f8b6b43ed1b in mca_bml_base_init (enable_progress_threads=false, enable_mpi_threads=false)
>     at ../../../../openmpi-1.8.1/ompi/mca/bml/base/bml_base_init.c:69
> #10 0x00007f8b655ff739 in mca_pml_ob1_component_init (priority=0x7fffe34cedf0, enable_progress_threads=false, enable_mpi_threads=false)
>     at ../../../../../openmpi-1.8.1/ompi/mca/pml/ob1/pml_ob1_component.c:271
> #11 0x00007f8b6b4659b2 in mca_pml_base_select (enable_progress_threads=false, enable_mpi_threads=false)
>     at ../../../../openmpi-1.8.1/ompi/mca/pml/base/pml_base_select.c:128
> #12 0x00007f8b6b3d233c in ompi_mpi_init (argc=1, argv=0x7fffe34cf0e8, requested=0, provided=0x7fffe34cef98)
>     at ../../openmpi-1.8.1/ompi/runtime/ompi_mpi_init.c:604
> #13 0x00007f8b6b407386 in PMPI_Init (argc=0x7fffe34cefec, argv=0x7fffe34cefe0) at pinit.c:84
> #14 0x000000000040096f in main (argc=1, argv=0x7fffe34cf0e8) at ring_c.c:19
>

> [binf316:24591] mca: base: components_register: registering btl components
> [binf316:24591] mca: base: components_register: found loaded component openib
> [binf316:24592] mca: base: components_register: registering btl components
> [binf316:24592] mca: base: components_register: found loaded component openib
> [binf316:24591] mca: base: components_register: component openib register function successful
> [binf316:24591] mca: base: components_register: found loaded component self
> [binf316:24591] mca: base: components_register: component self register function successful
> [binf316:24591] mca: base: components_open: opening btl components
> [binf316:24591] mca: base: components_open: found loaded component openib
> [binf316:24591] mca: base: components_open: component openib open function successful
> [binf316:24591] mca: base: components_open: found loaded component self
> [binf316:24591] mca: base: components_open: component self open function successful
> [binf316:24592] mca: base: components_register: component openib register function successful
> [binf316:24592] mca: base: components_register: found loaded component self
> [binf316:24592] mca: base: components_register: component self register function successful
> [binf316:24592] mca: base: components_open: opening btl components
> [binf316:24592] mca: base: components_open: found loaded component openib
> [binf316:24592] mca: base: components_open: component openib open function successful
> [binf316:24592] mca: base: components_open: found loaded component self
> [binf316:24592] mca: base: components_open: component self open function successful
> [binf316:24591] select: initializing btl component openib
> [binf316:24592] select: initializing btl component openib
> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_ip.c:364:add_rdma_addr] Adding addr 9.9.10.75 (0x4b0a0909) subnet 0x9090000 as mlx4_0:1
> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_ip.c:364:add_rdma_addr] Adding addr 9.9.10.75 (0x4b0a0909) subnet 0x9090000 as mlx4_0:1
> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:686:init_one_port] looking for mlx4_0:1 GID index 0
> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:717:init_one_port] my IB subnet_id for HCA mlx4_0 port 1 is fe80000000000000
> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1294:setup_qps] pp: rd_num is 256 rd_low is 192 rd_win 128 rd_rsv 4
> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1339:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1339:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1339:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:1840:rdmacm_component_query] rdmacm_component_query
> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_ip.c:132:mca_btl_openib_rdma_get_ipv4addr] Looking for mlx4_0:1 in IP address list
> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_ip.c:141:mca_btl_openib_rdma_get_ipv4addr] FOUND: mlx4_0:1 is 9.9.10.75 (0x4b0a0909)
> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:1750:ipaddrcheck] Found device mlx4_0:1 = IP address 9.9.10.75 (0x4b0a0909):51845
> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:1776:ipaddrcheck] creating new server to listen on 9.9.10.75 (0x4b0a0909):51845
> [binf316:24591] openib BTL: rdmacm CPC available for use on mlx4_0:1
> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:542:udcm_module_init] created cpc module 0x719220 for btl 0x716ee0
> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:686:init_one_port] looking for mlx4_0:1 GID index 0
> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:717:init_one_port] my IB subnet_id for HCA mlx4_0 port 1 is fe80000000000000
> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:565:udcm_module_init] error creating ud send completion queue
> ring_c: ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734: udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
> [binf316:24591] *** Process received signal ***
> [binf316:24591] Signal: Aborted (6)
> [binf316:24591] Signal code: (-6)
> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1294:setup_qps] pp: rd_num is 256 rd_low is 192 rd_win 128 rd_rsv 4
> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1339:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1339:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1339:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:1840:rdmacm_component_query] rdmacm_component_query
> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_ip.c:132:mca_btl_openib_rdma_get_ipv4addr] Looking for mlx4_0:1 in IP address list
> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_ip.c:141:mca_btl_openib_rdma_get_ipv4addr] FOUND: mlx4_0:1 is 9.9.10.75 (0x4b0a0909)
> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:1750:ipaddrcheck] Found device mlx4_0:1 = IP address 9.9.10.75 (0x4b0a0909):57734
> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:1776:ipaddrcheck] creating new server to listen on 9.9.10.75 (0x4b0a0909):57734
> [binf316:24592] openib BTL: rdmacm CPC available for use on mlx4_0:1
> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:542:udcm_module_init] created cpc module 0x7190c0 for btl 0x717060
> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:565:udcm_module_init] error creating ud send completion queue
> ring_c: ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734: udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
> [binf316:24592] *** Process received signal ***
> [binf316:24592] Signal: Aborted (6)
> [binf316:24592] Signal code: (-6)
> [binf316:24591] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7fb35959c7c0]
> [binf316:24591] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7fb359248b55]
> [binf316:24591] [ 2] /lib64/libc.so.6(abort+0x181)[0x7fb35924a131]
> [binf316:24591] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x7fb359241a10]
> [binf316:24591] [ 4] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7fb3548e284b]
> [binf316:24591] [ 5] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7fb3548e1474]
> [binf316:24591] [ 6] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7fb3548da316]
> [binf316:24591] [ 7] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x18817)[0x7fb3548c3817]
> [binf316:24591] [ 8] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_btl_base_select+0x1b2)[0x7fb35986ba5e]
> [binf316:24591] [ 9] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x7fb354b05d42]
> [binf316:24591] [10] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_bml_base_init+0xd6)[0x7fb35986ad1b]
> [binf316:24591] [11] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_pml_ob1.so(+0x7739)[0x7fb353a2b739]
> [binf316:24591] [12] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_pml_base_select+0x26e)[0x7fb3598919b2]
> [binf316:24591] [13] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7fb3597fe33c]
> [binf316:24591] [14] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7fb359833386]
> [binf316:24591] [15] ring_c[0x40096f]
> [binf316:24591] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7fb359234c36]
> [binf316:24591] [17] ring_c[0x400889]
> [binf316:24591] *** End of error message ***
> [binf316:24592] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7f8b6b1707c0]
> [binf316:24592] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7f8b6ae1cb55]
> [binf316:24592] [ 2] /lib64/libc.so.6(abort+0x181)[0x7f8b6ae1e131]
> [binf316:24592] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x7f8b6ae15a10]
> [binf316:24592] [ 4] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7f8b664b684b]
> [binf316:24592] [ 5] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7f8b664b5474]
> [binf316:24592] [ 6] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7f8b664ae316]
> [binf316:24592] [ 7] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x18817)[0x7f8b66497817]
> [binf316:24592] [ 8] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_btl_base_select+0x1b2)[0x7f8b6b43fa5e]
> [binf316:24592] [ 9] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x7f8b666d9d42]
> [binf316:24592] [10] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_bml_base_init+0xd6)[0x7f8b6b43ed1b]
> [binf316:24592] [11] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_pml_ob1.so(+0x7739)[0x7f8b655ff739]
> [binf316:24592] [12] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_pml_base_select+0x26e)[0x7f8b6b4659b2]
> [binf316:24592] [13] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7f8b6b3d233c]
> [binf316:24592] [14] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7f8b6b407386]
> [binf316:24592] [15] ring_c[0x40096f]
> [binf316:24592] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f8b6ae08c36]
> [binf316:24592] [17] ring_c[0x400889]
> [binf316:24592] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 24591 on node xxxx316 exited on signal 6 (Aborted).
> --------------------------------------------------------------------------