Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openib segfaults with Torque
From: Martin Siegert (siegert_at_[hidden])
Date: 2014-06-11 19:05:04


It isn't really Torque that is imposing those constraints:
- the torque_mom initscript inherits from the OS whatever ulimits are
  in effect at that time;
- each job inherits the ulimits from the pbs_mom.

Thus, you need to change the ulimits from whatever is set at
startup time, e.g., in /etc/sysconfig/torque_mom:

ulimit -d unlimited
ulimit -s unlimited
ulimit -n 32768
ulimit -l 2097152

or whatever you consider to be reasonable.
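
To confirm that a job really sees the new limits after pbs_mom has been
restarted, the values can be read from inside a job (for instance an
interactive qsub session). The following is only a minimal sketch, assuming
Python 3 is available on the compute nodes; the helper names job_limits and
describe are made up for illustration and are not part of Torque or Open MPI.
Note that getrlimit reports data, stack and memlock in bytes, whereas
ulimit -d, -s and -l report them in kilobytes.

```python
import resource


def describe(value):
    """Render a raw rlimit value the way `ulimit` would label it."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def job_limits():
    """Return {label: (soft, hard)} for the four limits set above.

    The labels are illustrative only; the values come straight from
    getrlimit(2) for the current process, i.e. whatever the job
    inherited from pbs_mom.
    """
    wanted = {
        "data (ulimit -d)": resource.RLIMIT_DATA,
        "stack (ulimit -s)": resource.RLIMIT_STACK,
        "open files (ulimit -n)": resource.RLIMIT_NOFILE,
        "locked memory (ulimit -l)": resource.RLIMIT_MEMLOCK,
    }
    return {label: resource.getrlimit(res) for label, res in wanted.items()}


if __name__ == "__main__":
    for label, (soft, hard) in job_limits().items():
        print(f"{label}: soft={describe(soft)} hard={describe(hard)}")
```

If the locked-memory line still shows a small value inside the job while a
login shell on the same node shows a large one, the job is most likely still
inheriting the old limits from a pbs_mom that was started before the change.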

Cheers,
Martin

-- 
Martin Siegert
WestGrid/ComputeCanada
Simon Fraser University
Burnaby, British Columbia

On Wed, Jun 11, 2014 at 10:20:08PM +0000, Jeff Squyres (jsquyres) wrote:
> +1
> 
> On Jun 11, 2014, at 6:01 PM, Ralph Castain <rhc_at_[hidden]>
>  wrote:
> 
> > Yeah, I think we've seen that somewhere before too...
> > 
> > 
> > On Jun 11, 2014, at 2:59 PM, Joshua Ladd <jladd.mlnx_at_[hidden]> wrote:
> > 
> >> Agreed. The problem is not with UDCM. I don't think something is wrong with the system. I think his Torque is imposing major constraints on the maximum size that can be locked into memory.
> >> 
> >> Josh
> >> 
> >> 
> >> On Wed, Jun 11, 2014 at 5:49 PM, Nathan Hjelm <hjelmn_at_[hidden]> wrote:
> >> Probably won't help to use RDMACM though as you will just see the
> >> resource failure somewhere else. UDCM is not the problem. Something is
> >> wrong with the system. Allocating a 512 entry CQ should not fail.
> >> 
> >> -Nathan
> >> 
> >> On Wed, Jun 11, 2014 at 05:03:31PM -0400, Joshua Ladd wrote:
> >> >    I'm guessing it's a resource limitation issue coming from Torque.
> >> >
> >> >    Hmmmm...I found something interesting on the interwebs that looks awfully
> >> >    similar:
> >> >    http://www.supercluster.org/pipermail/torqueusers/2008-February/006916.html
> >> >
> >> >    Greg, if the suggestion from the Torque users (
> >> >    "...adding the following line 'ulimit -l unlimited' to pbs_mom and
> >> >    restarting pbs_mom." ) doesn't resolve your issue, try using the RDMACM
> >> >    CPC (instead of UDCM, which is a pretty recent addition to the openib BTL) by setting:
> >> >
> >> >    -mca btl_openib_cpc_include rdmacm
> >> >
> >> >    Josh
> >> >
> >> >    On Wed, Jun 11, 2014 at 4:04 PM, Jeff Squyres (jsquyres)
> >> >    <jsquyres_at_[hidden]> wrote:
> >> >
> >> >      Mellanox --
> >> >
> >> >      What would cause a CQ to fail to be created?
> >> >
> >> >      On Jun 11, 2014, at 3:42 PM, "Fischer, Greg A."
> >> >      <fischega_at_[hidden]> wrote:
> >> >
> >> >      > Is there any other work around that I might try?  Something that
> >> >      avoids UDCM?
> >> >      >
> >> >      > -----Original Message-----
> >> >      > From: Fischer, Greg A.
> >> >      > Sent: Tuesday, June 10, 2014 2:59 PM
> >> >      > To: Nathan Hjelm
> >> >      > Cc: Open MPI Users; Fischer, Greg A.
> >> >      > Subject: RE: [OMPI users] openib segfaults with Torque
> >> >      >
> >> >      > [binf316:fischega] $ ulimit -m
> >> >      > unlimited
> >> >      >
> >> >      > Greg
> >> >      >
> >> >      > -----Original Message-----
> >> >      > From: Nathan Hjelm [mailto:hjelmn_at_[hidden]]
> >> >      > Sent: Tuesday, June 10, 2014 2:58 PM
> >> >      > To: Fischer, Greg A.
> >> >      > Cc: Open MPI Users
> >> >      > Subject: Re: [OMPI users] openib segfaults with Torque
> >> >      >
> >> >      > Out of curiosity what is the mlock limit on your system? If it is too
> >> >      low that can cause ibv_create_cq to fail. To check run ulimit -l (not ulimit -m, which reports the maximum resident set size).
> >> >      >
> >> >      > -Nathan Hjelm
> >> >      > Application Readiness, HPC-5, LANL
> >> >      >
> >> >      > On Tue, Jun 10, 2014 at 02:53:58PM -0400, Fischer, Greg A. wrote:
> >> >      >> Yes, this fails on all nodes on the system, except for the head node.
> >> >      >>
> >> >      >> The uptime of the system isn't significant. Maybe 1 week, and it's
> >> >      received basically no use.
> >> >      >>
> >> >      >> -----Original Message-----
> >> >      >> From: Nathan Hjelm [mailto:hjelmn_at_[hidden]]
> >> >      >> Sent: Tuesday, June 10, 2014 2:49 PM
> >> >      >> To: Fischer, Greg A.
> >> >      >> Cc: Open MPI Users
> >> >      >> Subject: Re: [OMPI users] openib segfaults with Torque
> >> >      >>
> >> >      >>
> >> >      >> Well, that's interesting. The output shows that ibv_create_cq is
> >> >      failing. Strange since an identical call had just succeeded (udcm
> >> >      creates two completion queues). Some questions that might indicate where
> >> >      the failure might be:
> >> >      >>
> >> >      >> Does this fail on any other node in your system?
> >> >      >>
> >> >      >> How long has the node been up?
> >> >      >>
> >> >      >> -Nathan Hjelm
> >> >      >> Application Readiness, HPC-5, LANL
> >> >      >>
> >> >      >> On Tue, Jun 10, 2014 at 02:06:54PM -0400, Fischer, Greg A. wrote:
> >> >      >>> Jeff/Nathan,
> >> >      >>>
> >> >      >>> I ran the following with my debug build of OpenMPI 1.8.1 - after
> >> >      opening a terminal on a compute node with "qsub -l nodes 2 -I":
> >> >      >>>
> >> >      >>>      mpirun -mca btl openib,self -mca btl_base_verbose 100 -np 2
> >> >      >>> ring_c &> output.txt
> >> >      >>>
> >> >      >>> Output and backtrace are attached. Let me know if I can provide
> >> >      anything else.
> >> >      >>>
> >> >      >>> Thanks for looking into this,
> >> >      >>> Greg
> >> >      >>>
> >> >      >>> -----Original Message-----
> >> >      >>> From: users [mailto:users-bounces_at_[hidden]] On Behalf Of Jeff
> >> >      >>> Squyres (jsquyres)
> >> >      >>> Sent: Tuesday, June 10, 2014 10:31 AM
> >> >      >>> To: Nathan Hjelm
> >> >      >>> Cc: Open MPI Users
> >> >      >>> Subject: Re: [OMPI users] openib segfaults with Torque
> >> >      >>>
> >> >      >>> Greg:
> >> >      >>>
> >> >      >>> Can you run with "--mca btl_base_verbose 100" on your debug build so
> >> >      that we can get some additional output to see why UDCM is failing to
> >> >      set up properly?
> >> >      >>>
> >> >      >>>
> >> >      >>>
> >> >      >>> On Jun 10, 2014, at 10:25 AM, Nathan Hjelm <hjelmn_at_[hidden]> wrote:
> >> >      >>>
> >> >      >>>> On Tue, Jun 10, 2014 at 12:10:28AM +0000, Jeff Squyres (jsquyres)
> >> >      wrote:
> >> >      >>>>> I seem to recall that you have an IB-based cluster, right?
> >> >      >>>>>
> >> >      >>>>> From a *very quick* glance at the code, it looks like this might
> >> >      be a simple incorrect-finalization issue.  That is:
> >> >      >>>>>
> >> >      >>>>> - you run the job on a single server
> >> >      >>>>> - openib disqualifies itself because you're running on a single
> >> >      >>>>> server
> >> >      >>>>> - openib then goes to finalize/close itself
> >> >      >>>>> - but openib didn't fully initialize itself (because it
> >> >      >>>>> disqualified itself early in the initialization process), and
> >> >      >>>>> something in the finalization process didn't take that into
> >> >      >>>>> account
> >> >      >>>>>
> >> >      >>>>> Nathan -- is that anywhere close to correct?
> >> >      >>>>
> >> >      >>>> Nope. udcm_module_finalize is being called because there was an
> >> >      >>>> error setting up the udcm state. See btl_openib_connect_udcm.c:476.
> >> >      >>>> The opal_list_t destructor is getting an assert failure. Probably
> >> >      >>>> because the constructor wasn't called. I can rearrange the
> >> >      >>>> constructors to be called first but there appears to be a deeper
> >> >      >>>> issue with the user's
> >> >      >>>> system: udcm_module_init should not be failing! It creates a
> >> >      >>>> couple of CQs, allocates a small number of registered buffers and
> >> >      >>>> starts monitoring the fd for the completion channel. All these
> >> >      >>>> things are also done in the setup of the openib btl itself. Keep
> >> >      >>>> in mind that the openib btl will not disqualify itself when running
> >> >      single server.
> >> >      >>>> Openib may be used to communicate on node and is needed for the
> >> >      dynamics case.
> >> >      >>>>
> >> >      >>>> The user might try adding -mca btl_base_verbose 100 to shed some
> >> >      >>>> light on what the real issue is.
> >> >      >>>>
> >> >      >>>> BTW, I no longer monitor the user mailing list. If something needs
> >> >      >>>> my attention forward it to me directly.
> >> >      >>>>
> >> >      >>>> -Nathan
> >> >      >>>
> >> >      >>>
> >> >      >>> --
> >> >      >>> Jeff Squyres
> >> >      >>> jsquyres_at_[hidden]
> >> >      >>> For corporate legal information go to:
> >> >      >>> http://www.cisco.com/web/about/doing_business/legal/cri/
> >> >      >>>
> >> >      >>> _______________________________________________
> >> >      >>> users mailing list
> >> >      >>> users_at_[hidden]
> >> >      >>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >> >      >>>
> >> >      >>>
> >> >      >>
> >> >      >>> Core was generated by `ring_c'.
> >> >      >>> Program terminated with signal 6, Aborted.
> >> >      >>> #0  0x00007f8b6ae1cb55 in raise () from /lib64/libc.so.6
> >> >      >>> #0  0x00007f8b6ae1cb55 in raise () from /lib64/libc.so.6
> >> >      >>> #1  0x00007f8b6ae1e0c5 in abort () from /lib64/libc.so.6
> >> >      >>> #2  0x00007f8b6ae15a10 in __assert_fail () from /lib64/libc.so.6
> >> >      >>> #3  0x00007f8b664b684b in udcm_module_finalize (btl=0x717060,
> >> >      >>> cpc=0x7190c0) at
> >> >      >>> ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_
> >> >      >>> co
> >> >      >>> nnect_udcm.c:734
> >> >      >>> #4  0x00007f8b664b5474 in udcm_component_query (btl=0x717060,
> >> >      >>> cpc=0x718a48) at
> >> >      >>> ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_
> >> >      >>> co
> >> >      >>> nnect_udcm.c:476
> >> >      >>> #5  0x00007f8b664ae316 in
> >> >      >>> ompi_btl_openib_connect_base_select_for_local_port (btl=0x717060) at
> >> >      >>> ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_
> >> >      >>> co
> >> >      >>> nnect_base.c:273
> >> >      >>> #6  0x00007f8b66497817 in btl_openib_component_init
> >> >      (num_btl_modules=0x7fffe34cebe0, enable_progress_threads=false,
> >> >      enable_mpi_threads=false)
> >> >      >>>    at
> >> >      >>>
> >> >      ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.
> >> >      >>> c:2703
> >> >      >>> #7  0x00007f8b6b43fa5e in mca_btl_base_select
> >> >      >>> (enable_progress_threads=false, enable_mpi_threads=false) at
> >> >      >>> ../../../../openmpi-1.8.1/ompi/mca/btl/base/btl_base_select.c:108
> >> >      >>> #8  0x00007f8b666d9d42 in mca_bml_r2_component_init
> >> >      (priority=0x7fffe34cecb4, enable_progress_threads=false,
> >> >      enable_mpi_threads=false)
> >> >      >>>    at
> >> >      >>> ../../../../../openmpi-1.8.1/ompi/mca/bml/r2/bml_r2_component.c:88
> >> >      >>> #9  0x00007f8b6b43ed1b in mca_bml_base_init
> >> >      >>> (enable_progress_threads=false, enable_mpi_threads=false) at
> >> >      >>> ../../../../openmpi-1.8.1/ompi/mca/bml/base/bml_base_init.c:69
> >> >      >>> #10 0x00007f8b655ff739 in mca_pml_ob1_component_init
> >> >      (priority=0x7fffe34cedf0, enable_progress_threads=false,
> >> >      enable_mpi_threads=false)
> >> >      >>>    at
> >> >      >>> ../../../../../openmpi-1.8.1/ompi/mca/pml/ob1/pml_ob1_component.c:27
> >> >      >>> 1
> >> >      >>> #11 0x00007f8b6b4659b2 in mca_pml_base_select
> >> >      >>> (enable_progress_threads=false, enable_mpi_threads=false) at
> >> >      >>> ../../../../openmpi-1.8.1/ompi/mca/pml/base/pml_base_select.c:128
> >> >      >>> #12 0x00007f8b6b3d233c in ompi_mpi_init (argc=1,
> >> >      >>> argv=0x7fffe34cf0e8, requested=0, provided=0x7fffe34cef98) at
> >> >      >>> ../../openmpi-1.8.1/ompi/runtime/ompi_mpi_init.c:604
> >> >      >>> #13 0x00007f8b6b407386 in PMPI_Init (argc=0x7fffe34cefec,
> >> >      >>> argv=0x7fffe34cefe0) at pinit.c:84
> >> >      >>> #14 0x000000000040096f in main (argc=1, argv=0x7fffe34cf0e8) at
> >> >      >>> ring_c.c:19
> >> >      >>>
> >> >      >>
> >> >      >>> [binf316:24591] mca: base: components_register: registering btl
> >> >      >>> components [binf316:24591] mca: base: components_register: found
> >> >      >>> loaded component openib [binf316:24592] mca: base:
> >> >      >>> components_register: registering btl components [binf316:24592] mca:
> >> >      >>> base: components_register: found loaded component openib
> >> >      >>> [binf316:24591] mca: base: components_register: component openib
> >> >      >>> register function successful [binf316:24591] mca: base:
> >> >      >>> components_register: found loaded component self [binf316:24591]
> >> >      mca:
> >> >      >>> base: components_register: component self register function
> >> >      >>> successful [binf316:24591] mca: base: components_open: opening btl
> >> >      >>> components [binf316:24591] mca: base: components_open: found loaded
> >> >      >>> component openib [binf316:24591] mca: base: components_open:
> >> >      >>> component openib open function successful [binf316:24591] mca: base:
> >> >      components_open:
> >> >      >>> found loaded component self [binf316:24591] mca: base:
> >> >      >>> components_open: component self open function successful
> >> >      >>> [binf316:24592] mca: base: components_register: component openib
> >> >      >>> register function successful [binf316:24592] mca: base:
> >> >      >>> components_register: found loaded component self [binf316:24592]
> >> >      mca:
> >> >      >>> base: components_register: component self register function
> >> >      >>> successful [binf316:24592] mca: base: components_open: opening btl
> >> >      >>> components [binf316:24592] mca: base: components_open: found loaded
> >> >      >>> component openib [binf316:24592] mca: base: components_open:
> >> >      >>> component openib open function successful [binf316:24592] mca: base:
> >> >      components_open:
> >> >      >>> found loaded component self [binf316:24592] mca: base:
> >> >      >>> components_open: component self open function successful
> >> >      >>> [binf316:24591] select: initializing btl component openib
> >> >      >>> [binf316:24592] select: initializing btl component openib
> >> >      >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/ope
> >> >      >>> ni b/btl_openib_ip.c:364:add_rdma_addr] Adding addr 9.9.10.75
> >> >      >>> (0x4b0a0909) subnet 0x9090000 as mlx4_0:1
> >> >      >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/ope
> >> >      >>> ni b/btl_openib_ip.c:364:add_rdma_addr] Adding addr 9.9.10.75
> >> >      >>> (0x4b0a0909) subnet 0x9090000 as mlx4_0:1
> >> >      >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/ope
> >> >      >>> ni b/btl_openib_component.c:686:init_one_port] looking for mlx4_0:1
> >> >      >>> GID index 0
> >> >      >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/ope
> >> >      >>> ni b/btl_openib_component.c:717:init_one_port] my IB subnet_id for
> >> >      >>> HCA
> >> >      >>> mlx4_0 port 1 is fe80000000000000
> >> >      >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/ope
> >> >      >>> ni b/btl_openib_component.c:1294:setup_qps] pp: rd_num is 256 rd_low
> >> >      >>> is
> >> >      >>> 192 rd_win 128 rd_rsv 4
> >> >      >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/ope
> >> >      >>> ni b/btl_openib_component.c:1339:setup_qps] srq: rd_num is 1024
> >> >      >>> rd_low is
> >> >      >>> 1008 sd_max is 64 rd_max is 256 srq_limit is 48
> >> >      >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/ope
> >> >      >>> ni b/btl_openib_component.c:1339:setup_qps] srq: rd_num is 1024
> >> >      >>> rd_low is
> >> >      >>> 1008 sd_max is 64 rd_max is 256 srq_limit is 48
> >> >      >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/ope
> >> >      >>> ni b/btl_openib_component.c:1339:setup_qps] srq: rd_num is 1024
> >> >      >>> rd_low is
> >> >      >>> 1008 sd_max is 64 rd_max is 256 srq_limit is 48
> >> >      >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/ope
> >> >      >>> ni
> >> >      >>> b/connect/btl_openib_connect_rdmacm.c:1840:rdmacm_component_query]
> >> >      >>> rdmacm_component_query
> >> >      >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/ope
> >> >      >>> ni b/btl_openib_ip.c:132:mca_btl_openib_rdma_get_ipv4addr] Looking
> >> >      >>> for
> >> >      >>> mlx4_0:1 in IP address list
> >> >      >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/ope
> >> >      >>> ni b/btl_openib_ip.c:141:mca_btl_openib_rdma_get_ipv4addr] FOUND:
> >> >      >>> mlx4_0:1 is 9.9.10.75 (0x4b0a0909)
> >> >      >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/ope
> >> >      >>> ni b/connect/btl_openib_connect_rdmacm.c:1750:ipaddrcheck] Found
> >> >      >>> device
> >> >      >>> mlx4_0:1 = IP address 9.9.10.75 (0x4b0a0909):51845
> >> >      >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/ope
> >> >      >>> ni b/connect/btl_openib_connect_rdmacm.c:1776:ipaddrcheck] creating
> >> >      >>> new server to listen on 9.9.10.75 (0x4b0a0909):51845 [binf316:24591]
> >> >      >>> openib BTL: rdmacm CPC available for use on mlx4_0:1
> >> >      >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/ope
> >> >      >>> ni b/connect/btl_openib_connect_udcm.c:542:udcm_module_init] created
> >> >      >>> cpc module 0x719220 for btl 0x716ee0
> >> >      >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/ope
> >> >      >>> ni b/btl_openib_component.c:686:init_one_port] looking for mlx4_0:1
> >> >      >>> GID index 0
> >> >      >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/ope
> >> >      >>> ni b/btl_openib_component.c:717:init_one_port] my IB subnet_id for
> >> >      >>> HCA
> >> >      >>> mlx4_0 port 1 is fe80000000000000
> >> >      >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/ope
> >> >      >>> ni b/connect/btl_openib_connect_udcm.c:565:udcm_module_init] error
> >> >      >>> creating ud send completion queue
> >> >      >>> ring_c:
> >> >      ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734:
> >> >      udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL)
> >> >      == ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
> >> >      >>> [binf316:24591] *** Process received signal *** [binf316:24591]
> >> >      >>> Signal: Aborted (6) [binf316:24591] Signal code:  (-6)
> >> >      >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/ope
> >> >      >>> ni b/btl_openib_component.c:1294:setup_qps] pp: rd_num is 256 rd_low
> >> >      >>> is
> >> >      >>> 192 rd_win 128 rd_rsv 4
> >> >      >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/ope
> >> >      >>> ni b/btl_openib_component.c:1339:setup_qps] srq: rd_num is 1024
> >> >      >>> rd_low is
> >> >      >>> 1008 sd_max is 64 rd_max is 256 srq_limit is 48
> >> >      >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/ope
> >> >      >>> ni b/btl_openib_component.c:1339:setup_qps] srq: rd_num is 1024
> >> >      >>> rd_low is
> >> >      >>> 1008 sd_max is 64 rd_max is 256 srq_limit is 48
> >> >      >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/ope
> >> >      >>> ni b/btl_openib_component.c:1339:setup_qps] srq: rd_num is 1024
> >> >      >>> rd_low is
> >> >      >>> 1008 sd_max is 64 rd_max is 256 srq_limit is 48
> >> >      >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/ope
> >> >      >>> ni
> >> >      >>> b/connect/btl_openib_connect_rdmacm.c:1840:rdmacm_component_query]
> >> >      >>> rdmacm_component_query
> >> >      >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/ope
> >> >      >>> ni b/btl_openib_ip.c:132:mca_btl_openib_rdma_get_ipv4addr] Looking
> >> >      >>> for
> >> >      >>> mlx4_0:1 in IP address list
> >> >      >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/ope
> >> >      >>> ni b/btl_openib_ip.c:141:mca_btl_openib_rdma_get_ipv4addr] FOUND:
> >> >      >>> mlx4_0:1 is 9.9.10.75 (0x4b0a0909)
> >> >      >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/ope
> >> >      >>> ni b/connect/btl_openib_connect_rdmacm.c:1750:ipaddrcheck] Found
> >> >      >>> device
> >> >      >>> mlx4_0:1 = IP address 9.9.10.75 (0x4b0a0909):57734
> >> >      >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/ope
> >> >      >>> ni b/connect/btl_openib_connect_rdmacm.c:1776:ipaddrcheck] creating
> >> >      >>> new server to listen on 9.9.10.75 (0x4b0a0909):57734 [binf316:24592]
> >> >      >>> openib BTL: rdmacm CPC available for use on mlx4_0:1
> >> >      >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/ope
> >> >      >>> ni b/connect/btl_openib_connect_udcm.c:542:udcm_module_init] created
> >> >      >>> cpc module 0x7190c0 for btl 0x717060
> >> >      >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/ope
> >> >      >>> ni b/connect/btl_openib_connect_udcm.c:565:udcm_module_init] error
> >> >      >>> creating ud send completion queue
> >> >      >>> ring_c:
> >> >      ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734:
> >> >      udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL)
> >> >      == ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
> >> >      >>> [binf316:24592] *** Process received signal *** [binf316:24592]
> >> >      >>> Signal: Aborted (6) [binf316:24592] Signal code:  (-6)
> >> >      >>> [binf316:24591] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7fb35959c7c0]
> >> >      >>> [binf316:24591] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7fb359248b55]
> >> >      >>> [binf316:24591] [ 2] /lib64/libc.so.6(abort+0x181)[0x7fb35924a131]
> >> >      >>> [binf316:24591] [ 3]
> >> >      >>> /lib64/libc.so.6(__assert_fail+0xf0)[0x7fb359241a10]
> >> >      >>> [binf316:24591] [ 4]
> >> >      >>> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_
> >> >      >>> bt l_openib.so(+0x3784b)[0x7fb3548e284b]
> >> >      >>> [binf316:24591] [ 5]
> >> >      >>> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_
> >> >      >>> bt l_openib.so(+0x36474)[0x7fb3548e1474]
> >> >      >>> [binf316:24591] [ 6]
> >> >      >>> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_
> >> >      >>> bt
> >> >      >>> l_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b
> >> >      >>> )[ 0x7fb3548da316] [binf316:24591] [ 7]
> >> >      >>> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_
> >> >      >>> bt l_openib.so(+0x18817)[0x7fb3548c3817]
> >> >      >>> [binf316:24591] [ 8] [binf316:24592] [ 0]
> >> >      >>> /lib64/libpthread.so.0(+0xf7c0)[0x7f8b6b1707c0]
> >> >      >>> [binf316:24592] [ 1]
> >> >      >>> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(
> >> >      >>> mc a_btl_base_select+0x1b2)[0x7fb35986ba5e]
> >> >      >>> [binf316:24591] [ 9] /lib64/libc.so.6(gsignal+0x35)[0x7f8b6ae1cb55]
> >> >      >>> [binf316:24592] [ 2]
> >> >      >>> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_
> >> >      >>> bm l_r2.so(mca_bml_r2_component_init+0x20)[0x7fb354b05d42]
> >> >      >>> [binf316:24591] [10] /lib64/libc.so.6(abort+0x181)[0x7f8b6ae1e131]
> >> >      >>> [binf316:24592] [ 3]
> >> >      >>> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(
> >> >      >>> mc a_bml_base_init+0xd6)[0x7fb35986ad1b]
> >> >      >>> [binf316:24591] [11]
> >> >      >>> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_
> >> >      >>> pm l_ob1.so(+0x7739)[0x7fb353a2b739] [binf316:24591] [12]
> >> >      >>> /lib64/libc.so.6(__assert_fail+0xf0)[0x7f8b6ae15a10]
> >> >      >>> [binf316:24592] [ 4]
> >> >      >>> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_
> >> >      >>> bt l_openib.so(+0x3784b)[0x7f8b664b684b]
> >> >      >>> [binf316:24592] [ 5]
> >> >      >>> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_
> >> >      >>> bt l_openib.so(+0x36474)[0x7f8b664b5474]
> >> >      >>> [binf316:24592] [ 6]
> >> >      >>> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(
> >> >      >>> mc a_pml_base_select+0x26e)[0x7fb3598919b2]
> >> >      >>> [binf316:24591] [13]
> >> >      >>> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_
> >> >      >>> bt
> >> >      >>> l_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b
> >> >      >>> )[ 0x7f8b664ae316] [binf316:24592] [ 7]
> >> >      >>> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_
> >> >      >>> bt l_openib.so(+0x18817)[0x7f8b66497817]
> >> >      >>> [binf316:24592] [ 8]
> >> >      >>> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(
> >> >      >>> om
> >> >      >>> pi_mpi_init+0x5f6)[0x7fb3597fe33c]
> >> >      >>> [binf316:24591] [14]
> >> >      >>> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(
> >> >      >>> mc a_btl_base_select+0x1b2)[0x7f8b6b43fa5e]
> >> >      >>> [binf316:24592] [ 9]
> >> >      >>> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_
> >> >      >>> bm l_r2.so(mca_bml_r2_component_init+0x20)[0x7f8b666d9d42]
> >> >      >>> [binf316:24592] [10]
> >> >      >>> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(
> >> >      >>> MP
> >> >      >>> I_Init+0x17e)[0x7fb359833386]
> >> >      >>> [binf316:24591] [15] ring_c[0x40096f] [binf316:24591] [16]
> >> >      >>> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(
> >> >      >>> mc a_bml_base_init+0xd6)[0x7f8b6b43ed1b]
> >> >      >>> [binf316:24592] [11]
> >> >      >>> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_
> >> >      >>> pm l_ob1.so(+0x7739)[0x7f8b655ff739] [binf316:24592] [12]
> >> >      >>> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(
> >> >      >>> mc a_pml_base_select+0x26e)[0x7f8b6b4659b2]
> >> >      >>> [binf316:24592] [13]
> >> >      >>> /lib64/libc.so.6(__libc_start_main+0xe6)[0x7fb359234c36]
> >> >      >>> [binf316:24591] [17] ring_c[0x400889] [binf316:24591] *** End of
> >> >      >>> error message ***
> >> >      >>> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(
> >> >      >>> om
> >> >      >>> pi_mpi_init+0x5f6)[0x7f8b6b3d233c]
> >> >      >>> [binf316:24592] [14]
> >> >      >>> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(
> >> >      >>> MP
> >> >      >>> I_Init+0x17e)[0x7f8b6b407386]
> >> >      >>> [binf316:24592] [15] ring_c[0x40096f] [binf316:24592] [16]
> >> >      >>> /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f8b6ae08c36]
> >> >      >>> [binf316:24592] [17] ring_c[0x400889] [binf316:24592] *** End of
> >> >      >>> error message ***
> >> >      >>> --------------------------------------------------------------------
> >> >      >>> --
> >> >      >>> ---- mpirun noticed that process rank 0 with PID 24591 on node
> >> >      >>> xxxx316 exited on signal 6 (Aborted).
> >> >      >>> --------------------------------------------------------------------
> >> >      >>> --
> >> >      >>> ----
> >> >      >>
> >> >      >>
> >> >      >
> >> >
> >> >      --
> >> >      Jeff Squyres
> >> >      jsquyres_at_[hidden]
> >> >      For corporate legal information go to:
> >> >      http://www.cisco.com/web/about/doing_business/legal/cri/
> >> >
> >> 
> > 
> 
> 
> -- 
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
> 