Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Open MPI 1.4: [connect/btl_openib_connect_oob.c:463:qp_create_one] error creating qp errno says Cannot allocate memory
From: 吕慧伟 (lvhuiwei_at_[hidden])
Date: 2011-08-21 23:57:54


Thanks, YK and Pavel!
It works.

On Tue, Aug 2, 2011 at 4:52 PM, Yevgeny Kliteynik <
kliteyn_at_[hidden]> wrote:

> See this FAQ entry:
> http://www.open-mpi.org/faq/?category=openfabrics#ib-xrc
>
> -- YK
>
> On 02-Aug-11 12:27 AM, Shamis, Pavel wrote:
> > You may find some initial XRC tuning documentation here :
> >
> > https://svn.open-mpi.org/trac/ompi/ticket/1260
> >
> > Pavel (Pasha) Shamis
> > ---
> > Application Performance Tools Group
> > Computer Science and Math Division
> > Oak Ridge National Laboratory
> >
> > On Aug 1, 2011, at 11:41 AM, Yevgeny Kliteynik wrote:
> >
> >> Hi,
> >>
> >> Please try running OMPI with XRC:
> >>
> >> mpirun --mca btl openib... --mca btl_openib_receive_queues
> >> X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32
> >> ...
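If retyping that long specification on every run is inconvenient, the same
value can also be supplied through Open MPI's MCA environment variables. A
minimal sketch, assuming a bash-like shell and the launch command quoted later
in this thread:

    # Set the XRC receive queues once in the environment...
    export OMPI_MCA_btl_openib_receive_queues="X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32"
    # ...then launch as usual.
    mpirun -np 512 -machinefile machinefile.512 ./my_app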
> >>
> >> XRC (eXtended Reliable Connection) decreases the memory consumption
> >> of Open MPI by decreasing the number of QPs per machine.
> >>
> >> I'm not entirely sure that XRC is supported in OMPI 1.4, but I'm
> >> sure it is in later versions of the 1.4 series (1.4.3).
> >>
> >> BTW, I do know that the command line is extremely user friendly
> >> and completely intuitive... :-)
> >> I'll have an XRC entry on the OMPI FAQ web page in a day or two,
> >> where you can find more details about this issue.
> >>
> >> OMPI FAQ: http://www.open-mpi.org/faq/?category=openfabrics
> >>
> >> -- YK
> >>
> >> On 28-Jul-11 7:53 AM, 吕慧伟 wrote:
> >>> Dear all,
> >>>
> >>> I have encountered a problem running large jobs on an SMP cluster
> >>> with Open MPI 1.4.
> >>> The application needs all-to-all communication: each process sends
> >>> messages to all other processes via MPI_Isend. It runs fine with 256
> >>> processes; the problem occurs when the process count is >= 512.
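A minimal sketch of that communication pattern (not the poster's actual
application; the message contents, sizes, and tag below are placeholders):
every rank posts an MPI_Isend to every other rank and then receives one
message from each peer.

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, size, i;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int *sendbuf = malloc(size * sizeof(int));
        int *recvbuf = malloc(size * sizeof(int));
        MPI_Request *reqs = malloc(size * sizeof(MPI_Request));

        for (i = 0; i < size; i++) {
            sendbuf[i] = rank;
            reqs[i] = MPI_REQUEST_NULL;
            if (i != rank)   /* post a send to every other rank */
                MPI_Isend(&sendbuf[i], 1, MPI_INT, i, 0,
                          MPI_COMM_WORLD, &reqs[i]);
        }
        for (i = 0; i < size; i++) {
            if (i != rank)   /* receive one message from every other rank */
                MPI_Recv(&recvbuf[i], 1, MPI_INT, i, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        MPI_Waitall(size, reqs, MPI_STATUSES_IGNORE);

        free(sendbuf); free(recvbuf); free(reqs);
        MPI_Finalize();
        return 0;
    }

With the openib BTL this pattern ends up connecting every pair of processes,
so the number of QPs grows roughly with the square of the process count.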
> >>>
> >>> The error message is:
> >>> mpirun -np 512 -machinefile machinefile.512 ./my_app
> >>>
> >>> [gh30][[23246,1],311][connect/btl_openib_connect_oob.c:463:qp_create_one] error creating qp errno says Cannot allocate memory
> >>> ...
> >>>
> >>> [gh26][[23246,1],106][connect/btl_openib_connect_oob.c:809:rml_recv_cb] error in endpoint reply start connect
> >>>
> >>> [gh26][[23246,1],117][connect/btl_openib_connect_oob.c:463:qp_create_one] error creating qp errno says Cannot allocate memory
> >>> ...
> >>> mpirun has exited due to process rank 424 with PID 26841 on node gh31 exiting without calling "finalize".
> >>>
> >>> A related post (http://www.open-mpi.org/community/lists/users/2009/07/9786.php)
> >>> suggests the job may be running out of HCA QP resources. So I checked my system
> >>> configuration with 'ibv_devinfo -v' and got 'max_qp: 261056'. In my case, running
> >>> with 256 processes would be under the limit (256^2 = 65536 < 261056), but 512^2 =
> >>> 262144 > 261056.
> >>> My question is: how can I increase the InfiniBand max_qp limit, or how can I get
> >>> around this problem in MPI?
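For reference, the back-of-the-envelope arithmetic above can be written out as
a tiny check. This sketch assumes roughly one QP per process pair, as in the
post; the actual number of QPs per connection depends on the
btl_openib_receive_queues setting.

    #include <stdio.h>

    int main(void)
    {
        const long max_qp = 261056;          /* from ibv_devinfo -v below */
        const long nprocs[] = { 256, 512 };

        for (int i = 0; i < 2; i++) {
            long needed = nprocs[i] * nprocs[i];  /* ~one QP per process pair */
            printf("%ld processes: ~%ld QPs, limit %ld -> %s\n",
                   nprocs[i], needed, max_qp,
                   needed <= max_qp ? "within limit" : "exceeds limit");
        }
        return 0;
    }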
> >>>
> >>> Thanks in advance for any help you may give!
> >>>
> >>> Huiwei Lv
> >>> PhD Student at Institute of Computing Technology
> >>>
> >>> -------------------------
> >>> p.s. The system information is provided below:
> >>> $ ompi_info -v ompi full --parsable
> >>> ompi:version:full:1.4
> >>> ompi:version:svn:r22285
> >>> ompi:version:release_date:Dec 08, 2009
> >>> $ uname -a
> >>> Linux gh26 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64 x86_64 GNU/Linux
> >>> $ ulimit -l
> >>> unlimited
> >>> $ ibv_devinfo -v
> >>> hca_id: mlx4_0
> >>> transport: InfiniBand (0)
> >>> fw_ver: 2.7.000
> >>> node_guid: 00d2:c910:0001:b6c0
> >>> sys_image_guid: 00d2:c910:0001:b6c3
> >>> vendor_id: 0x02c9
> >>> vendor_part_id: 26428
> >>> hw_ver: 0xB0
> >>> board_id: MT_0D20110009
> >>> phys_port_cnt: 1
> >>> max_mr_size: 0xffffffffffffffff
> >>> page_size_cap: 0xfffffe00
> >>> max_qp: 261056
> >>> max_qp_wr: 16351
> >>> device_cap_flags: 0x00fc9c76
> >>> max_sge: 32
> >>> max_sge_rd: 0
> >>> max_cq: 65408
> >>> max_cqe: 4194303
> >>> max_mr: 524272
> >>> max_pd: 32764
> >>> max_qp_rd_atom: 16
> >>> max_ee_rd_atom: 0
> >>> max_res_rd_atom: 4176896
> >>> max_qp_init_rd_atom: 128
> >>> max_ee_init_rd_atom: 0
> >>> atomic_cap: ATOMIC_HCA (1)
> >>> max_ee: 0
> >>> max_rdd: 0
> >>> max_mw: 0
> >>> max_raw_ipv6_qp: 0
> >>> max_raw_ethy_qp: 1
> >>> max_mcast_grp: 8192
> >>> max_mcast_qp_attach: 56
> >>> max_total_mcast_qp_attach: 458752
> >>> max_ah: 0
> >>> max_fmr: 0
> >>> max_srq: 65472
> >>> max_srq_wr: 16383
> >>> max_srq_sge: 31
> >>> max_pkeys: 128
> >>> local_ca_ack_delay: 15
> >>> port: 1
> >>> state: PORT_ACTIVE (4)
> >>> max_mtu: 2048 (4)
> >>> active_mtu: 2048 (4)
> >>> sm_lid: 86
> >>> port_lid: 73
> >>> port_lmc: 0x00
> >>> link_layer: IB
> >>> max_msg_sz: 0x40000000
> >>> port_cap_flags: 0x02510868
> >>> max_vl_num: 8 (4)
> >>> bad_pkey_cntr: 0x0
> >>> qkey_viol_cntr: 0x0
> >>> sm_sl: 0
> >>> pkey_tbl_len: 128
> >>> gid_tbl_len: 128
> >>> subnet_timeout: 18
> >>> init_type_reply: 0
> >>> active_width: 4X (2)
> >>> active_speed: 10.0 Gbps (4)
> >>> phys_state: LINK_UP (5)
> >>> GID[ 0]: fe80:0000:0000:0000:00d2:c910:0001:b6c1
> >>>
> >>> Related threads in the list:
> >>> http://www.open-mpi.org/community/lists/users/2009/07/9786.php
> >>> http://www.open-mpi.org/community/lists/users/2009/08/10456.php