Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Open MPI 1.4: [connect/btl_openib_connect_oob.c:463:qp_create_one] error creating qp errno says Cannot allocate memory
From: Shamis, Pavel (shamisp_at_[hidden])
Date: 2011-08-01 17:27:04


You may find some initial XRC tuning documentation here:

https://svn.open-mpi.org/trac/ompi/ticket/1260

Pavel (Pasha) Shamis

---
Application Performance Tools Group
Computer Science and Math Division
Oak Ridge National Laboratory
On Aug 1, 2011, at 11:41 AM, Yevgeny Kliteynik wrote:
> Hi,
> 
> Please try running OMPI with XRC:
> 
>  mpirun --mca btl openib... --mca btl_openib_receive_queues X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32 ...
> 
> XRC (eXtended Reliable Connection) reduces Open MPI's memory consumption
> by decreasing the number of QPs (queue pairs) per machine.
> 
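For readers unfamiliar with the receive_queues syntax used in the command above, here is a minimal sketch (not part of the original thread) that splits the value into its individual queues. It assumes the format described in the Open MPI OpenFabrics FAQ: queues are separated by ':', fields by ',', the first field is the queue type letter, and the second is the receive buffer size in bytes. The helper name `parse_receive_queues` is hypothetical, and the meaning of the remaining numeric fields is left uninterpreted here.

```python
# Illustrative only: split a btl_openib_receive_queues value into queues.
# The spec string is copied from the mpirun command quoted in this thread.
SPEC = "X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32"

# Queue type letters (assumption: as listed in the Open MPI OpenFabrics FAQ).
QUEUE_TYPES = {"P": "per-peer", "S": "shared receive queue", "X": "XRC"}


def parse_receive_queues(spec):
    """Return a list of (type_letter, [int parameters]) tuples, one per queue."""
    queues = []
    for entry in spec.split(":"):
        fields = entry.split(",")
        kind = fields[0]
        if kind not in QUEUE_TYPES:
            raise ValueError(f"unknown queue type {kind!r} in {entry!r}")
        queues.append((kind, [int(f) for f in fields[1:]]))
    return queues


if __name__ == "__main__":
    for kind, params in parse_receive_queues(SPEC):
        # params[0] is the buffer size in bytes; the rest are tuning values.
        print(f"{QUEUE_TYPES[kind]}: buffer size {params[0]} bytes, "
              f"other parameters {params[1:]}")
```

Read this way, the command defines four XRC queues with buffer sizes of 128, 2048, 12288, and 65536 bytes.
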
> I'm not entirely sure that XRC is supported in OMPI 1.4, but I'm
> sure it is in later versions of the 1.4 series (1.4.3).
> 
> BTW, I do know that the command line is extremely user friendly
> and completely intuitive... :-)
> I'll have an XRC entry on the OMPI FAQ web page in a day or two,
> where you can find more details about this issue.
> 
> OMPI FAQ: hxxp://www.open-mpi.org/faq/?category=openfabrics
> 
> -- YK
> 
> On 28-Jul-11 7:53 AM, 吕慧伟 wrote:
>> Dear all,
>> 
>> I have encountered a problem running large jobs on an SMP cluster with Open MPI 1.4.
>> The application needs all-to-all communication: each process sends messages to all other processes via MPI_Isend. It runs fine with 256 processes; the problem occurs when the number of processes is >= 512.
>> 
>> The error message is:
>>         mpirun -np 512 -machinefile machinefile.512 ./my_app
>>         [gh30][[23246,1],311][connect/btl_openib_connect_oob.c:463:qp_create_one] error creating qp errno says Cannot allocate memory
>>         ...
>>         [gh26][[23246,1],106][connect/btl_openib_connect_oob.c:809:rml_recv_cb] error in endpoint reply start connect
>>         [gh26][[23246,1],117][connect/btl_openib_connect_oob.c:463:qp_create_one] error creating qp errno says Cannot allocate memory
>>         ...
>>         mpirun has exited due to process rank 424 with PID 26841 on
>>         node gh31 exiting without calling "finalize".
>> 
>> A related post (hxxp://www.open-mpi.org/community/lists/users/2009/07/9786.php) suggests that it may be running out of HCA QP resources. So I checked my system configuration with 'ibv_devinfo -v' and got 'max_qp: 261056'. In my case, running with 256 processes would be under the limit: 256^2 = 65536 < 261056, but 512^2 = 262144 > 261056.
>> My question is: how can I increase the max_qp value of the InfiniBand HCA, or how can I work around this problem in MPI?
>> 
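The back-of-the-envelope estimate in the question above can be written out as a short sketch (not part of the original thread). It uses the poster's own rough model of one QP per ordered pair of processes, compared against the reported max_qp. That model is an assumption taken from the post: the real QP count depends on how processes are spread across nodes and how many QPs each connection uses, neither of which is stated. The function names are hypothetical.

```python
import math

# max_qp as reported by 'ibv_devinfo -v' in the post.
MAX_QP = 261056


def estimated_qps(num_procs):
    """Poster's rough model: one QP per ordered pair of processes."""
    return num_procs ** 2


def fits_under_limit(num_procs, max_qp=MAX_QP):
    """True if the rough estimate stays within the HCA's max_qp."""
    return estimated_qps(num_procs) <= max_qp


def largest_fitting_job(max_qp=MAX_QP):
    """Largest process count whose squared estimate does not exceed max_qp."""
    return math.isqrt(max_qp)


if __name__ == "__main__":
    for n in (256, 512):
        status = "under" if fits_under_limit(n) else "over"
        print(f"{n} processes: ~{estimated_qps(n)} QPs, {status} max_qp={MAX_QP}")
    print("largest job under this model:", largest_fitting_job())
```

Under this model the estimate first exceeds max_qp just below 512 processes, which is consistent with 256 processes working and 512 failing.
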
>> Thanks in advance for any help you may give!
>> 
>> Huiwei Lv
>> PhD Student at Institute of Computing Technology
>> 
>> -------------------------
>> P.S. The system information is provided below:
>> $ ompi_info -v ompi full --parsable
>> ompi:version:full:1.4
>> ompi:version:svn:r22285
>> ompi:version:release_date:Dec 08, 2009
>> $ uname -a
>> Linux gh26 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64 x86_64 GNU/Linux
>> $ ulimit -l
>> unlimited
>> $ ibv_devinfo -v
>> hca_id: mlx4_0
>>         transport:                      InfiniBand (0)
>>         fw_ver:                         2.7.000
>>         node_guid:                      00d2:c910:0001:b6c0
>>         sys_image_guid:                 00d2:c910:0001:b6c3
>>         vendor_id:                      0x02c9
>>         vendor_part_id:                 26428
>>         hw_ver:                         0xB0
>>         board_id:                       MT_0D20110009
>>         phys_port_cnt:                  1
>>         max_mr_size:                    0xffffffffffffffff
>>         page_size_cap:                  0xfffffe00
>>         max_qp:                         261056
>>         max_qp_wr:                      16351
>>         device_cap_flags:               0x00fc9c76
>>         max_sge:                        32
>>         max_sge_rd:                     0
>>         max_cq:                         65408
>>         max_cqe:                        4194303
>>         max_mr:                         524272
>>         max_pd:                         32764
>>         max_qp_rd_atom:                 16
>>         max_ee_rd_atom:                 0
>>         max_res_rd_atom:                4176896
>>         max_qp_init_rd_atom:            128
>>         max_ee_init_rd_atom:            0
>>         atomic_cap:                     ATOMIC_HCA (1)
>>         max_ee:                         0
>>         max_rdd:                        0
>>         max_mw:                         0
>>         max_raw_ipv6_qp:                0
>>         max_raw_ethy_qp:                1
>>         max_mcast_grp:                  8192
>>         max_mcast_qp_attach:            56
>>         max_total_mcast_qp_attach:      458752
>>         max_ah:                         0
>>         max_fmr:                        0
>>         max_srq:                        65472
>>         max_srq_wr:                     16383
>>         max_srq_sge:                    31
>>         max_pkeys:                      128
>>         local_ca_ack_delay:             15
>>                 port:   1
>>                         state:                  PORT_ACTIVE (4)
>>                         max_mtu:                2048 (4)
>>                         active_mtu:             2048 (4)
>>                         sm_lid:                 86
>>                         port_lid:               73
>>                         port_lmc:               0x00
>>                         link_layer:             IB
>>                         max_msg_sz:             0x40000000
>>                         port_cap_flags:         0x02510868
>>                         max_vl_num:             8 (4)
>>                         bad_pkey_cntr:          0x0
>>                         qkey_viol_cntr:         0x0
>>                         sm_sl:                  0
>>                         pkey_tbl_len:           128
>>                         gid_tbl_len:            128
>>                         subnet_timeout:         18
>>                         init_type_reply:        0
>>                         active_width:           4X (2)
>>                         active_speed:           10.0 Gbps (4)
>>                         phys_state:             LINK_UP (5)
>>                         GID[  0]:               fe80:0000:0000:0000:00d2:c910:0001:b6c1
>> 
>> Related threads in the list:
>> hxxp://www.open-mpi.org/community/lists/users/2009/07/9786.php
>> hxxp://www.open-mpi.org/community/lists/users/2009/08/10456.php
>> 
>> 
>> 
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> hxxp://www.open-mpi.org/mailman/listinfo.cgi/users