You may find some initial XRC tuning documentation here:
https://svn.open-mpi.org/trac/ompi/ticket/1260
Pavel (Pasha) Shamis
---
Application Performance Tools Group
Computer Science and Math Division
Oak Ridge National Laboratory
On Aug 1, 2011, at 11:41 AM, Yevgeny Kliteynik wrote:
> Hi,
>
> Please try running OMPI with XRC:
>
> mpirun --mca btl openib... --mca btl_openib_receive_queues X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32 ...
>
> XRC (eXtended Reliable Connection) decreases the memory consumption
> of Open MPI by decreasing the number of QPs per machine.
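For readers who want to see how the receive_queues value in the command above is structured, here is a minimal sketch that splits it into its individual queue specifications. It assumes the value is a colon-separated list of queues, each a comma-separated record whose first field is the queue type (X for XRC) and whose second field is the buffer size in bytes. The remaining numeric fields are kept as an uninterpreted list, because their meaning is not stated in this thread. The function name and dictionary keys are hypothetical and not part of Open MPI.

```python
# Value copied verbatim from the mpirun command line in this thread.
RECEIVE_QUEUES = (
    "X,128,256,192,128:X,2048,256,128,32:"
    "X,12288,256,128,32:X,65536,256,128,32"
)


def parse_receive_queues(value):
    """Split a btl_openib_receive_queues string into one dict per queue.

    Assumption: field 0 is the queue type and field 1 is the buffer size.
    All later fields are returned as plain integers without interpretation.
    """
    queues = []
    for spec in value.split(":"):
        fields = spec.split(",")
        queues.append(
            {
                "type": fields[0],
                "buffer_size": int(fields[1]),
                "other": [int(f) for f in fields[2:]],
            }
        )
    return queues


if __name__ == "__main__":
    for queue in parse_receive_queues(RECEIVE_QUEUES):
        print(queue["type"], queue["buffer_size"], queue["other"])
```

Read this way, the command defines four XRC queues with increasing buffer sizes, so each message size class gets its own queue.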
>
> I'm not entirely sure that XRC is supported in OMPI 1.4, but I'm
> sure it is in later versions of the 1.4 series (1.4.3).
>
> BTW, I do know that the command line is extremely user friendly
> and completely intuitive... :-)
> I'll have an XRC entry on the OMPI FAQ web page in a day or two,
> where you can find more details about this issue.
>
> OMPI FAQ: hxxp://www.open-mpi.org/faq/?category=openfabrics
>
> -- YK
>
> On 28-Jul-11 7:53 AM, Huiwei Lv wrote:
>> Dear all,
>>
>> I have encountered a problem running large jobs on an SMP cluster with Open MPI 1.4.
>> The application needs all-to-all communication: each process sends messages to all other processes via MPI_Isend. It runs fine with 256 processes, but the problem occurs when the number of processes is >= 512.
>>
>> The error message is:
>> mpirun -np 512 -machinefile machinefile.512 ./my_app
>> [gh30][[23246,1],311][connect/btl_openib_connect_oob.c:463:qp_create_one] error creating qp errno says Cannot allocate memory
>> ...
>> [gh26][[23246,1],106][connect/btl_openib_connect_oob.c:809:rml_recv_cb] error in endpoint reply start connect
>> [gh26][[23246,1],117][connect/btl_openib_connect_oob.c:463:qp_create_one] error creating qp errno says Cannot allocate memory
>> ...
>> mpirun has exited due to process rank 424 with PID 26841 on
>> node gh31 exiting without calling "finalize".
>>
>> A related post (hxxp://www.open-mpi.org/community/lists/users/2009/07/9786.php) suggests it may be running out of HCA QP resources. So I checked my system configuration with 'ibv_devinfo -v' and got 'max_qp: 261056'. In my case, running with 256 processes would be under the limit: 256^2 = 65536 < 261056, but 512^2 = 262144 > 261056.
>> My question is: how can I increase the max_qp number of InfiniBand, or how can I get around this problem in MPI?
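The estimate in the paragraph above can be checked with a few lines of Python. This is only a sketch of the poster's rough model, which assumes an all-to-all job needs about nprocs^2 QPs against the single max_qp limit reported by ibv_devinfo. It is not a statement of how Open MPI actually allocates QPs. The function names are hypothetical.

```python
from math import isqrt

# max_qp as reported by 'ibv_devinfo -v' in this thread.
MAX_QP = 261056


def estimated_qps(nprocs):
    """Rough model from the post: one QP per ordered pair of processes."""
    return nprocs * nprocs


def fits(nprocs, max_qp=MAX_QP):
    """True if the rough estimate stays within the reported max_qp."""
    return estimated_qps(nprocs) <= max_qp


if __name__ == "__main__":
    for n in (256, 512):
        status = "within" if fits(n) else "over"
        print(f"{n} processes: {estimated_qps(n)} QPs, {status} the limit")
    # Largest process count whose square still fits under max_qp.
    print("largest nprocs under this model:", isqrt(MAX_QP))
```

Under this model the crossover falls just below 512 processes, which matches the observation that 256 processes run and 512 fail.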
>>
>> Thanks in advance for any help you may give!
>>
>> Huiwei Lv
>> PhD Student at Institute of Computing Technology
>>
>> -------------------------
>> p.s. The system information is provided below:
>> $ ompi_info -v ompi full --parsable
>> ompi:version:full:1.4
>> ompi:version:svn:r22285
>> ompi:version:release_date:Dec 08, 2009
>> $ uname -a
>> Linux gh26 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64 x86_64 GNU/Linux
>> $ ulimit -l
>> unlimited
>> $ ibv_devinfo -v
>> hca_id: mlx4_0
>> transport: InfiniBand (0)
>> fw_ver: 2.7.000
>> node_guid: 00d2:c910:0001:b6c0
>> sys_image_guid: 00d2:c910:0001:b6c3
>> vendor_id: 0x02c9
>> vendor_part_id: 26428
>> hw_ver: 0xB0
>> board_id: MT_0D20110009
>> phys_port_cnt: 1
>> max_mr_size: 0xffffffffffffffff
>> page_size_cap: 0xfffffe00
>> max_qp: 261056
>> max_qp_wr: 16351
>> device_cap_flags: 0x00fc9c76
>> max_sge: 32
>> max_sge_rd: 0
>> max_cq: 65408
>> max_cqe: 4194303
>> max_mr: 524272
>> max_pd: 32764
>> max_qp_rd_atom: 16
>> max_ee_rd_atom: 0
>> max_res_rd_atom: 4176896
>> max_qp_init_rd_atom: 128
>> max_ee_init_rd_atom: 0
>> atomic_cap: ATOMIC_HCA (1)
>> max_ee: 0
>> max_rdd: 0
>> max_mw: 0
>> max_raw_ipv6_qp: 0
>> max_raw_ethy_qp: 1
>> max_mcast_grp: 8192
>> max_mcast_qp_attach: 56
>> max_total_mcast_qp_attach: 458752
>> max_ah: 0
>> max_fmr: 0
>> max_srq: 65472
>> max_srq_wr: 16383
>> max_srq_sge: 31
>> max_pkeys: 128
>> local_ca_ack_delay: 15
>> port: 1
>> state: PORT_ACTIVE (4)
>> max_mtu: 2048 (4)
>> active_mtu: 2048 (4)
>> sm_lid: 86
>> port_lid: 73
>> port_lmc: 0x00
>> link_layer: IB
>> max_msg_sz: 0x40000000
>> port_cap_flags: 0x02510868
>> max_vl_num: 8 (4)
>> bad_pkey_cntr: 0x0
>> qkey_viol_cntr: 0x0
>> sm_sl: 0
>> pkey_tbl_len: 128
>> gid_tbl_len: 128
>> subnet_timeout: 18
>> init_type_reply: 0
>> active_width: 4X (2)
>> active_speed: 10.0 Gbps (4)
>> phys_state: LINK_UP (5)
>> GID[ 0]: fe80:0000:0000:0000:00d2:c910:0001:b6c1
>>
>> Related threads in the list:
>> hxxp://www.open-mpi.org/community/lists/users/2009/07/9786.php
>> hxxp://www.open-mpi.org/community/lists/users/2009/08/10456.php
>>
>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> hxxp://www.open-mpi.org/mailman/listinfo.cgi/users