Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] mlx4 error - looking for guidance
From: Jeff Layton (laytonjb_at_[hidden])
Date: 2009-03-05 16:13:37


Pasha,

Here you go... :) Thanks for looking at this.

Jeff

hca_id: mthca0
        fw_ver: 4.8.200
        node_guid: 0003:ba00:0100:38ac
        sys_image_guid: 0003:ba00:0100:38af
        vendor_id: 0x02c9
        vendor_part_id: 25208
        hw_ver: 0xA0
        board_id: MT_00B0010001
        phys_port_cnt: 2
        max_mr_size: 0xffffffffffffffff
        page_size_cap: 0xfffff000
        max_qp: 64512
        max_qp_wr: 65535
        device_cap_flags: 0x00001c76
        max_sge: 59
        max_sge_rd: 0
        max_cq: 65408
        max_cqe: 131071
        max_mr: 131056
        max_pd: 32768
        max_qp_rd_atom: 4
        max_ee_rd_atom: 0
        max_res_rd_atom: 258048
        max_qp_init_rd_atom: 128
        max_ee_init_rd_atom: 0
        atomic_cap: ATOMIC_HCA (1)
        max_ee: 0
        max_rdd: 0
        max_mw: 0
        max_raw_ipv6_qp: 0
        max_raw_ethy_qp: 0
        max_mcast_grp: 8192
        max_mcast_qp_attach: 56
        max_total_mcast_qp_attach: 458752
        max_ah: 0
        max_fmr: 0
        max_srq: 960
        max_srq_wr: 65535
        max_srq_sge: 31
        max_pkeys: 64
        local_ca_ack_delay: 15
                port: 1
                        state: PORT_ACTIVE (4)
                        max_mtu: 2048 (4)
                        active_mtu: 2048 (4)
                        sm_lid: 41
                        port_lid: 41
                        port_lmc: 0x00
                        max_msg_sz: 0x80000000
                        port_cap_flags: 0x02510a6a
                        max_vl_num: 8 (4)
                        bad_pkey_cntr: 0x0
                        qkey_viol_cntr: 0x0
                        sm_sl: 0
                        pkey_tbl_len: 64
                        gid_tbl_len: 32
                        subnet_timeout: 18
                        init_type_reply: 0
                        active_width: 4X (2)
                        active_speed: 2.5 Gbps (1)
                        phys_state: LINK_UP (5)
                        GID[ 0]: fe80:0000:0000:0000:0003:ba00:0100:38ad

                port: 2
                        state: PORT_DOWN (1)
                        max_mtu: 2048 (4)
                        active_mtu: 512 (2)
                        sm_lid: 0
                        port_lid: 0
                        port_lmc: 0x00
                        max_msg_sz: 0x80000000
                        port_cap_flags: 0x02510a68
                        max_vl_num: 8 (4)
                        bad_pkey_cntr: 0x0
                        qkey_viol_cntr: 0x0
                        sm_sl: 0
                        pkey_tbl_len: 64
                        gid_tbl_len: 32
                        subnet_timeout: 0
                        init_type_reply: 0
                        active_width: 4X (2)
                        active_speed: 2.5 Gbps (1)
                        phys_state: POLLING (2)
                        GID[ 0]: fe80:0000:0000:0000:0003:ba00:0100:38ae
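
When checking many nodes, the per-port state in a dump like the one above can be pulled out with a short script. This is only a minimal sketch: it assumes the text layout shown in this message (an `hca_id:` line, then `port:` blocks each containing a `state:` line), and the function name `port_states` and the shortened sample text are made up for illustration.

```python
import re

def port_states(devinfo_text):
    """Map (hca_id, port number) -> state name from `ibv_devinfo -v` style text.

    Assumes the layout shown in this thread; other versions may differ.
    """
    states = {}
    hca = None
    port = None
    for raw in devinfo_text.splitlines():
        line = raw.strip()
        m = re.match(r"hca_id:\s*(\S+)", line)
        if m:
            hca, port = m.group(1), None
            continue
        m = re.match(r"port:\s*(\d+)", line)
        if m:
            port = int(m.group(1))
            continue
        # Anchored at line start, so "phys_state:" is not picked up here.
        m = re.match(r"state:\s*(\S+)", line)
        if m and hca is not None and port is not None:
            states[(hca, port)] = m.group(1)
    return states

# Shortened excerpt of the output above, used as sample input.
sample = """hca_id: mthca0
        phys_port_cnt: 2
                port: 1
                        state: PORT_ACTIVE (4)
                        phys_state: LINK_UP (5)
                port: 2
                        state: PORT_DOWN (1)
                        phys_state: POLLING (2)
"""

for (hca_id, port_no), state in sorted(port_states(sample).items()):
    print(hca_id, port_no, state)
```

On a real node the input would come from the saved output of `ibv_devinfo -v` rather than from the inline sample.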

> Jeff,
> Can you please provide more information about your HCA type
> (ibv_devinfo -v)?
> Do you see this error immediately at startup, or does it appear during
> your run?
>
> Thanks,
> Pasha
>
> Jeff Layton wrote:
>> Evening everyone,
>>
>> I'm running a CFD code on IB and I've encountered an error I'm not
>> sure about and I'm looking for some guidance on where to start
>> looking. Here's the error:
>>
>> mlx4: local QP operation err (QPN 260092, WQE index 9a9e0000, vendor
>> syndrome 6f, opcode = 5e)
>> [0,1,6][btl_openib_component.c:1392:btl_openib_component_progress]
>> from compute-2-0.local to: compute-2-0.local error
>> polling HP CQ with status LOCAL QP OPERATION ERROR status number 2
>> for wr_id 37742320 opcode 0
>> mpirun noticed that job rank 0 with PID 21220 on node
>> compute-2-0.local exited on signal 15 (Terminated).
>> 78 additional processes aborted (not shown)
>>
>>
>> This is openmpi-1.2.9rc2 (sorry - need to upgrade to 1.3.0). The code
>> works correctly for smaller cases, but when I run larger cases I get
>> this error.
>>
>> I'm heading to bed but I'll check email tomorrow (sorry to sleep and run
>> but it's been a long day).
>>
>> TIA!
>>
>> Jeff
>>
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users