Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] mlx4 error - looking for guidance
From: Pavel Shamis (Pasha) (pashash_at_[hidden])
Date: 2009-03-05 17:11:33


Do you have the same HCA adapter type on all of your machines?
In the error log I see an mlx4 error message, and mlx4 is the ConnectX
driver, but your ibv_devinfo output shows an older HCA (mthca, the
driver for the earlier InfiniHost adapters).

Pasha
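
A quick way to check this across the cluster (a minimal sketch, assuming
passwordless ssh and a hostfile with one hostname per line; the hostfile
name is a placeholder, not from this thread):

    # Report the HCA driver name, part number, and firmware on each node.
    # Differing hca_id prefixes (mthca vs. mlx4) or vendor_part_id values
    # mean the cluster mixes adapter generations.
    for host in $(cat hostfile.txt); do
        echo "== $host =="
        ssh "$host" 'ibv_devinfo | grep -E "hca_id|vendor_part_id|fw_ver"'
    done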

Jeff Layton wrote:
> Pasha,
>
> Here you go... :) Thanks for looking at this.
>
> Jeff
>
> hca_id: mthca0
> fw_ver: 4.8.200
> node_guid: 0003:ba00:0100:38ac
> sys_image_guid: 0003:ba00:0100:38af
> vendor_id: 0x02c9
> vendor_part_id: 25208
> hw_ver: 0xA0
> board_id: MT_00B0010001
> phys_port_cnt: 2
> max_mr_size: 0xffffffffffffffff
> page_size_cap: 0xfffff000
> max_qp: 64512
> max_qp_wr: 65535
> device_cap_flags: 0x00001c76
> max_sge: 59
> max_sge_rd: 0
> max_cq: 65408
> max_cqe: 131071
> max_mr: 131056
> max_pd: 32768
> max_qp_rd_atom: 4
> max_ee_rd_atom: 0
> max_res_rd_atom: 258048
> max_qp_init_rd_atom: 128
> max_ee_init_rd_atom: 0
> atomic_cap: ATOMIC_HCA (1)
> max_ee: 0
> max_rdd: 0
> max_mw: 0
> max_raw_ipv6_qp: 0
> max_raw_ethy_qp: 0
> max_mcast_grp: 8192
> max_mcast_qp_attach: 56
> max_total_mcast_qp_attach: 458752
> max_ah: 0
> max_fmr: 0
> max_srq: 960
> max_srq_wr: 65535
> max_srq_sge: 31
> max_pkeys: 64
> local_ca_ack_delay: 15
> port: 1
> state: PORT_ACTIVE (4)
> max_mtu: 2048 (4)
> active_mtu: 2048 (4)
> sm_lid: 41
> port_lid: 41
> port_lmc: 0x00
> max_msg_sz: 0x80000000
> port_cap_flags: 0x02510a6a
> max_vl_num: 8 (4)
> bad_pkey_cntr: 0x0
> qkey_viol_cntr: 0x0
> sm_sl: 0
> pkey_tbl_len: 64
> gid_tbl_len: 32
> subnet_timeout: 18
> init_type_reply: 0
> active_width: 4X (2)
> active_speed: 2.5 Gbps (1)
> phys_state: LINK_UP (5)
> GID[ 0]: fe80:0000:0000:0000:0003:ba00:0100:38ad
>
> port: 2
> state: PORT_DOWN (1)
> max_mtu: 2048 (4)
> active_mtu: 512 (2)
> sm_lid: 0
> port_lid: 0
> port_lmc: 0x00
> max_msg_sz: 0x80000000
> port_cap_flags: 0x02510a68
> max_vl_num: 8 (4)
> bad_pkey_cntr: 0x0
> qkey_viol_cntr: 0x0
> sm_sl: 0
> pkey_tbl_len: 64
> gid_tbl_len: 32
> subnet_timeout: 0
> init_type_reply: 0
> active_width: 4X (2)
> active_speed: 2.5 Gbps (1)
> phys_state: POLLING (2)
> GID[ 0]: fe80:0000:0000:0000:0003:ba00:0100:38ae
>
>
>> Jeff,
>> Can you please provide more information about your HCA type
>> (ibv_devinfo -v)?
>> Do you see this error immediately at startup, or do you get it during
>> your run?
>>
>> Thanks,
>> Pasha
>>
>> Jeff Layton wrote:
>>> Evening everyone,
>>>
>>> I'm running a CFD code over InfiniBand and I've hit an error I'm not
>>> sure about; I'm looking for some guidance on where to start looking.
>>> Here's the error:
>>>
>>> mlx4: local QP operation err (QPN 260092, WQE index 9a9e0000, vendor
>>> syndrome 6f, opcode = 5e)
>>> [0,1,6][btl_openib_component.c:1392:btl_openib_component_progress]
>>> from compute-2-0.local to: compute-2-0.local error polling HP CQ
>>> with status LOCAL QP OPERATION ERROR status number 2 for wr_id
>>> 37742320 opcode 0
>>> mpirun noticed that job rank 0 with PID 21220 on node
>>> compute-2-0.local exited on signal 15 (Terminated).
>>> 78 additional processes aborted (not shown)
>>>
>>>
>>> This is openmpi-1.2.9rc2 (sorry - need to upgrade to 1.3.0). The
>>> code works correctly for smaller cases, but when I run larger cases
>>> I get this error.
>>>
>>> I'm heading to bed but I'll check email tomorrow (sorry to sleep
>>> and run, but it's been a long day).
>>>
>>> TIA!
>>>
>>> Jeff
>>>
>>>
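
Since the failure shows up only for larger cases, one way to isolate
whether the openib BTL is at fault (a sketch, not from the original
thread; the executable name and process count are placeholders) is to
rerun the failing case over TCP:

    # Restrict Open MPI to the TCP and self BTLs, bypassing InfiniBand.
    # If the large case then completes, the problem is in the openib path
    # (HCA, firmware, or driver) rather than in the application itself.
    mpirun --mca btl tcp,self -np 80 ./cfd_app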