Open MPI User's Mailing List Archives

This page is part of a frozen web archive of this mailing list; no new mails have been added to it since July of 2016.

Subject: Re: [OMPI users] mlx4 error - looking for guidance
From: Pavel Shamis (Pasha) (pashash_at_[hidden])
Date: 2009-03-05 17:11:33


Do you have the same HCA type on all of your machines?
The error log contains an mlx4 error message, and mlx4 is the ConnectX
driver, but your ibv_devinfo output shows an older HCA (mthca0).
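
One way to check this across the cluster is to collect the ibv_devinfo
output from every node and compare the hca_id values. The following is
only a minimal Python sketch, not part of Open MPI or OFED. The function
names and the node names in the sample are hypothetical, and it assumes
that the hca_id prefix (mthca, mlx4) reflects the driver in use. How you
gather the output from each node (ssh, pdsh, a batch job) is up to you.

```python
import re
from collections import defaultdict

# Matches "hca_id: <name>" lines, tolerating leading whitespace and
# e-mail quote markers ("> ") in case the output was pasted from a mail.
HCA_ID_RE = re.compile(r"^[>\s]*hca_id:\s*(\S+)", re.MULTILINE)


def hca_ids(devinfo_text):
    """Return every hca_id value found in one node's ibv_devinfo output."""
    return HCA_ID_RE.findall(devinfo_text)


def driver_family(hca_id):
    """Strip the trailing device index: 'mthca0' -> 'mthca', 'mlx4_0' -> 'mlx4'.

    Assumption: the device name is a driver prefix followed by an index.
    """
    return re.sub(r"_?\d+$", "", hca_id)


def group_nodes_by_family(outputs_by_node):
    """Map each driver family to the sorted list of nodes that report it."""
    groups = defaultdict(list)
    for node, text in outputs_by_node.items():
        for family in sorted({driver_family(h) for h in hca_ids(text)}):
            groups[family].append(node)
    return {family: sorted(nodes) for family, nodes in groups.items()}


if __name__ == "__main__":
    # Hypothetical sample: node names and contents are made up for illustration.
    samples = {
        "node-a": "hca_id: mthca0\n",
        "node-b": "hca_id: mlx4_0\n",
    }
    groups = group_nodes_by_family(samples)
    for family, nodes in sorted(groups.items()):
        print(f"{family}: {', '.join(nodes)}")
    if len(groups) > 1:
        print("Mixed HCA types detected.")
```

If more than one family shows up, the nodes are not all using the same
HCA type, which would explain an mlx4 message on a cluster where the
node you inspected reports mthca0.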

Pasha

Jeff Layton wrote:
> Pasha,
>
> Here you go... :) Thanks for looking at this.
>
> Jeff
>
> hca_id: mthca0
> fw_ver: 4.8.200
> node_guid: 0003:ba00:0100:38ac
> sys_image_guid: 0003:ba00:0100:38af
> vendor_id: 0x02c9
> vendor_part_id: 25208
> hw_ver: 0xA0
> board_id: MT_00B0010001
> phys_port_cnt: 2
> max_mr_size: 0xffffffffffffffff
> page_size_cap: 0xfffff000
> max_qp: 64512
> max_qp_wr: 65535
> device_cap_flags: 0x00001c76
> max_sge: 59
> max_sge_rd: 0
> max_cq: 65408
> max_cqe: 131071
> max_mr: 131056
> max_pd: 32768
> max_qp_rd_atom: 4
> max_ee_rd_atom: 0
> max_res_rd_atom: 258048
> max_qp_init_rd_atom: 128
> max_ee_init_rd_atom: 0
> atomic_cap: ATOMIC_HCA (1)
> max_ee: 0
> max_rdd: 0
> max_mw: 0
> max_raw_ipv6_qp: 0
> max_raw_ethy_qp: 0
> max_mcast_grp: 8192
> max_mcast_qp_attach: 56
> max_total_mcast_qp_attach: 458752
> max_ah: 0
> max_fmr: 0
> max_srq: 960
> max_srq_wr: 65535
> max_srq_sge: 31
> max_pkeys: 64
> local_ca_ack_delay: 15
> port: 1
> state: PORT_ACTIVE (4)
> max_mtu: 2048 (4)
> active_mtu: 2048 (4)
> sm_lid: 41
> port_lid: 41
> port_lmc: 0x00
> max_msg_sz: 0x80000000
> port_cap_flags: 0x02510a6a
> max_vl_num: 8 (4)
> bad_pkey_cntr: 0x0
> qkey_viol_cntr: 0x0
> sm_sl: 0
> pkey_tbl_len: 64
> gid_tbl_len: 32
> subnet_timeout: 18
> init_type_reply: 0
> active_width: 4X (2)
> active_speed: 2.5 Gbps (1)
> phys_state: LINK_UP (5)
> GID[ 0]:
> fe80:0000:0000:0000:0003:ba00:0100:38ad
>
> port: 2
> state: PORT_DOWN (1)
> max_mtu: 2048 (4)
> active_mtu: 512 (2)
> sm_lid: 0
> port_lid: 0
> port_lmc: 0x00
> max_msg_sz: 0x80000000
> port_cap_flags: 0x02510a68
> max_vl_num: 8 (4)
> bad_pkey_cntr: 0x0
> qkey_viol_cntr: 0x0
> sm_sl: 0
> pkey_tbl_len: 64
> gid_tbl_len: 32
> subnet_timeout: 0
> init_type_reply: 0
> active_width: 4X (2)
> active_speed: 2.5 Gbps (1)
> phys_state: POLLING (2)
> GID[ 0]:
> fe80:0000:0000:0000:0003:ba00:0100:38ae
>
>
>> Jeff,
>> Can you please provide more information about your HCA type
>> (ibv_devinfo -v)?
>> Do you see this error immediately at startup, or do you get it during
>> your run?
>>
>> Thanks,
>> Pasha
>>
>> Jeff Layton wrote:
>>> Evening everyone,
>>>
>>> I'm running a CFD code on IB and have hit an error I'm not sure
>>> about. I'm looking for some guidance on where to start looking.
>>> Here's the error:
>>>
>>> mlx4: local QP operation err (QPN 260092, WQE index 9a9e0000, vendor
>>> syndrome 6f, opcode = 5e)
>>> [0,1,6][btl_openib_component.c:1392:btl_openib_component_progress]
>>> from compute-2-0.local to: compute-2-0.local error polling HP CQ
>>> with status LOCAL QP OPERATION ERROR status number 2 for wr_id
>>> 37742320 opcode 0
>>> mpirun noticed that job rank 0 with PID 21220 on node
>>> compute-2-0.local exited on signal 15 (Terminated).
>>> 78 additional processes aborted (not shown)
>>>
>>>
>>> This is openmpi-1.2.9rc2 (sorry - need to upgrade to 1.3.0). The
>>> code works correctly for smaller cases, but when I run larger cases
>>> I get this error.
>>>
>>> I'm heading to bed but I'll check email tomorrow (sorry to sleep and
>>> run but it's been a long day).
>>>
>>> TIA!
>>>
>>> Jeff
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users