Open MPI User's Mailing List Archives

Subject: [OMPI users] [btl_openib_component.c:1373:btl_openib_component_progress] error polling HP CQ with -2 errno says Success
From: Charles Wright (charles_at_[hidden])
Date: 2009-09-25 12:13:10


Hello,
    I just got some new cluster hardware :) :(

I can't seem to overcome an openib problem.
I get this at run time:

        error polling HP CQ with -2 errno says Success

I've tried 2 different IB switches and multiple sets of nodes, all on one
switch or the other, to try to rule out the hardware. (IPoIB pings
work and IB switches ree
I've tried both v1.3.3 and v1.2.9 and get the same errors. I'm not
really sure what these errors mean or how to get rid of them.
My MPI application works if all the CPUs are on the same node (probably
self btl only).

Any advice would be appreciated.
Thanks.

asnrcw_at_dmc:~> qsub -I -l nodes=32,partition=dmc,feature=qc226 -q sysadm
qsub: waiting for job 232035.mds1.asc.edu to start
qsub: job 232035.mds1.asc.edu ready

####################################################################
# Alabama Supercomputer Center - PBS Prologue
# Your job id is : 232035
# Your job name is : STDIN
# Your job's queue is : sysadm
# Your username for this job is : asnrcw
# Your group for this job is : analyst
# Your job used :
# 8 CPUs on dmc101
# 8 CPUs on dmc102
# 8 CPUs on dmc103
# 8 CPUs on dmc104
# Your job started at : Fri Sep 25 10:20:05 CDT 2009
####################################################################
asnrcw_at_dmc101:~> cd mpiprintrank
asnrcw_at_dmc101:~/mpiprintrank> which mpirun
/apps/openmpi-1.3.3-intel/bin/mpirun
asnrcw_at_dmc101:~/mpiprintrank> mpirun ./mpiprintrank-dmc-1.3.3-intel
[dmc103][[46071,1],19][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc103][[46071,1],16][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc103][[46071,1],17][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc103][[46071,1],18][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc103][[46071,1],20][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc103][[46071,1],21][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc103][[46071,1],23][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc101][[46071,1],6][btl_openib_component.c:3047:poll_device] [dmc102][[46071,1],14][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
error polling HP CQ with -2 errno says Success
[dmc101][[46071,1],7][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc103][[46071,1],22][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],15][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],11][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],11][btl_openib_component.c:3047:poll_device] [dmc102][[46071,1],12][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],12][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
error polling HP CQ with -2 errno says Success
[dmc101][[46071,1],3][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc101][[46071,1],4][btl_openib_component.c:3047:poll_device] [dmc102][[46071,1],8][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc101][[46071,1],0][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],15][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc101][[46071,1],1][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],9][btl_openib_component.c:3047:poll_device] [dmc102][[46071,1],14][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],9][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc101][[46071,1],5][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],13][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],13][btl_openib_component.c:3047:poll_device] [dmc101][[46071,1],2][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
error polling HP CQ with -2 errno says Success[dmc102][[46071,1],10][btl_openib_component.c:3047:poll_device]
error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],10][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],8][btl_openib_component.c:3047:poll_device] error polling HP CQ with -2 errno says Success
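
Because several ranks write to stderr at once, some of the lines above are interleaved, which makes it hard to see at a glance which hosts and ranks are reporting the error. The following is a minimal Python sketch, assuming the message prefix format shown above, that tallies the prefixes per host. The name `tally_cq_errors` is hypothetical and not part of Open MPI.

```python
import re
from collections import Counter

# Matches the "[host][[N,N],rank][btl_openib_component.c:LINE:poll_device]"
# prefix seen in the log above. Interleaved output can put two prefixes on
# one line, so the text is scanned with findall instead of line by line.
PREFIX = re.compile(
    r"\[(\w+)\]\[\[(\d+),(\d+)\],(\d+)\]"
    r"\[btl_openib_component\.c:\d+:poll_device\]"
)

def tally_cq_errors(log_text):
    """Return (per-host prefix counts, set of ranks) found in log_text."""
    hosts = Counter()
    ranks = set()
    for host, _first, _second, rank in PREFIX.findall(log_text):
        hosts[host] += 1
        ranks.add(int(rank))
    return hosts, ranks
```

Running it over the captured stderr of a job gives a count per host, which makes it easy to check whether every node in the allocation reports the error.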

System info:

Compute nodes:
http://www.supermicro.com/products/system/2U/6026/SYS-6026TT-IBXF.cfm
Each node has an integrated Mellanox Technologies MT26418 [ConnectX IB DDR, PCIe 2.0 5GT/s] (rev a0).

asnrcw_at_dmc129:~> uname -a
Linux dmc129 2.6.27.29-0.1-default #1 SMP 2009-08-15 17:53:59 +0200 x86_64 x86_64 x86_64 GNU/Linux

asnrcw_at_dmc129:~> rpm -qa | grep ofed
ofed-doc-1.4.0-11.12
ofed-1.4.0-11.12

asnrcw_at_dmc129:~> cat /etc/SuSE-release
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 0

The subnet manager is running on a Voltaire 9024 DM switch (firmware version 5.1.0).

asnrcw_at_dmc129:~> ibv_devinfo
hca_id: mlx4_0
        fw_ver: 2.6.000
        node_guid: 0030:48c8:b919:0000
        sys_image_guid: 0030:48c8:b919:0003
        vendor_id: 0x02c9
        vendor_part_id: 26418
        hw_ver: 0xA0
        board_id: SM_2081000001000
        phys_port_cnt: 1
                port: 1
                        state: PORT_ACTIVE (4)
                        max_mtu: 2048 (4)
                        active_mtu: 2048 (4)
                        sm_lid: 1
                        port_lid: 139
                        port_lmc: 0x00

asnrcw_at_dmc129:~> ulimit -l
unlimited
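
The `ulimit -l` value above comes from an interactive shell. The limit inherited by processes started under the batch system is not necessarily the same, so it can be worth checking it from inside a job as well. The following is a minimal Python sketch, assuming a Linux node with Python available. The helper names `memlock_limits` and `is_unlimited` are hypothetical.

```python
import resource

def memlock_limits():
    """Return the (soft, hard) locked-memory limits of this process, in bytes."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)

def is_unlimited(limits):
    """True when both the soft and the hard limit are RLIM_INFINITY."""
    return all(value == resource.RLIM_INFINITY for value in limits)

if __name__ == "__main__":
    limits = memlock_limits()
    print("memlock (soft, hard):", limits, "unlimited:", is_unlimited(limits))
```

Launching the script with mpirun on the same nodes as the failing job would show the limit that the MPI processes themselves see.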