Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] [btl_openib_component.c:1373:btl_openib_component_progress] error polling HP CQ with -2 errno says Success
From: Pavel Shamis (Pasha) (pashash_at_[hidden])
Date: 2009-09-26 15:31:26


Very strange. MPI tries to access the CQ context and gets an immediate error.
Please make sure that your limits configuration is OK; take a look at
this FAQ - http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

Pasha.
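
The limits check suggested above can be sketched as a small shell function. This is only an illustrative sketch, not part of the original reply: the function name `report_memlock` is hypothetical, and it assumes a POSIX-style shell whose `ulimit` builtin accepts `-l` (as bash does).

```shell
# Hypothetical helper: print the node name together with the soft and hard
# locked-memory limits seen by the current shell process.
report_memlock() {
    printf '%s: soft=%s hard=%s\n' \
        "$(uname -n)" "$(ulimit -S -l)" "$(ulimit -H -l)"
}

report_memlock
```

The value reported by an interactive login shell (as in the quoted session below) may differ from what processes started by the batch system's daemons inherit, so it is probably more informative to run the same check through the job launcher on every node, for example with something like `mpirun sh -c 'ulimit -l'`.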

Charles Wright wrote:
> Hello,
> I just got some new cluster hardware :) :(
>
> I can't seem to overcome an openib problem
> I get this at run time
>
> error polling HP CQ with -2 errno says Success
>
> I've tried 2 different IB switches and multiple sets of nodes all on
> one switch or the other to try to eliminate the hardware. (IPoIB
> pings work and IB switches ree
> I've tried both v1.3.3 and v1.2.9 and get the same errors. I'm not
> really sure what these errors mean or how to get rid of them.
> My MPI application works if all the CPUs are on the same node (self btl
> only, probably)
>
> Any advice would be appreciated. Thanks.
>
> asnrcw_at_dmc:~> qsub -I -l nodes=32,partition=dmc,feature=qc226 -q sysadm
> qsub: waiting for job 232035.mds1.asc.edu to start
> qsub: job 232035.mds1.asc.edu ready
>
> ####################################################################
> # Alabama Supercomputer Center - PBS Prologue
> # Your job id is : 232035
> # Your job name is : STDIN
> # Your job's queue is : sysadm
> # Your username for this job is : asnrcw
> # Your group for this job is : analyst
> # Your job used : # 8 CPUs on dmc101
> # 8 CPUs on dmc102
> # 8 CPUs on dmc103
> # 8 CPUs on dmc104
> # Your job started at : Fri Sep 25 10:20:05 CDT 2009
> ####################################################################
> asnrcw_at_dmc101:~> asnrcw_at_dmc101:~> asnrcw_at_dmc101:~> asnrcw_at_dmc101:~>
> asnrcw_at_dmc101:~> cd mpiprintrank
> asnrcw_at_dmc101:~/mpiprintrank> which mpirun
> /apps/openmpi-1.3.3-intel/bin/mpirun
> asnrcw_at_dmc101:~/mpiprintrank> mpirun ./mpiprintrank-dmc-1.3.3-intel
> [dmc103][[46071,1],19][btl_openib_component.c:3047:poll_device] error
> polling HP CQ with -2 errno says Success
> [dmc103][[46071,1],16][btl_openib_component.c:3047:poll_device] error
> polling HP CQ with -2 errno says Success
> [dmc103][[46071,1],17][btl_openib_component.c:3047:poll_device] error
> polling HP CQ with -2 errno says Success
> [dmc103][[46071,1],18][btl_openib_component.c:3047:poll_device] error
> polling HP CQ with -2 errno says Success
> [dmc103][[46071,1],20][btl_openib_component.c:3047:poll_device] error
> polling HP CQ with -2 errno says Success
> [dmc103][[46071,1],21][btl_openib_component.c:3047:poll_device] error
> polling HP CQ with -2 errno says Success
> [dmc103][[46071,1],23][btl_openib_component.c:3047:poll_device] error
> polling HP CQ with -2 errno says Success
> [dmc101][[46071,1],6][btl_openib_component.c:3047:poll_device]
> [dmc102][[46071,1],14][btl_openib_component.c:3047:poll_device] error
> polling HP CQ with -2 errno says Success
> error polling HP CQ with -2 errno says Success
> [dmc101][[46071,1],7][btl_openib_component.c:3047:poll_device] error
> polling HP CQ with -2 errno says Success
> [dmc103][[46071,1],22][btl_openib_component.c:3047:poll_device] error
> polling HP CQ with -2 errno says Success
> [dmc102][[46071,1],15][btl_openib_component.c:3047:poll_device] error
> polling HP CQ with -2 errno says Success
> [dmc102][[46071,1],11][btl_openib_component.c:3047:poll_device] error
> polling HP CQ with -2 errno says Success
> [dmc102][[46071,1],11][btl_openib_component.c:3047:poll_device]
> [dmc102][[46071,1],12][btl_openib_component.c:3047:poll_device] error
> polling HP CQ with -2 errno says Success
> [dmc102][[46071,1],12][btl_openib_component.c:3047:poll_device] error
> polling HP CQ with -2 errno says Success
> error polling HP CQ with -2 errno says Success
> [dmc101][[46071,1],3][btl_openib_component.c:3047:poll_device] error
> polling HP CQ with -2 errno says Success
> [dmc101][[46071,1],4][btl_openib_component.c:3047:poll_device]
> [dmc102][[46071,1],8][btl_openib_component.c:3047:poll_device] error
> polling HP CQ with -2 errno says Success
> [dmc101][[46071,1],0][btl_openib_component.c:3047:poll_device] error
> polling HP CQ with -2 errno says Success
> error polling HP CQ with -2 errno says Success
> [dmc102][[46071,1],15][btl_openib_component.c:3047:poll_device] error
> polling HP CQ with -2 errno says Success
> [dmc101][[46071,1],1][btl_openib_component.c:3047:poll_device] error
> polling HP CQ with -2 errno says Success
> [dmc102][[46071,1],9][btl_openib_component.c:3047:poll_device]
> [dmc102][[46071,1],14][btl_openib_component.c:3047:poll_device] error
> polling HP CQ with -2 errno says Success
> error polling HP CQ with -2 errno says Success
> [dmc102][[46071,1],9][btl_openib_component.c:3047:poll_device] error
> polling HP CQ with -2 errno says Success
> [dmc101][[46071,1],5][btl_openib_component.c:3047:poll_device] error
> polling HP CQ with -2 errno says Success
> [dmc102][[46071,1],13][btl_openib_component.c:3047:poll_device] error
> polling HP CQ with -2 errno says Success
> [dmc102][[46071,1],13][btl_openib_component.c:3047:poll_device]
> [dmc101][[46071,1],2][btl_openib_component.c:3047:poll_device] error
> polling HP CQ with -2 errno says Success
> error polling HP CQ with -2 errno says
> Success[dmc102][[46071,1],10][btl_openib_component.c:3047:poll_device]
> error polling HP CQ with -2 errno says Success
> [dmc102][[46071,1],10][btl_openib_component.c:3047:poll_device] error
> polling HP CQ with -2 errno says Success
> [dmc102][[46071,1],8][btl_openib_component.c:3047:poll_device] error
> polling HP CQ with -2 errno says Success
>
>
> System info:
> Compute nodes:
> http://www.supermicro.com/products/system/2U/6026/SYS-6026TT-IBXF.cfm
> Which has an integrated Mellanox Technologies MT26418 [ConnectX IB
> DDR, PCIe 2.0 5GT/s] (rev a0)
>
> asnrcw_at_dmc129:~> uname -a
> Linux dmc129 2.6.27.29-0.1-default #1 SMP 2009-08-15 17:53:59 +0200
> x86_64 x86_64 x86_64 GNU/Linux
> asnrcw_at_dmc129:~> rpm -qa | grep ofed
> ofed-doc-1.4.0-11.12
> ofed-1.4.0-11.12
> asnrcw_at_dmc129:~> cat /etc/SuSE-release
> SUSE Linux Enterprise Server 11 (x86_64)
> VERSION = 11
> PATCHLEVEL = 0
> asnrcw_at_dmc129:~>
>
>
> Subnet manager is running on a Voltaire 9024 DM Switch (firmware
> version 5.1.0)
>
>
>
> asnrcw_at_dmc129:~> ibv_devinfo
> hca_id: mlx4_0
>     fw_ver: 2.6.000
>     node_guid: 0030:48c8:b919:0000
>     sys_image_guid: 0030:48c8:b919:0003
>     vendor_id: 0x02c9
>     vendor_part_id: 26418
>     hw_ver: 0xA0
>     board_id: SM_2081000001000
>     phys_port_cnt: 1
>     port: 1
>         state: PORT_ACTIVE (4)
>         max_mtu: 2048 (4)
>         active_mtu: 2048 (4)
>         sm_lid: 1
>         port_lid: 139
>         port_lmc: 0x00
> asnrcw_at_dmc129:~> ulimit -l
> unlimited