Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] [btl_openib_component.c:1373:btl_openib_component_progress] error polling HP CQ with -2 errno says Success
From: Charles Wright (charles_at_[hidden])
Date: 2009-09-28 14:23:55


I've verified that ulimit -l is unlimited everywhere.
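
For anyone who wants to repeat that check, here is a minimal sketch. It assumes the limit that matters is the one seen by processes launched through mpirun (which can differ from an interactive login shell), and the helper name find_limited is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read "<host> <limit>" lines on stdin and print every
# host whose locked-memory limit is not "unlimited".
# Exit status is 1 if at least one such host was found, 0 otherwise.
find_limited() {
    awk '$2 != "unlimited" { print $1 ": " $2; bad = 1 } END { exit bad }'
}

# Possible usage inside a job allocation (one line per launched process):
#   mpirun sh -c 'echo "$(hostname) $(ulimit -l)"' | find_limited
```

If the pipeline prints nothing and exits with status 0, every launched process reported an unlimited locked-memory limit.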

After further testing I think the errors are related to OFED, not Open MPI.
I've uninstalled the OFED that comes with SLES (1.4.0) and installed
OFED 1.4.2 and 1.5-beta and I don't get the errors.

I got the idea to swap out OFED after reading this:
http://kerneltrap.org/mailarchive/openfabrics-general/2008/11/3/3903184

Under OFED 1.4.0 (from SLES 11) I had to set options mlx4_core msi_x=0
in /etc/modprobe.conf.local to even get the mlx4 module to load.
I found that advice here:
http://forums11.itrc.hp.com/service/forums/questionanswer.do?admit=109447626+1254161827534+28353475&threadId=1361415
(Under 1.4.2 and 1.5-Beta the modules load fine without mlx4_core
msi_x=0 being set)
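
The workaround can be applied with something like the following. This is only a sketch: the function name is hypothetical, the default path is the SLES 11 file mentioned above, and the mlx4 modules have to be reloaded (or the node rebooted) before the option takes effect.

```shell
#!/bin/sh
# Hypothetical helper: append the mlx4_core option to a modprobe config
# file unless the exact line is already present.
set_mlx4_msi_x_off() {
    conf="${1:-/etc/modprobe.conf.local}"
    line='options mlx4_core msi_x=0'
    # -x: match the whole line, -F: fixed string, -q: no output
    if ! grep -qxF "$line" "$conf" 2>/dev/null; then
        echo "$line" >> "$conf"
    fi
}
```

Calling set_mlx4_msi_x_off with no argument (as root) edits /etc/modprobe.conf.local. Running it twice leaves a single copy of the line.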

Now my problem is that with OFED 1.4.2 and 1.5-beta the systems hang,
the GigE network stops working, and I have to power-cycle nodes to log in.

I'm going to try to get some help from the OFED mailing list now.

Pavel Shamis (Pasha) wrote:
> Very strange. MPI tries to access the CQ context and gets an immediate error.
> Please make sure that your limits configuration is OK; take a look at
> this FAQ -
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>
> Pasha.
>
>
> Charles Wright wrote:
>> Hello,
>> I just got some new cluster hardware :) :(
>>
>> I can't seem to overcome an openib problem
>> I get this at run time
>>
>> error polling HP CQ with -2 errno says Success
>>
>> I've tried 2 different IB switches and multiple sets of nodes all on
>> one switch or the other to try to eliminate the hardware. (IPoIB
>> pings work and IB switches ree
>> I've tried both v1.3.3 and v1.2.9 and get the same errors. I'm not
>> really sure what these errors mean or how to get rid of them.
>> My MPI application works if all the CPUs are on the same node (self
>> btl only, probably)
>>
>> Any advice would be appreciated. Thanks.
>>
>> asnrcw_at_dmc:~> qsub -I -l nodes=32,partition=dmc,feature=qc226 -q sysadm
>> qsub: waiting for job 232035.mds1.asc.edu to start
>> qsub: job 232035.mds1.asc.edu ready
>>
>> ####################################################################
>> # Alabama Supercomputer Center - PBS Prologue
>> # Your job id is : 232035
>> # Your job name is : STDIN
>> # Your job's queue is : sysadm
>> # Your username for this job is : asnrcw
>> # Your group for this job is : analyst
>> # Your job used : # 8 CPUs on dmc101
>> # 8 CPUs on dmc102
>> # 8 CPUs on dmc103
>> # 8 CPUs on dmc104
>> # Your job started at : Fri Sep 25 10:20:05 CDT 2009
>> ####################################################################
>> asnrcw_at_dmc101:~> cd mpiprintrank
>> asnrcw_at_dmc101:~/mpiprintrank> which mpirun
>> /apps/openmpi-1.3.3-intel/bin/mpirun
>> asnrcw_at_dmc101:~/mpiprintrank> mpirun ./mpiprintrank-dmc-1.3.3-intel
>> [dmc103][[46071,1],19][btl_openib_component.c:3047:poll_device] error
>> polling HP CQ with -2 errno says Success
>> [dmc103][[46071,1],16][btl_openib_component.c:3047:poll_device] error
>> polling HP CQ with -2 errno says Success
>> [dmc103][[46071,1],17][btl_openib_component.c:3047:poll_device] error
>> polling HP CQ with -2 errno says Success
>> [dmc103][[46071,1],18][btl_openib_component.c:3047:poll_device] error
>> polling HP CQ with -2 errno says Success
>> [dmc103][[46071,1],20][btl_openib_component.c:3047:poll_device] error
>> polling HP CQ with -2 errno says Success
>> [dmc103][[46071,1],21][btl_openib_component.c:3047:poll_device] error
>> polling HP CQ with -2 errno says Success
>> [dmc103][[46071,1],23][btl_openib_component.c:3047:poll_device] error
>> polling HP CQ with -2 errno says Success
>> [dmc101][[46071,1],6][btl_openib_component.c:3047:poll_device]
>> [dmc102][[46071,1],14][btl_openib_component.c:3047:poll_device] error
>> polling HP CQ with -2 errno says Success
>> error polling HP CQ with -2 errno says Success
>> [dmc101][[46071,1],7][btl_openib_component.c:3047:poll_device] error
>> polling HP CQ with -2 errno says Success
>> [dmc103][[46071,1],22][btl_openib_component.c:3047:poll_device] error
>> polling HP CQ with -2 errno says Success
>> [dmc102][[46071,1],15][btl_openib_component.c:3047:poll_device] error
>> polling HP CQ with -2 errno says Success
>> [dmc102][[46071,1],11][btl_openib_component.c:3047:poll_device] error
>> polling HP CQ with -2 errno says Success
>> [dmc102][[46071,1],11][btl_openib_component.c:3047:poll_device]
>> [dmc102][[46071,1],12][btl_openib_component.c:3047:poll_device] error
>> polling HP CQ with -2 errno says Success
>> [dmc102][[46071,1],12][btl_openib_component.c:3047:poll_device] error
>> polling HP CQ with -2 errno says Success
>> error polling HP CQ with -2 errno says Success
>> [dmc101][[46071,1],3][btl_openib_component.c:3047:poll_device] error
>> polling HP CQ with -2 errno says Success
>> [dmc101][[46071,1],4][btl_openib_component.c:3047:poll_device]
>> [dmc102][[46071,1],8][btl_openib_component.c:3047:poll_device] error
>> polling HP CQ with -2 errno says Success
>> [dmc101][[46071,1],0][btl_openib_component.c:3047:poll_device] error
>> polling HP CQ with -2 errno says Success
>> error polling HP CQ with -2 errno says Success
>> [dmc102][[46071,1],15][btl_openib_component.c:3047:poll_device] error
>> polling HP CQ with -2 errno says Success
>> [dmc101][[46071,1],1][btl_openib_component.c:3047:poll_device] error
>> polling HP CQ with -2 errno says Success
>> [dmc102][[46071,1],9][btl_openib_component.c:3047:poll_device]
>> [dmc102][[46071,1],14][btl_openib_component.c:3047:poll_device] error
>> polling HP CQ with -2 errno says Success
>> error polling HP CQ with -2 errno says Success
>> [dmc102][[46071,1],9][btl_openib_component.c:3047:poll_device] error
>> polling HP CQ with -2 errno says Success
>> [dmc101][[46071,1],5][btl_openib_component.c:3047:poll_device] error
>> polling HP CQ with -2 errno says Success
>> [dmc102][[46071,1],13][btl_openib_component.c:3047:poll_device] error
>> polling HP CQ with -2 errno says Success
>> [dmc102][[46071,1],13][btl_openib_component.c:3047:poll_device]
>> [dmc101][[46071,1],2][btl_openib_component.c:3047:poll_device] error
>> polling HP CQ with -2 errno says Success
>> error polling HP CQ with -2 errno says
>> Success[dmc102][[46071,1],10][btl_openib_component.c:3047:poll_device]
>> error polling HP CQ with -2 errno says Success
>> [dmc102][[46071,1],10][btl_openib_component.c:3047:poll_device] error
>> polling HP CQ with -2 errno says Success
>> [dmc102][[46071,1],8][btl_openib_component.c:3047:poll_device] error
>> polling HP CQ with -2 errno says Success
>>
>>
>> System info:
>> Compute nodes:
>> http://www.supermicro.com/products/system/2U/6026/SYS-6026TT-IBXF.cfm
>> Which has an integrated Mellanox Technologies MT26418 [ConnectX IB
>> DDR, PCIe 2.0 5GT/s] (rev a0)
>>
>> asnrcw_at_dmc129:~> uname -a
>> Linux dmc129 2.6.27.29-0.1-default #1 SMP 2009-08-15 17:53:59 +0200
>> x86_64 x86_64 x86_64 GNU/Linux
>>
>> asnrcw_at_dmc129:~> rpm -qa | grep ofed
>> ofed-doc-1.4.0-11.12
>> ofed-1.4.0-11.12
>>
>> asnrcw_at_dmc129:~> cat /etc/SuSE-release
>> SUSE Linux Enterprise Server 11 (x86_64)
>> VERSION = 11
>> PATCHLEVEL = 0
>> asnrcw_at_dmc129:~>
>>
>>
>> Subnet manager is running on a Voltaire 9024 DM Switch (firmware
>> version 5.1.0)
>>
>>
>>
>> asnrcw_at_dmc129:~> ibv_devinfo
>> hca_id: mlx4_0
>>         fw_ver:          2.6.000
>>         node_guid:       0030:48c8:b919:0000
>>         sys_image_guid:  0030:48c8:b919:0003
>>         vendor_id:       0x02c9
>>         vendor_part_id:  26418
>>         hw_ver:          0xA0
>>         board_id:        SM_2081000001000
>>         phys_port_cnt:   1
>>                 port: 1
>>                         state:       PORT_ACTIVE (4)
>>                         max_mtu:     2048 (4)
>>                         active_mtu:  2048 (4)
>>                         sm_lid:      1
>>                         port_lid:    139
>>                         port_lmc:    0x00
>>
>> asnrcw_at_dmc129:~> ulimit -l
>> unlimited
>>
>>
>>
>>
>>
>>
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Charles Wright, HPC Systems Specialist
Computer Sciences Corporation
High Performance Computing Center of Excellence
http://www.cschpc.com
(256)971-7429
cwrigh31_at_[hidden]