2009/2/26 Brett Pemberton <brett_at_[hidden]>:
> [[1176,1],0][btl_openib_component.c:2905:handle_wc] from tango092.vpac.org
> to: tango090 error polling LP CQ with status RETRY EXCEEDED ERROR status
> number 12 for wr_id 38996224 opcode 0 qp_idx 0
What OS are you using? I've seen this error and many other
InfiniBand-related errors on Red Hat Enterprise Linux 4 Update 4, with
ConnectX cards and various versions of OFED, up to version 1.3.
Depending on the MCA parameters, I also see hangs often enough to make
native InfiniBand unusable on this OS.
However, the openib btl works just fine on the same hardware and the
same OFED/Open MPI stack when used with CentOS 4.6. I suspect there
may be something about the kernel that is contributing to these
problems, but I haven't had a chance to test the CentOS 4.6 kernel on
RHEL 4.4.
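
If it helps to compare how often this error shows up per node or per OS
install, the handle_wc line quoted above can be split into fields with a
small script. This is only a rough sketch: the regular expression is
inferred from that single quoted line, the function name parse_wc_error
is made up for illustration, and other Open MPI versions may well format
the message differently.

```python
import re

# Pattern inferred from the one log line quoted in this thread; treat it
# as an assumption, not as the official Open MPI message format.
WC_ERROR = re.compile(
    r"from (?P<src>\S+)\s+to: (?P<dst>\S+)\s+error polling (?P<cq>\w+) CQ\s+"
    r"with status (?P<status>.+?)\s+status\s+number (?P<code>\d+)\s+"
    r"for wr_id (?P<wr_id>\d+)\s+opcode (?P<opcode>\d+)\s+qp_idx (?P<qp_idx>\d+)"
)


def parse_wc_error(text):
    """Return the fields of an openib handle_wc error as a dict, or None."""
    # Drop email quote markers ("> ") and undo the line wrapping.
    flat = " ".join(line.lstrip("> ").strip() for line in text.splitlines())
    match = WC_ERROR.search(flat)
    return match.groupdict() if match else None


sample = """\
> [[1176,1],0][btl_openib_component.c:2905:handle_wc] from tango092.vpac.org
> to: tango090 error polling LP CQ with status RETRY EXCEEDED ERROR status
> number 12 for wr_id 38996224 opcode 0 qp_idx 0
"""

fields = parse_wc_error(sample)
print(fields["src"], fields["dst"], fields["status"], fields["code"])
# prints: tango092.vpac.org tango090 RETRY EXCEEDED ERROR 12
```

Counting the (src, dst) pairs over a whole job log would show whether the
retries cluster on particular hosts or are spread across the fabric.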
mch