On Apr 29, 2014, at 4:28 PM, Vince Grimes <tom.grimes_at_[hidden]> wrote:
> I realize it is no longer in the history of replies for this message, but the reason I am trying to use tcp instead of Infiniband is because:
> We are using an in-house program called ScalIT that performs operations on very large sparse distributed matrices.
> ScalIT works on other clusters with comparable hardware and software, but not ours.
> Other programs run just fine on our cluster using OpenMPI.
> ScalIT runs to completion using OpenMPI *on a single 12-core node*.
> It was suggested to me by another list member that I try forcing usage of tcp instead of Infiniband, so that's what I've been trying, just to see if it will work. I guess the tcp code is expected to be more reliable?
No, but it *should* be easier to configure...
We have previously seen instability of the IP-over-IB drivers, but I haven't been directly involved in the IB community for years, so that information may well be dated.
> The mca parameters used to produce the current error are: "--mca btl self,sm,tcp --mca btl_tcp_if_exclude lo,ib0"
> The previous Infiniband error message is:
> local QP operation err (QPN 7c1d43, WQE @ 00015005, CQN 7a009a, index 307512)
> [ 0] 007c1d43
> [ 4] 00000000
> [ 8] 00000000
> [ c] 00000000
> [10] 026b2ed0
> [14] 00000000
> [18] 00015005
> [1c] ff100000
> [[31552,1],84][btl_openib_component.c:3492:handle_wc] from compute-4-5.local to: compute-4-13 error polling LP CQ with status LOCAL QP OPERATION ERROR status number 2 for wr_id 246f300 opcode 128 vendor error 107 qp_idx 0
> It was also suggested that I disable eager RDMA. Doing this ("--mca btl_openib_use_eager_rdma 0") results in:
> [[30430,1],234][btl_openib_component.c:3492:handle_wc] from compute-1-18.local to: compute-6-10 error polling HP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 2c41e80 opcode 128 vendor error 244 qp_idx 0
> All the Infiniband errors come in the same place with respect to the program output and reference the same OpenMPI code line. (It is notoriously difficult to trace through this program to be sure of the location in the code where the error occurs as ScalIT is written in appalling FORTRAN.)
Do you know for sure that this is a correct MPI application?
The errors you describe above may well be due to IB layer-0 kinds of problems (e.g., bad cables and/or bad HCAs), or they could be due to application errors (e.g., memory corruption).
I say this because if you're getting hangs in TCP and errors with IB, it could be that the application itself is faulty...
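One rough way to start separating the two possibilities is to tally which hosts show up in the openib error lines across several runs. The sketch below is only an illustration: the regular expression is modeled on the two error lines quoted above and may not match other Open MPI versions, and the names `WC_ERROR`, `short_host`, and `tally_hosts` are made up for this example. It also rests on an assumption, not a rule: errors that keep landing on the same node or pair of nodes hint at hardware, while errors spread across many nodes fit an application problem better.

```python
import re
from collections import Counter

# Assumed pattern, modeled on the two openib error lines quoted in this
# thread; other Open MPI versions may word these lines differently.
WC_ERROR = re.compile(
    r"from (?P<src>\S+) to: (?P<dst>\S+) error polling \w+ CQ "
    r"with status (?P<status>.+?) status number (?P<num>\d+)"
)


def short_host(name):
    """Drop any domain suffix so 'compute-4-5.local' matches 'compute-4-5'."""
    return name.split(".", 1)[0]


def tally_hosts(lines):
    """Count how often each host and each status text appears in error lines."""
    hosts = Counter()
    statuses = Counter()
    for line in lines:
        match = WC_ERROR.search(line)
        if match is None:
            continue  # not an openib completion-queue error line
        hosts[short_host(match.group("src"))] += 1
        hosts[short_host(match.group("dst"))] += 1
        statuses[match.group("status")] += 1
    return hosts, statuses


# The two error lines from this thread, used as sample input.
sample = [
    "[[31552,1],84][btl_openib_component.c:3492:handle_wc] from "
    "compute-4-5.local to: compute-4-13 error polling LP CQ with status "
    "LOCAL QP OPERATION ERROR status number 2 for wr_id 246f300 "
    "opcode 128 vendor error 107 qp_idx 0",
    "[[30430,1],234][btl_openib_component.c:3492:handle_wc] from "
    "compute-1-18.local to: compute-6-10 error polling HP CQ with status "
    "WORK REQUEST FLUSHED ERROR status number 5 for wr_id 2c41e80 "
    "opcode 128 vendor error 244 qp_idx 0",
]

hosts, statuses = tally_hosts(sample)
for host, count in hosts.most_common():
    print(count, host)
for status, count in statuses.most_common():
    print(count, status)
```

With only the two lines above, every host appears once, so nothing can be concluded yet; the tally becomes useful only after collecting the output of several failing runs.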