
Open MPI User's Mailing List Archives


Subject: [OMPI users] FW: LOCAL QP OPERATION ERROR
From: Joshua Ladd (joshual_at_[hidden])
Date: 2014-03-11 20:28:27


Hi, Vince

Have you tried with a different BTL? In particular, have you tried with the TCP BTL? Please try setting "-mca btl sm,self,tcp" and see if you still run into the issue.
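For example (the executable name and process count below are placeholders for however you normally launch ScalIT):

    mpirun -np 40 --mca btl sm,self,tcp ./scalit.x

Adding "--mca btl_base_verbose 100" to that command line will print the BTL selection logic, so you can confirm that openib really is excluded from the run.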

How is your OMPI configured?
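If you don't have the configure line handy, ompi_info will report it; something like:

    ompi_info | grep -i configure

should show the configure command line and related build details.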

Josh

> From: Vince Grimes <tom.grimes_at_[hidden]>
> Subject: [OMPI users] LOCAL QP OPERATION ERROR
> Date: March 5, 2014 5:21:51 PM EST
> To: <users_at_[hidden]>
> Reply-To: Open MPI Users <users_at_[hidden]>
>
> OpenMPI folks:
>
> I am having trouble running a specific program (ScalIT, a code produced and maintained by one of the research groups here at TTU) using InfiniBand. The following observations conflict, and have so far made it impossible to diagnose the problem:
>
> 1) Other programs (like NWChem) run with OpenMPI over InfiniBand across multiple nodes without any problems at all.
>
> 2) ScalIT runs on other clusters (and I believe with OpenMPI) without error.
>
> 3) ScalIT runs with OpenMPI on a single node without error.
>
> 4) ScalIT dies at a particular place when run with OpenMPI over multiple nodes (20).
>
> I don't know whether it is a hardware problem (but other codes work just fine) or a programming error in ScalIT (but it works without modification on other clusters).
>
> The error I am getting is:
> local QP operation err (QPN 0014bc, WQE @ 00009005, CQN 000097, index 2232620)
> [ 0] 000014bc [ 4] 00000000 [ 8] 00000000 [ c] 00000000
> [10] 026f3410 [14] 00000000 [18] 00009005 [1c] ff100000
> [[44095,1],45][btl_openib_component.c:3492:handle_wc] from compute-6-13.local to: compute-3-11 error polling LP CQ with status LOCAL QP OPERATION ERROR status number 2 for wr_id 40c5e00 opcode 0 vendor error 111 qp_idx 0
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 45 with PID 27168 on node
> compute-6-13.local exiting improperly. There are two reasons this could occur:
>
> 1. this process did not call "init" before exiting, but others in the
> job did. This can cause a job to hang indefinitely while it waits for
> all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
>
> I am using OpenMPI 1.6.5 compiled with the Intel 11.1-080 compilers.
>
> `uname -a` returns "Linux compute-1-1.local 2.6.32-279.14.1.el6.x86_64 #1 SMP Tue Nov 6 23:43:09 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux"
>
> ibv_devinfo returns
> hca_id: mthca0
>         transport:                      InfiniBand (0)
>         fw_ver:                         1.2.0
>         node_guid:                      0005:ad00:001f:fed8
>         sys_image_guid:                 0005:ad00:0100:d050
>         vendor_id:                      0x02c9
>         vendor_part_id:                 25204
>         hw_ver:                         0xA0
>         board_id:                       MT_03B0120002
>         phys_port_cnt:                  1
>                 port:   1
>                         state:          PORT_ACTIVE (4)
>                         max_mtu:        2048 (4)
>                         active_mtu:     2048 (4)
>                         sm_lid:         1
>                         port_lid:       39
>                         port_lmc:       0x00
>                         link_layer:     IB
>
>
> Any help in tracking down the problem is greatly appreciated.
>
> --
> T. Vince Grimes, Ph.D.
> CCC System Administrator
>
> Texas Tech University
> Dept. of Chemistry and Biochemistry (10A)
> Box 41061
> Lubbock, TX 79409-1061
>
> (806) 834-0813 (voice); (806) 742-1289 (fax)
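For what it's worth, the "init"/"finalize" rule in that mpirun message is just the standard MPI lifecycle; a minimal sketch in C of what every rank is expected to do:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        /* Every rank must call MPI_Init before any other MPI call. */
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("hello from rank %d\n", rank);

        /* ...and must call MPI_Finalize before exiting; otherwise
           mpirun reports the exit as an abnormal termination. */
        MPI_Finalize();
        return 0;
    }

In your case that message is almost certainly a symptom rather than the cause: rank 45 died on the openib error before it could ever reach MPI_Finalize.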
