Open MPI User's Mailing List Archives

Subject: [OMPI users] FW: LOCAL QP OPERATION ERROR
From: Joshua Ladd (joshual_at_[hidden])
Date: 2014-03-11 20:28:27


Hi, Vince

Have you tried a different BTL, in particular the TCP BTL? Please try running with "-mca btl sm,self,tcp" and see whether you still hit the issue.
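
A minimal sketch of what such a run could look like, assuming the job is launched directly with mpirun. The process count (40) and the executable name ./scalit_app are placeholders, not taken from your setup; only the "-mca btl sm,self,tcp" option comes from the suggestion above. The snippet only prints the command so it can be checked before use.

```shell
#!/bin/sh
# Restrict Open MPI to the shared-memory, loopback, and TCP BTLs,
# which leaves the openib (InfiniBand) BTL out of the run.
BTL_LIST="sm,self,tcp"

# Placeholders (assumptions, not from the original report):
NP=40                 # number of MPI processes
APP="./scalit_app"    # hypothetical path to the ScalIT executable

CMD="mpirun -mca btl ${BTL_LIST} -np ${NP} ${APP}"

# Print the command for inspection; run it with: eval "$CMD"
echo "$CMD"
```

If the job completes over TCP, that would suggest looking at the openib path rather than at ScalIT itself.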

How is your OMPI configured?

Josh

> From: Vince Grimes <tom.grimes_at_[hidden]>
> Subject: [OMPI users] LOCAL QP OPERATION ERROR
> Date: March 5, 2014 5:21:51 PM EST
> To: <users_at_[hidden]>
> Reply-To: Open MPI Users <users_at_[hidden]>
>
> OpenMPI folks:
>
> I am having trouble running a specific program (ScalIT, a code produced and maintained by one of the research groups here at TTU) over InfiniBand. Conflicting observations have made it impossible to diagnose the problem:
>
> 1) Other programs (like NWChem) run using OpenMPI over multiple nodes using InfiniBand without any problems at all.
>
> 2) ScalIT runs on other clusters (and I believe with OpenMPI) without error.
>
> 3) ScalIT runs with OpenMPI on a single node without error.
>
> 4) ScalIT dies at a particular place when run with OpenMPI over multiple nodes (20).
>
> I don't know whether it is a hardware problem (but other codes work just fine) or a programming error in ScalIT (but it works without modification on other clusters).
>
> The error I am getting is:
> local QP operation err (QPN 0014bc, WQE @ 00009005, CQN 000097, index 2232620)
> [ 0] 000014bc [ 4] 00000000 [ 8] 00000000 [ c] 00000000
> [10] 026f3410 [14] 00000000 [18] 00009005 [1c] ff100000
> [[44095,1],45][btl_openib_component.c:3492:handle_wc] from compute-6-13.local to: compute-3-11 error polling LP CQ with status LOCAL QP OPERATION ERROR status number 2 for wr_id 40c5e00 opcode 0 vendor error 111 qp_idx 0
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 45 with PID 27168 on node
> compute-6-13.local exiting improperly. There are two reasons this could occur:
>
> 1. this process did not call "init" before exiting, but others in the
> job did. This can cause a job to hang indefinitely while it waits for
> all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
>
> I am using OpenMPI 1.6.5 compiled with the Intel 11.1-080 compilers.
>
> `uname -a` returns "Linux compute-1-1.local 2.6.32-279.14.1.el6.x86_64 #1 SMP Tue Nov 6 23:43:09 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux"
>
> ibv_devinfo returns
> hca_id: mthca0
>     transport: InfiniBand (0)
>     fw_ver: 1.2.0
>     node_guid: 0005:ad00:001f:fed8
>     sys_image_guid: 0005:ad00:0100:d050
>     vendor_id: 0x02c9
>     vendor_part_id: 25204
>     hw_ver: 0xA0
>     board_id: MT_03B0120002
>     phys_port_cnt: 1
>         port: 1
>             state: PORT_ACTIVE (4)
>             max_mtu: 2048 (4)
>             active_mtu: 2048 (4)
>             sm_lid: 1
>             port_lid: 39
>             port_lmc: 0x00
>             link_layer: IB
>
>
> Any help in tracking down the problem is greatly appreciated.
>
> --
> T. Vince Grimes, Ph.D.
> CCC System Administrator
>
> Texas Tech University
> Dept. of Chemistry and Biochemistry (10A) Box 41061 Lubbock, TX
> 79409-1061
>
> (806) 834-0813 (voice); (806) 742-1289 (fax)

--
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/