
Open MPI User's Mailing List Archives


From: Vince Grimes (tom.grimes_at_[hidden])
Date: 2014-03-05 17:21:51

OpenMPI folks:

        I am having trouble running a specific program (ScalIT, a code produced
and maintained by one of the research groups here at TTU) using
Infiniband. Conflicting observations have made it impossible to
diagnose the problem:

1) Other programs (like NWChem) run using OpenMPI over multiple nodes
using Infiniband without any problems at all.

2) ScalIT runs on other clusters (and I believe with OpenMPI) without error.

3) ScalIT runs with OpenMPI on a single node without error.

4) ScalIT dies at a particular place when run with OpenMPI over
multiple nodes (20).

I don't know whether it is a hardware problem (though other codes work
just fine) or a programming error in ScalIT (though it works without
modification on other clusters).

The error I am getting is:
local QP operation err (QPN 0014bc, WQE @ 00009005, CQN 000097, index
   [ 0] 000014bc
   [ 4] 00000000
   [ 8] 00000000
   [ c] 00000000
   [10] 026f3410
   [14] 00000000
   [18] 00009005
   [1c] ff100000
[[44095,1],45][btl_openib_component.c:3492:handle_wc] from
compute-6-13.local to: compute-3-11 error polling LP CQ with status
LOCAL QP OPERATION ERROR status number 2 for wr_id 40c5e00 opcode 0
vendor error 111 qp_idx 0
mpirun has exited due to process rank 45 with PID 27168 on
node compute-6-13.local exiting improperly. There are two reasons this
could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
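For reference, the contract mpirun's message describes can be illustrated with a minimal program (a generic sketch, not ScalIT code): every rank must call MPI_Init before any other MPI call and MPI_Finalize before returning, or mpirun reports exactly this kind of abnormal termination. It needs an MPI toolchain (e.g. `mpicc hello.c && mpirun -np 4 ./a.out`) to build and run.

```c
/* Minimal sketch of the init/finalize rule quoted above.
 * Every rank calls MPI_Init first and MPI_Finalize last;
 * exiting between the two is an "abnormal termination". */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);               /* every rank must call this */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d alive\n", rank, size);
    MPI_Finalize();                       /* ...and this, before exit  */
    return 0;
}
```

Note that a rank crashing mid-run (as in the QP error above) produces the same "exited without calling finalize" diagnosis, so this message is a symptom of the InfiniBand failure, not a separate bug.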

I am using OpenMPI 1.6.5 compiled with the Intel 11.1-080 compilers.

`uname -a` returns "Linux compute-1-1.local 2.6.32-279.14.1.el6.x86_64
#1 SMP Tue Nov 6 23:43:09 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux"

ibv_devinfo returns
hca_id: mthca0
         transport: InfiniBand (0)
         fw_ver: 1.2.0
         node_guid: 0005:ad00:001f:fed8
         sys_image_guid: 0005:ad00:0100:d050
         vendor_id: 0x02c9
         vendor_part_id: 25204
         hw_ver: 0xA0
         board_id: MT_03B0120002
         phys_port_cnt: 1
                 port: 1
                         state: PORT_ACTIVE (4)
                         max_mtu: 2048 (4)
                         active_mtu: 2048 (4)
                         sm_lid: 1
                         port_lid: 39
                         port_lmc: 0x00
                         link_layer: IB

Any help in tracking down the problem is greatly appreciated.

T. Vince Grimes, Ph.D.
CCC System Administrator
Texas Tech University
Dept. of Chemistry and Biochemistry (10A)
Box 41061
Lubbock, TX 79409-1061
(806) 834-0813 (voice);     (806) 742-1289 (fax)