Open MPI User's Mailing List Archives

From: Keith Refson (Keith.lists_at_[hidden])
Date: 2007-04-20 05:19:00


Dear OMPI list,

I'm running into a problem with Open MPI 1.2 where an MPI program
crashes with:

local QP operation err (QPN 380404, WQE @ 00000583, CQN 040085, index 1147949)
  [ 0] 00380404
  [ 4] 00000000
  [ 8] 00000000
  [ c] 00000000
  [10] 026f0000
  [14] 00000000
  [18] 00000583
  [1c] ff000000
[0,1,0][btl_openib_component.c:1195:btl_openib_component_progress] from n0001.yquem to: n0002.yquem
error polling HP CQ with status LOCAL QP OPERATION ERROR status number 2 for wr_id 42714736 opcode 0

Can someone interpret this for me, or suggest how to obtain more
useful information? My guess is that the cause is running out of
buffer space. If so, is this a bug or a limit in Open MPI?

The machine is a cluster of dual 2.66 GHz Xeon nodes with InfiniBand.

Some background: the error occurs in a test case I run widely for a
large electronic structure code, in the routine that gathers a large
quantity of data from all of the processors in the run onto the root
node to write an output file. Each processor MPI_Send()s a number of
blocks of data to root, which MPI_Recv()s them in nested loops over
blocks and remote nodes.
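
In outline the pattern is something like the following (a minimal
sketch only; the block counts, sizes and data types here are invented,
not the ones the code actually uses):

#include <mpi.h>

#define NBLOCKS   32       /* blocks per process (illustrative) */
#define BLOCKLEN  4096     /* doubles per block  (illustrative) */

void gather_to_root(double blocks[NBLOCKS][BLOCKLEN], MPI_Comm comm)
{
    int rank, nproc;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nproc);

    if (rank == 0) {
        double buf[BLOCKLEN];
        /* Nested loops: over remote ranks, then over their blocks. */
        for (int src = 1; src < nproc; src++)
            for (int b = 0; b < NBLOCKS; b++) {
                MPI_Recv(buf, BLOCKLEN, MPI_DOUBLE, src, b, comm,
                         MPI_STATUS_IGNORE);
                /* ... write buf out to the file ... */
            }
    } else {
        /* Every other process sends all of its blocks to root. */
        for (int b = 0; b < NBLOCKS; b++)
            MPI_Send(blocks[b], BLOCKLEN, MPI_DOUBLE, 0, b, comm);
    }
}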

We have had problems in the past with the volume of data overwhelming
other MPI implementations' buffer space during this step, so there is
a synchronization step which makes the remote nodes wait on a blocking
recv for a "go ahead and send" message from root. Using this, the
number of data blocks (messages) sent at once can be controlled.
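
Roughly, and again only as an illustration with an invented tag, token
and batch size rather than the code's real ones, the flow control
looks like this:

#include <mpi.h>

#define NBLOCKS   32       /* as in the sketch above                    */
#define BLOCKLEN  4096
#define BATCH     16       /* blocks sent per go-ahead (illustrative)   */
#define TAG_GO    999      /* tag for the go-ahead token (illustrative) */

void gather_throttled(double blocks[NBLOCKS][BLOCKLEN], MPI_Comm comm)
{
    int rank, nproc, token = 1;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nproc);

    if (rank == 0) {
        double buf[BLOCKLEN];
        for (int b0 = 0; b0 < NBLOCKS; b0 += BATCH) {
            /* Tell every remote rank it may send its next batch ... */
            for (int src = 1; src < nproc; src++)
                MPI_Send(&token, 1, MPI_INT, src, TAG_GO, comm);
            /* ... then drain the batches rank by rank, so at most
             * (nproc - 1) * BATCH messages are in flight at once. */
            for (int src = 1; src < nproc; src++)
                for (int b = b0; b < b0 + BATCH && b < NBLOCKS; b++)
                    MPI_Recv(buf, BLOCKLEN, MPI_DOUBLE, src, b, comm,
                             MPI_STATUS_IGNORE);
        }
    } else {
        for (int b0 = 0; b0 < NBLOCKS; b0 += BATCH) {
            /* Block until root says "go ahead and send". */
            MPI_Recv(&token, 1, MPI_INT, 0, TAG_GO, comm,
                     MPI_STATUS_IGNORE);
            for (int b = b0; b < b0 + BATCH && b < NBLOCKS; b++)
                MPI_Send(blocks[b], BLOCKLEN, MPI_DOUBLE, 0, b, comm);
        }
    }
}

With 32 blocks allowed per go-ahead on 16 nodes, that is where the
15 x 32 = 480 outstanding messages mentioned below come from.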

With the default of 32 blocks at once, running on 16 nodes (so with
potentially 15 x 32 = 480 outstanding messages at a time), the crash
occurs. Restricting the number of blocks per node to 16 (i.e. at most
240 pending messages) gives a successful run with no crash.

Version 1.2 of Open MPI seems better than 1.1.5 in this respect:
1.1.5 always crashes on the 16-node run even with only one message
sent at a time from each processor. For some reason 1.1.5 also gives
a better traceback:

local QP operation err (QPN 180408, WQE @ 00000703, CQN 140085, index 1309215)
  [ 0] 00180408
  [ 4] 00000000
  [ 8] 00000000
  [ c] 00000000
  [10] 026f0000
  [14] 00000000
  [18] 00000703
  [1c] ff000000
[0,1,0][btl_openib_component.c:897:mca_btl_openib_component_progress] from n0001.yquem to:
n0002.yquem error polling HP CQ with status LOCAL QP OPERATION ERROR status number 2 for wr_id
40618448 opcode 0
Signal:6 info.si_errno:0(Success) si_code:-6()
[0] func:/data/software/x86_64/open-mpi/1.1.5-intel/lib/libopal.so.0 [0x2a95fc404c]
[1] func:/lib64/tls/libpthread.so.0 [0x2a95a12430]
[2] func:/lib64/tls/libc.so.6(gsignal+0x3d) [0x2a965d421d]
[3] func:/lib64/tls/libc.so.6(abort+0xfe) [0x2a965d5a1e]
[4]
func:/data/software/x86_64/open-mpi/1.1.5-intel/lib/libmpi.so.0(mca_btl_openib_component_progress+0x751)
[0x2a95be09d3]
[5] func:/data/software/x86_64/open-mpi/1.1.5-intel/lib/libmpi.so.0(mca_bml_r2_progress+0x3a)
[0x2a95bd48fc]
[6] func:/data/software/x86_64/open-mpi/1.1.5-intel/lib/libopal.so.0(opal_progress+0x80) [0x2a95faaa06]
[7] func:/data/software/x86_64/open-mpi/1.1.5-intel/lib/libmpi.so.0(mca_pml_ob1_recv+0x329)
[0x2a95c2e679]
[8] func:/data/software/x86_64/open-mpi/1.1.5-intel/lib/libmpi.so.0(PMPI_Recv+0x22e) [0x2a95bbdbd2]
[9] func:/data/software/x86_64/open-mpi/1.1.5-intel/lib/libmpi.so.0(pmpi_recv_+0xd9) [0x2a95bcfbdd]
[10] func:/home/krefson/bin/castep-4.1b(comms_mp_comms_recv_integer_+0x45) [0x10e5ae9]
...

I'd appreciate an opinion on whether or not the problem is in
Open MPI, and on the best way to proceed.

Keith Refson