
Open MPI User's Mailing List Archives

Subject: [OMPI users] Problem running over IB with huge data set
From: Paul Kapinos (kapinos_at_[hidden])
Date: 2012-02-27 15:18:01


Hello Jeff, Ralph, All Open MPI folks,

We had an off-list discussion about an error in the Serpent program. Ralph said:

> We already have several tickets for that problem, each relating to a different
> scenario:
>https://svn.open-mpi.org/trac/ompi/ticket/2155
>https://svn.open-mpi.org/trac/ompi/ticket/2157
>https://svn.open-mpi.org/trac/ompi/ticket/2295

I've built a fairly small reproducer for the original issue (with a huge memory
footprint) and have sent it to you.

The other week, another user ran into problems when using huge data sets.

A program that runs without any problem on smaller data sets (on the order of
24 GB of data in total and below) gets into trouble with huge data sets (on the
order of 100 GB of data in total and more),
_but only when running over InfiniBand or IPoIB_.

The program essentially hangs, usually blocking the transport used; in some
scenarios it crashes.
The same program and data set run fine over Ethernet or shared memory (yes,
we have computers with hundreds of GB of memory). The behaviour is reproducible.

Various errors are produced; some of them are listed below. Another observation:
in most cases, when the program hangs, it also blocks the transport, i.e. other
programs cannot run over the same interface (just as reported earlier).

More fun: we also found some '#procs x #nodes' combinations where the program
runs fine.

For example:
30 and 60 processes over 6 nodes run through fine;
6 procs over 6 nodes - killed with an error message (see below);
12, 18, 36, 61, 62, 64, 66 procs over 6 nodes - hang and block the interface.

Well, we cannot guarantee that this isn't a bug in the program itself, because
it is still under development. However, since the program works well with
smaller data sets, and over TCP and over shared memory, it smells like an MPI
library error - hence this mail.

Or could the puzzling behaviour be a consequence of a bug in the program
itself? If so, what could it be, and how could we try to find it?

I did not attach a reproducer to this mail because the user does not want to
spread the code all over the world, but I can send it to you if you are
interested in reproducing the issue. [The code transposes huge matrices and
essentially calls MPI_Alltoallv; it is written as 'nice, well-structured' C++
code (nothing stays unwrapped) but is pretty small and readable.]
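Purely as a back-of-the-envelope check (my own speculation, not a confirmed
diagnosis): if the transpose splits the data evenly, an all-to-all exchanges
roughly total/P^2 bytes per process pair, and per-pair chunks approaching
2^31 bytes are a classic trouble spot for large-message code paths (the C
bindings of MPI_Alltoallv also take plain `int` counts). A minimal sketch of
that arithmetic with the sizes from this report, assuming an even P x P block
partition (the real reproducer may partition differently):

```python
# Back-of-the-envelope estimate of per-pair message sizes in an
# MPI_Alltoallv-based transpose, assuming the total data is split
# evenly into P x P blocks (one block per sender/receiver pair).
INT_MAX = 2**31 - 1  # MPI counts are C ints; 2 GiB is also a common BTL limit

def per_pair_bytes(total_bytes, nprocs):
    """Approximate number of bytes each rank sends to each other rank."""
    return total_bytes // (nprocs * nprocs)

for total, procs in [(24 * 10**9, 6), (100 * 10**9, 6), (100 * 10**9, 30)]:
    chunk = per_pair_bytes(total, procs)
    flag = "  <-- exceeds 2^31 - 1 bytes" if chunk > INT_MAX else ""
    print(f"{total / 1e9:.0f} GB over {procs} procs: "
          f"~{chunk / 1e9:.2f} GB per pair{flag}")
```

Under these assumptions, only the 100 GB / 6-process case crosses the 2 GiB
per-message line, which would match the 6-process failure and the 30- and
60-process successes, though it would not by itself explain the hangs at
12-36 processes.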

Ralph, Jeff, anybody - any interest in reproducing this issue?

Best wishes,
Paul Kapinos

P.S. Open MPI 1.5.3 used - still waiting for 1.5.5 ;-)

Some error messages:

with 6 procs over 6 nodes:
------------------------------------------------------------------------------
mlx4: local QP operation err (QPN 7c0063, WQE index 0, vendor syndrome 6f,
opcode = 5e)
[[8771,1],5][btl_openib_component.c:3316:handle_wc] from
linuxbdc07.rz.RWTH-Aachen.DE to: linuxbdc04 error polling LP CQ with status
LOCAL QP OPERATION ERROR status number 2 for wr_id 6afb70 opcode 0 vendor error
111 qp_idx 3
mlx4: local QP operation err (QPN 18005f, WQE index 0, vendor syndrome 6f,
opcode = 5e)
[[8771,1],2][btl_openib_component.c:3316:handle_wc] from
linuxbdc03.rz.RWTH-Aachen.DE to: linuxbdc02 error polling LP CQ with status
LOCAL QP OPERATION ERROR status number 2 for wr_id 6afb70 opcode 0 vendor error
111 qp_idx 3
[[8771,1],1][btl_openib_component.c:3316:handle_wc] from
linuxbdc02.rz.RWTH-Aachen.DE to: linuxbdc01 error polling LP CQ with status
LOCAL QP OPERATION ERROR status number 2 for wr_id 6afb70 opcode 0 vendor error
111 qp_idx 3
mlx4: local QP operation err (QPN 340057, WQE index 0, vendor syndrome 6f,
opcode = 5e)
------------------------------------------------------------------------------

with 61 processes using IPoIB:
mpiexec -mca btl ^openib -np 61 -host 1,2,3,4,5,6 a.out < dim100G.in
------------------------------------------------------------------------------
[linuxbdc02.rz.RWTH-Aachen.DE][[21403,1],1][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect]
connect() to 134.61.208.202 failed: Connection timed out (110)
[linuxbdc01.rz.RWTH-Aachen.DE][[21403,1],18][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect]
connect() to 134.61.208.203 failed: Connection timed out (110)
[linuxbdc01.rz.RWTH-Aachen.DE][[21403,1],18][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect]
connect() to 134.61.208.203 failed: Connection timed out (110)
------------------------------------------------------------------------------

-- 
Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23,  D 52074  Aachen (Germany)
Tel: +49 241/80-24915