Open MPI User's Mailing List Archives

Subject: [OMPI users] sending/receiving large buffers
From: Martin Siegert (siegert_at_[hidden])
Date: 2009-11-08 23:40:23


Hi,

I am running into a problem with mpi_allreduce when large buffers
are used. The problem does not appear to be unique to mpi_allreduce; it
occurs with mpi_send/mpi_recv as well. A test program is attached.

1) Running the MPI_Allreduce version:

# mpiexec -machinefile mfile -n 2 ./allreduce
choose algorithm: enter 1 for MPI_Allreduce
                  enter 2 for MPI_Send/Recv and MPI_Bcast
1
enter array size (integer; negative to stop):
40000000
allreduce completed 0.661867
enter array size (integer; negative to stop):
80000000
allreduce completed 1.356263
enter array size (integer; negative to stop):
160000000
allreduce completed 2.700941
enter array size (integer; negative to stop):
320000000

At this point the program just hangs forever.

2) Running the MPI_Send/MPI_Recv/MPI_Bcast version:

# mpiexec -machinefile mfile -n 2 ./allreduce
choose algorithm: enter 1 for MPI_Allreduce
                  enter 2 for MPI_Send/Recv and MPI_Bcast
2
enter array size (integer; negative to stop):
40000000
id=0 received data from id=1 in 0.263818
bcast completed in 0.652631
allreduce completed in 1.102356
enter array size (integer; negative to stop):
80000000
id=0 received data from id=1 in 0.671201
bcast completed in 1.298208
allreduce completed in 2.341906
enter array size (integer; negative to stop):
160000000
[[43618,1],0][btl_openib_component.c:2951:handle_wc] from b2 to: b1 error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 102347120 opcode 1 vendor error 105 qp_idx 3
--------------------------------------------------------------------------
mpiexec has exited due to process rank 0 with PID 26254 on
node b2 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------

All programs and libraries are 64-bit; the interconnect is InfiniBand (IB).
I would expect problems with array sizes larger than 2^31-1, but the sizes
used here are still much smaller.
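
For reference, here is a rough check of how the array sizes above translate
into byte counts relative to 2^31-1. This is only a sketch: the element type
used by the attached program is not shown in this message, so the 4-byte and
8-byte element sizes below are assumptions, and the helper names are made up
for illustration.

```python
# Byte counts for the array sizes tried above, compared with 2^31 - 1.
# ASSUMPTION: element sizes of 4 and 8 bytes are illustrative only; the
# datatype used by the attached program is not shown in this message.

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold

# Array sizes entered in the runs above.
ARRAY_SIZES = [40_000_000, 80_000_000, 160_000_000, 320_000_000]

# Hypothetical element sizes in bytes.
ELEMENT_SIZES = {"4-byte elements": 4, "8-byte elements": 8}


def message_bytes(count, element_size):
    """Total buffer size in bytes for `count` elements."""
    return count * element_size


def exceeds_int32(count, element_size):
    """True if the buffer size in bytes does not fit in a signed 32-bit int."""
    return message_bytes(count, element_size) > INT32_MAX


if __name__ == "__main__":
    for count in ARRAY_SIZES:
        for label, size in ELEMENT_SIZES.items():
            nbytes = message_bytes(count, size)
            verdict = "exceeds" if exceeds_int32(count, size) else "fits in"
            print(f"{count:>11d} x {label}: {nbytes:>13d} bytes ({verdict} 2^31-1)")
```

All four element counts are below 2^31-1. With 4-byte elements the byte
counts also stay below it for every size; with 8-byte elements the
320000000 case comes to 2560000000 bytes, which is above 2^31-1, while the
160000000 case (1280000000 bytes) is not.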

What is the problem here?

Cheers,
Martin

-- 
Martin Siegert
Head, Research Computing
WestGrid Site Lead
IT Services                                phone: 778 782-4691
Simon Fraser University                    fax:   778 782-4242
Burnaby, British Columbia                  email: siegert_at_[hidden]
Canada  V5A 1S6