Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI 1.2.5 race condition / core dump with MPI_Reduce and MPI_Gather
From: John Markus Bjørndalen (jmb_at_[hidden])
Date: 2008-02-28 14:45:21

Hi, and thanks for the feedback everyone.

George Bosilca wrote:
> Brian is completely right. Here is a more detailed description of this
> problem.
> On the other side, I hope that not many users write such applications.
> This is the best way to completely kill the performance of any MPI
> implementation, by overloading one process with messages. This is
> exactly what MPI_Reduce and MPI_Gather do, one process will get the
> final result and all other processes only have to send some data. This
> behavior only arises when the gather or the reduce use a very flat
> tree, and only for short messages. Because of the short messages there
> is no handshake between the sender and the receiver, which will make
> all messages unexpected, and the flat tree guarantees that there will
> be a lot of small messages. If you add a barrier every now and then
> (100 iterations) this problem will never happen.
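
To make the quoted explanation concrete, here is a toy model (not MPI code, and not a description of Open MPI internals) of how a backlog of unexpected messages can build up at the root, and how a periodic barrier bounds it. The function name, the step-based timing, and the `drain_per_step` rate are all assumptions made for illustration.

```python
def peak_unexpected(senders, iters, drain_per_step, barrier_every=None):
    """Toy model of the root's unexpected-message backlog.

    Assumptions (hypothetical, for illustration only):
      - each step, every sender posts one short eager message to the root;
      - the root can match at most `drain_per_step` messages per step;
      - a barrier lets the root catch up completely before anyone continues.
    Returns the largest backlog seen during the run.
    """
    backlog = 0
    peak = 0
    for i in range(1, iters + 1):
        backlog += senders                       # eager sends arrive with no handshake
        peak = max(peak, backlog)
        backlog -= min(backlog, drain_per_step)  # root matches what it can
        if barrier_every and i % barrier_every == 0:
            backlog = 0                          # senders wait here until the root is done
    return peak


if __name__ == "__main__":
    # 15 senders, root slightly too slow to keep up, 3000 iterations.
    print("no barrier:       ", peak_unexpected(15, 3000, 10))
    print("barrier every 100:", peak_unexpected(15, 3000, 10, barrier_every=100))
```

In this model the backlog grows linearly with the iteration count when there is no synchronisation, and is capped by the barrier interval otherwise.
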
I have done some more testing. Of the tested parameters, I'm observing
this behaviour with group sizes from 16 to 44, and with 1 to 32768
integers in MPI_Reduce. For MPI_Gather, I'm observing crashes with group
sizes from 16 to 44 and with 1 to 4096 integers (per node).

In other words, it actually happens with other tree configurations and
larger packet sizes :-/

By the way, I'm also observing crashes with MPI_Bcast (groups of
size 4-44 with the root process (rank 0) broadcasting integer arrays of
size 16384 and 32768). It looks like the root process is crashing. Can
a sender crash because it runs out of buffer space as well?

---------- snip --------------
/home/johnm/local/ompi/bin/mpirun -hostfile lamhosts.all.r360 -np 4
./ompi-crash 16384 1 3000
{ 'groupsize' : 4, 'count' : 16384, 'bytes' : 65536, 'bufbytes' :
262144, 'iters' : 3000, 'bmno' : 1
mca_btl_tcp_frag_recv: readv failed with errno=104
mpirun noticed that job rank 0 with PID 16366 on node compute-0-0 exited
on signal 15 (Terminated).
3 additional processes aborted (not shown)
---------- snip --------------
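
For readers decoding the parameter line in the log above: the numbers are consistent with 4-byte integers and a buffer that holds one block per rank. The sketch below reproduces that arithmetic. The helper name and the relation `bufbytes = bytes * groupsize` are assumptions inferred from the printed values, since the benchmark source is not shown here.

```python
INT_BYTES = 4  # assumption: 4-byte ints, consistent with count=16384 -> bytes=65536


def buffer_sizes(groupsize, count, int_bytes=INT_BYTES):
    """Return the per-rank message size and a buffer sized for one block per rank.

    Hypothetical helper: the formula for 'bufbytes' is inferred from the log,
    not taken from the benchmark's code.
    """
    per_rank = count * int_bytes
    return {"bytes": per_rank, "bufbytes": per_rank * groupsize}


if __name__ == "__main__":
    print(buffer_sizes(groupsize=4, count=16384))
```
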
> One more thing, doing a lot of collective in a loop and computing the
> total time is not the correct way to evaluate the cost of any
> collective communication, simply because you will favor all algorithms
> based on pipelining. There is plenty of literature about this topic.
> george.
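
The point about pipelining can be illustrated with a small timing model. This is a sketch under simplifying assumptions (fixed stage time, perfect overlap between consecutive calls), and the function names and numbers are hypothetical. It shows how "total loop time divided by iterations" can make a pipelined algorithm look faster than its single-call latency.

```python
def loop_time_pipelined(stages, stage_time, iters):
    """Total time for `iters` back-to-back calls of a pipelined algorithm.

    Assumption: a new call can enter the pipeline every `stage_time`, so
    consecutive calls overlap. One isolated call still takes stages * stage_time.
    """
    return (stages + iters - 1) * stage_time


def loop_time_blocking(latency, iters):
    """Total time for `iters` calls of an algorithm with no overlap between calls."""
    return latency * iters


if __name__ == "__main__":
    iters = 1000
    pipelined = loop_time_pipelined(stages=8, stage_time=1.0, iters=iters)
    blocking = loop_time_blocking(latency=4.0, iters=iters)
    # Single-call latency: 8.0 (pipelined) vs 4.0 (blocking).
    print("pipelined, average per call in loop:", pipelined / iters)
    print("blocking,  average per call in loop:", blocking / iters)
```

In this model a single pipelined call is twice as slow as the blocking one, yet the loop average ranks it as roughly four times faster.
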
As I said in the original e-mail: I had only thrown them in for a bit of
sanity checking. I expected funny numbers, but not that OpenMPI would
crash.

The original idea was just to make a quick comparison of Allreduce,
Allgather and Alltoall in LAM and OpenMPI. The opportunity for
pipelining the operations there is rather small since they can't get
much out of phase with each other.


// John Markus Bjørndalen