
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] OpenMPI 1.2.5 race condition / core dump with MPI_Reduce and MPI_Gather
From: George Bosilca (bosilca_at_[hidden])
Date: 2008-02-28 17:08:51

On Feb 28, 2008, at 2:45 PM, John Markus Bjørndalen wrote:

> Hi, and thanks for the feedback everyone.
> George Bosilca wrote:
>> Brian is completely right. Here is a more detailed description of
>> this
>> problem.
> [....]
>> On the other hand, I hope that not many users write such
>> applications. This is the best way to completely kill the
>> performance of any MPI implementation: overloading one process
>> with messages. This is exactly what MPI_Reduce and MPI_Gather do;
>> one process gets the final result, and all the other processes
>> only have to send some data. This behavior only arises when the
>> gather or the reduce uses a very flat tree, and only for short
>> messages. Because of the short messages there is no handshake
>> between the sender and the receiver, which makes all the messages
>> unexpected, and the flat tree guarantees that there will be a lot
>> of small messages. If you add a barrier every now and then (every
>> 100 iterations) this problem will never happen.
> I have done some more testing. Of the tested parameters, I'm observing
> this behaviour with group sizes from 16-44, and from 1 to 32768
> integers
> in MPI_Reduce. For MPI_Gather, I'm observing crashes with group sizes
> 16-44 and from 1 to 4096 integers (per node).
> In other words, it actually happens with other tree configurations and
> larger packet sizes :-/

This is the limit for the rendezvous protocol over TCP, and it is the
upper limit at which this problem will arise. I strongly doubt that it
is possible to create the same problem with messages larger than the
eager size of your BTL ...
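To make the pattern concrete, here is a minimal sketch (my own, not code from this thread) of the kind of loop being discussed: a tight MPI_Reduce loop that floods the root with eager messages, together with the periodic barrier suggested above as a workaround. Buffer sizes and iteration counts are taken from the tests mentioned in the thread.

```c
/* Sketch: unexpected-message flood at the root, with periodic barrier.
 * Hypothetical reproduction, not the original benchmark code. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { COUNT = 4096, ITERS = 3000 };
    int *sendbuf = malloc(COUNT * sizeof(int));
    int *recvbuf = malloc(COUNT * sizeof(int));
    for (int i = 0; i < COUNT; i++)
        sendbuf[i] = rank;

    for (int iter = 0; iter < ITERS; iter++) {
        /* Messages below the eager limit are sent without a handshake,
         * so non-root ranks can run far ahead of the root, piling up
         * unexpected messages there. */
        MPI_Reduce(sendbuf, recvbuf, COUNT, MPI_INT, MPI_SUM,
                   0, MPI_COMM_WORLD);

        /* Workaround from the discussion above: resynchronize every
         * 100 iterations so the root can drain its queue. */
        if (iter % 100 == 0)
            MPI_Barrier(MPI_COMM_WORLD);
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

Without the barrier, nothing bounds how far the senders can get ahead of the root; with it, the backlog is capped at roughly 100 iterations' worth of messages.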

> By the way, I'm also observing crashes with MPI_Bcast (groups of
> size 4-44 with the root process (rank 0) broadcasting integer arrays
> of
> size 16384 and 32768). It looks like the root process is crashing.
> Can
> a sender crash because it runs out of buffer space as well?

I don't think the root crashed. I guess that one of the other nodes
crashed, the root got a bad socket (which is what the first error
message seems to indicate), and was terminated. As the output is not
synchronized between the nodes, one cannot rely on its order or
contents. Moreover, mpirun reports that the root was killed with
signal 15, which is how we clean up the remaining processes when we
detect that something really bad (like a seg fault) happened in the
parallel job.

> ---------- snip --------------
> /home/johnm/local/ompi/bin/mpirun -hostfile lamhosts.all.r360 -np 4
> ./ompi-crash 16384 1 3000
> { 'groupsize' : 4, 'count' : 16384, 'bytes' : 65536, 'bufbytes' :
> 262144, 'iters' : 3000, 'bmno' : 1
> [compute-0-0][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed with errno=104
> mpirun noticed that job rank 0 with PID 16366 on node compute-0-0
> exited
> on signal 15 (Terminated).
> 3 additional processes aborted (not shown)
> ---------- snip --------------
>> One more thing: doing a lot of collectives in a loop and computing
>> the total time is not the correct way to evaluate the cost of any
>> collective communication, simply because you will favor all
>> algorithms based on pipelining. There is plenty of literature about
>> this topic.
>> george.
> As I said in the original e-mail: I had only thrown them in for a
> bit of
> sanity checking. I expected funny numbers, but not that OpenMPI would
> crash.
> The original idea was just to make a quick comparison of Allreduce,
> Allgather and Alltoall in LAM and OpenMPI. The opportunity for
> pipelining the operations there is rather small since they can't get
> much out of phase with each other.

There are many differences between rooted and non-rooted collectives.
All the errors that you have reported so far are related to rooted
collectives, which makes sense. I didn't say that it is normal for
Open MPI to misbehave here. I wonder if you can get such errors with
non-rooted collectives (such as allreduce, allgather, and alltoall),
or with messages larger than the eager size?

If you type "ompi_info --param btl tcp", you will see the eager size
for the TCP BTL. Everything smaller than this size will be sent
eagerly, and therefore has the opportunity to become unexpected on
the receiver side, which can lead to this problem. As a quick test,
you can add "--mca btl_tcp_eager_limit 2048" to your mpirun command
line, and this problem should not happen for any size over 2K. This
was the original solution for the flow-control problem. If you know
your application will generate thousands of unexpected messages, then
you should set the eager limit to zero.
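Concretely, the two steps above look like this (the flag names are the ones given in this thread; the hostfile and program arguments are copied from the earlier snippet, and the default eager limit varies by Open MPI version):

```shell
# Inspect the TCP BTL parameters, including the eager limit
ompi_info --param btl tcp | grep eager

# Lower the eager limit to 2 KB so that messages larger than 2K use
# the rendezvous protocol instead of being sent eagerly
mpirun --mca btl_tcp_eager_limit 2048 \
       -hostfile lamhosts.all.r360 -np 4 ./ompi-crash 16384 1 3000
```
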


> Regards,
> --
> // John Markus Bjørndalen
> //
> _______________________________________________
> users mailing list
> users_at_[hidden]
