Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI 1.2.5 race condition / core dump with MPI_Reduce and MPI_Gather
From: John Markus Bjørndalen (jmb_at_[hidden])
Date: 2008-02-29 10:20:09


George Bosilca wrote:

[.....]
>
> I don't think the root crashed. I guess that one of the other nodes
> crashed, the root got a bad socket (which is what the first error
> message seems to indicate), and got terminated. As the output is not
> synchronized between the nodes, one cannot rely on its order or
> contents. Moreover, mpirun reports that the root was killed with signal
> 15, which is how we clean up the remaining processes when we detect
> that something really bad (like a seg fault) happened in the parallel
> application.
>
Sorry, I should have rephrased that as a question ("is it the root?").
I'm not that familiar with the debug output of OpenMPI yet, so I
included it in case somebody made more sense of it than me.

>
> There are many differences between the rooted and non-rooted
> collectives. All errors that you reported so far are related to rooted
> collectives, which makes sense. I didn't state that it is normal that
> Open MPI do not behave [sic]. I wonder if you can get such errors with
> non-rooted collectives (such as allreduce, allgather and alltoall), or
> with messages larger than the eager size?
You're right, I haven't seen any crashes with the All*-variants.

TCP eager limit is set to 65536 (output from ompi_info):

     MCA btl: parameter "btl_tcp_eager_limit" (current value: "65536")
     MCA btl: parameter "btl_tcp_min_send_size" (current value: "65536")
     MCA btl: parameter "btl_tcp_max_send_size" (current value: "131072")

I observed crashes with Broadcasts and Reduces of 131072 bytes. I'm
playing around with larger messages now, and while Reduce with 16 nodes
seems stable with 262144-byte messages, it still crashes with 44 nodes.

>
> If you type "ompi_info --param btl tcp", you will see what the
> eager size is for the TCP BTL. Everything smaller than this size will be
> sent eagerly, has the opportunity to become unexpected on the
> receiver side, and can lead to this problem. As a quick test, you can
> add "--mca btl_tcp_eager_limit 2048" to your mpirun command line, and
> this problem will not happen for sizes over 2K. This was the
> original solution for the flow control problem. If you know your
> application will generate thousands of unexpected messages, then you
> should set the eager limit to zero.
I tried running Reduce with 4096 ints (16384 bytes), 16 nodes and eager
limit 2048:

mpirun -hostfile lamhosts.all.r360 -np 16 --mca btl_tcp_eager_limit 2048 ./ompi-crash 4096 2 3000
{ 'groupsize' : 16, 'count' : 4096, 'bytes' : 16384, 'bufbytes' : 262144, 'iters' : 3000, 'bmno' : 2
[compute-2-2][0,1,10][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]
[compute-3-2][0,1,14][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed with errno=104
mca_btl_tcp_frag_recv: readv failed with errno=104
mpirun noticed that job rank 0 with PID 30407 on node compute-0-0 exited on signal 15 (Terminated).
15 additional processes aborted (not shown)

This one tries to run Reduce with 1 integer per node and also crashes
(with eager size 0):

/mpirun -hostfile lamhosts.all.r360 -np 16 --mca btl_tcp_eager_limit 0 ./ompi-crash 1 2 3000
...

This is puzzling.
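
To make the puzzle concrete, here is a small sketch (Python, not part of
the ompi-crash test program) that checks the message sizes from the runs
above against the eager limit in effect. It assumes 4-byte ints, which
matches 4096 ints = 16384 bytes, and it uses the simplified rule from the
quoted explanation ("everything smaller than this size is sent eagerly").
The real TCP BTL also accounts for header overhead, which is ignored
here, and the helper names are made up for illustration.

```python
# Assumption: MPI_INT is 4 bytes (consistent with 4096 ints = 16384 bytes).
INT_BYTES = 4


def message_bytes(count, elem_bytes=INT_BYTES):
    """Payload size in bytes for `count` elements."""
    return count * elem_bytes


def is_eager(nbytes, eager_limit):
    """Simplified rule: payloads smaller than the eager limit go eagerly.

    This ignores any per-message header overhead, so it is only a rough
    approximation of what the BTL actually does.
    """
    return nbytes < eager_limit


# (description, payload bytes, eager limit) for the crashing runs above.
cases = [
    ("Reduce, 4096 ints, eager limit 2048", message_bytes(4096), 2048),
    ("Reduce, 1 int, eager limit 0", message_bytes(1), 0),
    ("Bcast/Reduce, 131072 bytes, default limit", 131072, 65536),
]

for label, nbytes, limit in cases:
    mode = "eager" if is_eager(nbytes, limit) else "not eager"
    print(f"{label}: {nbytes} bytes -> {mode}")
```

Under this simplified rule, none of the three crashing configurations
would be sent eagerly, so the eager-message explanation alone does not
seem to account for them.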

I'm mostly familiarizing myself with OpenMPI at the moment, as well as
poking around to see how the collective operations work and perform
compared to LAM. Partly because I have some ideas I'd like to test out,
and partly because I'm considering moving some student exercises over
from LAM to OpenMPI. I don't expect to write actual applications that
treat MPI like this myself, but on the other hand, not having to do
throttling on top of MPI could be an advantage in some application
patterns.

Regards,

-- 
// John Markus Bjørndalen
// http://www.cs.uit.no/~johnm/