Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI 1.2.5 race condition / core dump with MPI_Reduce and MPI_Gather
From: George Bosilca (bosilca_at_[hidden])
Date: 2008-02-27 14:32:22


Brian is completely right. Here is a more detailed description of the
problem.

Upon receiving a fragment from the BTL (the lower layer) we try to match
it with an MPI request. If the match fails, we get a fragment from the
free list (via the blocking call to FREE_LIST_WAIT) and copy the data
into it. When a node is flooded with unexpected messages, we will end up
using the whole memory of the host (or reaching the maximum size of the
free list) and will be unable to allocate a fragment for the unexpected
message. In order to release some elements back to the free list we have
to make progress, which here means calling opal_progress. This is
actually quite bad, as progressing the network will NOT release any
unexpected messages. Instead, we should go up to the matching layer and
make sure that the original request can now be matched, which would free
one spot in the free list.
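
To make the failure mode concrete, here is a minimal, self-contained
sketch of that call cycle (all names are hypothetical stand-ins for
opal_progress, the BTL receive handler and OMPI_FREE_LIST_WAIT; this is
NOT the actual Open MPI code):

----------- snip-------------
#include <stdio.h>
#include <stdlib.h>

static int free_slots = 0;   /* the free list is exhausted            */
static int flood = 1;        /* the sender keeps the fragments coming */
static int depth = 0;        /* recursion depth, for the demo only    */

static void recv_handler(void);

/* Stand-in for opal_progress(): drains network events, i.e. invokes
 * the receive handler again while the flood lasts. */
static void progress(void)
{
    if (flood)
        recv_handler();
}

/* Stand-in for OMPI_FREE_LIST_WAIT(): "blocks" by spinning on progress
 * until a slot appears -- but progress never frees a slot, it only
 * delivers more unexpected fragments. */
static void free_list_wait(void)
{
    while (free_slots == 0)
        progress();          /* re-enters recv_handler: recursion */
}

/* Stand-in for the BTL receive handler: an unexpected fragment
 * arrives, the match fails, and we block waiting for a free-list
 * element. */
static void recv_handler(void)
{
    if (++depth > 10000) {   /* cap the demo; the real code has no
                              * such check and runs off the stack  */
        fprintf(stderr, "recursed %d frames deep, giving up\n", depth);
        exit(1);
    }
    free_list_wait();
}

int main(void)
{
    recv_handler();
    return 0;
}
----------- snip-------------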

The problem with that matching-layer approach is that the BTL would have
to be modified to retain a fragment when the PML is unable to take care
of it (i.e. to copy it into user memory or into the preallocated space
used for unexpected messages).

On the other hand, I hope that not many users write such applications.
Overloading one process with messages is the best way to completely
kill the performance of any MPI implementation, and it is exactly what
MPI_Reduce and MPI_Gather do: one process gets the final result and all
other processes only have to send some data. This behavior only arises
when the gather or the reduce uses a very flat tree, and only for short
messages. Because the messages are short there is no handshake between
sender and receiver, which makes every message unexpected, and the flat
tree guarantees that there will be a lot of them. If you add a barrier
every now and then (say, every 100 iterations) this problem will never
happen.
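
For example, the kind of flooding loop that triggers this can be
throttled as follows (a sketch; the interval of 100 iterations is just
the rule of thumb above, not a tuned value):

----------- snip-------------
#include <mpi.h>

int main(int argc, char *argv[])
{
    int i, sbuf = 1, rbuf = 0;

    MPI_Init(&argc, &argv);

    for (i = 0; i < 10000; i++) {
        MPI_Reduce(&sbuf, &rbuf, 1, MPI_INT, MPI_SUM, 0,
                   MPI_COMM_WORLD);
        /* Every 100 iterations let the root catch up: the barrier
         * forces a round trip, so the senders cannot run arbitrarily
         * far ahead and pile up unexpected messages at the root. */
        if (i % 100 == 99)
            MPI_Barrier(MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
----------- snip-------------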

One more thing: running a lot of collectives in a loop and computing
the total time is not the correct way to evaluate the cost of a
collective communication, simply because you will favor all algorithms
based on pipelining. There is plenty of literature on this topic.
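
For illustration, one common pattern (a sketch, not a prescription) is
to synchronize before each operation and charge each iteration the time
of the slowest rank, so that consecutive calls cannot overlap in a
pipeline:

----------- snip-------------
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    const int iters = 1000;
    int rank, i, sbuf = 1, rbuf = 0;
    double t, tmax = 0.0, sum = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < iters; i++) {
        MPI_Barrier(MPI_COMM_WORLD);  /* no overlap between calls */
        t = MPI_Wtime();
        MPI_Reduce(&sbuf, &rbuf, 1, MPI_INT, MPI_SUM, 0,
                   MPI_COMM_WORLD);
        t = MPI_Wtime() - t;
        /* the slowest rank determines the cost of this iteration */
        MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0,
                   MPI_COMM_WORLD);
        if (rank == 0)
            sum += tmax;
    }

    if (rank == 0)
        printf("mean MPI_Reduce time: %g us\n", 1e6 * sum / iters);

    MPI_Finalize();
    return 0;
}
----------- snip-------------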

   george.

On Feb 27, 2008, at 11:01 AM, Brian W. Barrett wrote:

> I played with this to fix some things in ORTE at one point, and it's
> a very dangerous slope -- you're essentially guaranteeing you have a
> deadlock case. Now instead of running off the stack, you'll
> deadlock. The issue is that we call opal_progress to wait for
> something to happen deep in the bowels of Open MPI. If you just
> don't call opal_progress, well, now that state change will never
> occur and you're really screwed. Instead of a quick error, you get
> an infinite spin -- definitely a worse situation.
>
> The only solution to this problem is to suck it up and audit all the
> code to eliminate calls to opal_progress() in situations where
> infinite recursion can result. It's going to be long and painful,
> but there's no quick fix (IMHO).
>
> Brian
>
> On Wed, 27 Feb 2008, Jeff Squyres wrote:
>
>> Gleb / George --
>>
>> Is there an easy way for us to put a cap on max recursion down in
>> opal_progress? Just put in a counter in opal_progress() such that if
>> it exceeds some max value, return success without doing anything (if
>> opal_progress_event_flag indicates that nothing *needs* to be done)?
>>
>> This wouldn't *guarantee* that we cap recursion (because we sometimes
>> need progress before returning), but it might help for the cases that
>> are described in this mail (we're eagerly progressing when we don't
>> necessarily need to)...?
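>>
>> Something along these lines -- a hypothetical sketch of the idea
>> only, not the real opal_progress() internals:
>>
>> ----------- snip-------------
>> static int depth = 0;            /* current recursion depth */
>> #define MAX_DEPTH 32             /* "some max value"        */
>>
>> int progress(void);
>>
>> /* stand-in for the event loop: here it just recurses back into
>>  * progress(), the way a BTL callback can */
>> static int poll_event_loop(void)
>> {
>>     return progress();
>> }
>>
>> int progress(void)
>> {
>>     int count;
>>
>>     if (depth >= MAX_DEPTH)
>>         return 0;                /* pretend success, do nothing */
>>     depth++;
>>     count = poll_event_loop();   /* may recurse back into here  */
>>     depth--;
>>     return count;
>> }
>> ----------- snip-------------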
>>
>>
>> On Feb 22, 2008, at 9:48 AM, John Markus Bjørndalen wrote:
>>
>>> Hi,
>>>
>>> I ran into a bug when running a few microbenchmarks for OpenMPI. I
>>> had
>>> thrown in Reduce and Gather for sanity checking, but OpenMPI crashed
>>> when running those operations. Usually, this would happen when I
>>> reached
>>> around 12-16 nodes.
>>>
>>> My current crash-test code looks like this (I've removed a few lines
>>> that were commented out):
>>>
>>> ----------- snip-------------
>>> #include <mpi.h>
>>> #include <stdlib.h>
>>> #include <stdio.h>
>>> #include <assert.h>
>>> #include <unistd.h>
>>>
>>> int main(int argc, char *argv[])
>>> {
>>>     int rank, size, count = 1;
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>
>>>     if (argc > 1)
>>>         count = atoi(argv[1]);
>>>
>>>     int n, i;
>>>     // Just make sure we have plenty of buffer for any operation
>>>     int *sbuf = malloc(sizeof(int) * 2 * count);
>>>     int *rbuf = malloc(sizeof(int) * 2 * count);
>>>     assert(sbuf);
>>>     assert(rbuf);
>>>
>>>     for (n = 1; n <= 10000; n += 100) {
>>>         printf("N = %d\n", n);
>>>         fflush(stdout);
>>>         for (i = 0; i < n; i++) {
>>>             MPI_Reduce(sbuf, rbuf, count, MPI_INT, MPI_SUM, 0,
>>>                        MPI_COMM_WORLD);
>>>         }
>>>         MPI_Barrier(MPI_COMM_WORLD);
>>>         printf(" -- DONE\n");
>>>         fflush(stdout);
>>>         MPI_Barrier(MPI_COMM_WORLD);
>>>     }
>>>     MPI_Finalize();
>>>     return 0;
>>> }
>>> ----------- snip-------------
>>>
>>>
>>> I can usually trigger a crash with count=1 and n=1000 using 16+
>>> nodes, but I can also trigger it with 44 nodes and larger packets
>>> (around 32k ints, I think). I can also crash it on a single host
>>> using 19 processes, but then it usually doesn't crash until I reach
>>> somewhere between 1200-3000 iterations. Gather seems to have the
>>> same problems as Reduce.
>>>
>>> The output from running gdb on the coredump looks like this:
>>>
>>> ----------- snip-------------
>>> Using host libthread_db library "/lib/tls/libthread_db.so.1".
>>> Core was generated by `./ompi-crash2'.
>>> Program terminated with signal 11, Segmentation fault.
>>> #0  0x00434184 in sysconf () from /lib/tls/libc.so.6
>>> #0  0x00434184 in sysconf () from /lib/tls/libc.so.6
>>> #1  0xb7e78b59 in _int_malloc () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #2  0xb7e799ce in malloc () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #3  0xb7f04852 in ompi_free_list_grow () from /home/johnm/local/ompi/lib/libmpi.so.0
>>> #4  0xb7d74e70 in mca_btl_tcp_endpoint_recv_handler () from /home/johnm/local/ompi/lib/openmpi/mca_btl_tcp.so
>>> #5  0xb7e62b44 in opal_event_base_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #6  0xb7e62cff in opal_event_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #7  0xb7e5d284 in opal_progress () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #8  0xb7d74f08 in mca_btl_tcp_endpoint_recv_handler () from /home/johnm/local/ompi/lib/openmpi/mca_btl_tcp.so
>>> #9  0xb7e62b44 in opal_event_base_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #10 0xb7e62cff in opal_event_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #11 0xb7e5d284 in opal_progress () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #12 0xb7d74f08 in mca_btl_tcp_endpoint_recv_handler () from /home/johnm/local/ompi/lib/openmpi/mca_btl_tcp.so
>>> #13 0xb7e62b44 in opal_event_base_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #14 0xb7e62cff in opal_event_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>>
>>> ... and then continues until...
>>>
>>> #1356848 0xb7e5d284 in opal_progress () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #1356849 0xb7d8f389 in mca_pml_ob1_recv_frag_match () from /home/johnm/local/ompi/lib/openmpi/mca_pml_ob1.so
>>> #1356850 0xb7d74a7d in mca_btl_tcp_endpoint_recv_handler () from /home/johnm/local/ompi/lib/openmpi/mca_btl_tcp.so
>>> #1356851 0xb7e62b44 in opal_event_base_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #1356852 0xb7e62cff in opal_event_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #1356853 0xb7e5d284 in opal_progress () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #1356854 0xb7d8f389 in mca_pml_ob1_recv_frag_match () from /home/johnm/local/ompi/lib/openmpi/mca_pml_ob1.so
>>> #1356855 0xb7d74a7d in mca_btl_tcp_endpoint_recv_handler () from /home/johnm/local/ompi/lib/openmpi/mca_btl_tcp.so
>>> #1356856 0xb7e62b44 in opal_event_base_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #1356857 0xb7e62cff in opal_event_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #1356858 0xb7e5d284 in opal_progress () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #1356859 0xb7d8f389 in mca_pml_ob1_recv_frag_match () from /home/johnm/local/ompi/lib/openmpi/mca_pml_ob1.so
>>> #1356860 0xb7d74a7d in mca_btl_tcp_endpoint_recv_handler () from /home/johnm/local/ompi/lib/openmpi/mca_btl_tcp.so
>>> #1356861 0xb7e62b44 in opal_event_base_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #1356862 0xb7e62cff in opal_event_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #1356863 0xb7e5d284 in opal_progress () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #1356864 0xb7d8cb69 in mca_pml_ob1_recv () from /home/johnm/local/ompi/lib/openmpi/mca_pml_ob1.so
>>> #1356865 0xb7d5bb1c in ompi_coll_tuned_reduce_intra_basic_linear () from /home/johnm/local/ompi/lib/openmpi/mca_coll_tuned.so
>>> #1356866 0xb7d55913 in ompi_coll_tuned_reduce_intra_dec_fixed () from /home/johnm/local/ompi/lib/openmpi/mca_coll_tuned.so
>>> #1356867 0xb7f3db6c in PMPI_Reduce () from /home/johnm/local/ompi/lib/libmpi.so.0
>>> #1356868 0x0804899e in main (argc=1, argv=0xbfba8a84) at ompi-crash2.c:58
>>> ----------- snip-------------
>>>
>>> I poked around in the code, and it looks like the culprit might be
>>> in the macros that try to allocate fragments in
>>> mca_pml_ob1_recv_frag_match: MCA_PML_OB1_RECV_FRAG_ALLOC and
>>> MCA_PML_OB1_RECV_FRAG_INIT use OMPI_FREE_LIST_WAIT, which in turn
>>> can end up calling opal_condition_wait(). opal_condition_wait()
>>> calls opal_progress() to "block", which looks like it leads to
>>> infinite recursion in this case.
>>>
>>> I guess the problem is a race condition when one node is hammered
>>> with
>>> incoming packets.
>>>
>>> The stack trace contains about 1.35 million lines, so I won't
>>> include
>>> all of it here, but here's some statistics to verify that not much
>>> else
>>> is happening in that stack (I can make the full trace available if
>>> anybody needs it):
>>>
>>> ----------- snip-------------
>>> Number of callframes: 1356870
>>> Called function statistics (how often in stackdump):
>>> PMPI_Reduce                                     1
>>> _int_malloc                                     1
>>> main                                            1
>>> malloc                                          1
>>> mca_btl_tcp_endpoint_recv_handler          339197
>>> mca_pml_ob1_recv                                1
>>> mca_pml_ob1_recv_frag_match                    72
>>> ompi_coll_tuned_reduce_intra_basic_linear       1
>>> ompi_coll_tuned_reduce_intra_dec_fixed          1
>>> ompi_free_list_grow                             1
>>> opal_event_base_loop                       339197
>>> opal_event_loop                            339197
>>> opal_progress                              339197
>>> sysconf                                         2
>>> Address statistics (how often in stackdump), plus functions with
>>> that addr (sanity check):
>>> 0x00434184       2  set(['sysconf'])
>>> 0x0804899e       1  set(['main'])
>>> 0xb7d55913       1  set(['ompi_coll_tuned_reduce_intra_dec_fixed'])
>>> 0xb7d5bb1c       1  set(['ompi_coll_tuned_reduce_intra_basic_linear'])
>>> 0xb7d74a7d      72  set(['mca_btl_tcp_endpoint_recv_handler'])
>>> 0xb7d74e70       1  set(['mca_btl_tcp_endpoint_recv_handler'])
>>> 0xb7d74f08  339124  set(['mca_btl_tcp_endpoint_recv_handler'])
>>> 0xb7d8cb69       1  set(['mca_pml_ob1_recv'])
>>> 0xb7d8f389      72  set(['mca_pml_ob1_recv_frag_match'])
>>> 0xb7e5d284  339197  set(['opal_progress'])
>>> 0xb7e62b44  339197  set(['opal_event_base_loop'])
>>> 0xb7e62cff  339197  set(['opal_event_loop'])
>>> 0xb7e78b59       1  set(['_int_malloc'])
>>> 0xb7e799ce       1  set(['malloc'])
>>> 0xb7f04852       1  set(['ompi_free_list_grow'])
>>> ----------- snip-------------
>>>
>>> I don't have any suggestions for a fix though, since this is the
>>> first
>>> time I've looked into the OpenMPI code.
>>>
>>> Btw, in case it makes a difference for triggering the bug: I'm
>>> running this on a cluster with 1 frontend and 44 nodes. The cluster
>>> runs Rocks 4.1, and each node is a 3.2GHz P4 Prescott machine with
>>> 2GB RAM, connected with gigabit Ethernet.
>>>
>>>
>>> Regards,
>>>
>>> --
>>> // John Markus Bjørndalen
>>> // http://www.cs.uit.no/~johnm/
>>>
>>>


