Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI 1.2.5 race condition / core dump with MPI_Reduce and MPI_Gather
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-02-27 11:04:28


Bummer; ok.

On Feb 27, 2008, at 11:01 AM, Brian W. Barrett wrote:

> I played with this to fix some things in ORTE at one point, and it's
> a very dangerous slope -- you're essentially guaranteeing you have a
> deadlock case. Now instead of running off the stack, you'll
> deadlock. The issue is that we call opal_progress to wait for
> something to happen deep in the bowels of Open MPI. If you just
> don't call opal_progress, well, now that state change will never
> occur and you're really screwed. Instead of a quick error, you get
> an infinite spin -- definitely a worse situation.
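>
> To make that concrete: the blocking pattern deep inside the library
> is essentially the loop below (an illustrative sketch only -- the
> names are made-up stand-ins, not real OPAL symbols):
>
> ----------- snip-------------
> /* Sketch of how the library "blocks" in single-threaded builds. */
> #include <stdbool.h>
>
> static volatile bool request_complete = false;
>
> /* Stand-in for opal_progress(): polls the network and runs any
>  * callbacks, one of which may eventually set request_complete. */
> static void progress(void)
> {
>     /* poll the BTLs, fire event callbacks, ... */
> }
>
> static void wait_for_completion(void)
> {
>     while (!request_complete) {
>         /* The *only* way request_complete can become true is via
>          * work done inside progress().  If progress() is made to
>          * return early without doing anything, this spins forever. */
>         progress();
>     }
> }
> ----------- snip-------------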
>
> The only solution to this problem is to suck it up and audit all the
> code to eliminate calls to opal_progress() in situations where
> infinite recursion can result. It's going to be long and painful,
> but there's no quick fix (IMHO).
>
> Brian
>
> On Wed, 27 Feb 2008, Jeff Squyres wrote:
>
>> Gleb / George --
>>
>> Is there an easy way for us to put a cap on max recursion down in
>> opal_progress? Just put in a counter in opal_progress() such that if
>> it exceeds some max value, return success without doing anything (if
>> opal_progress_event_flag indicates that nothing *needs* to be done)?
>>
>> This wouldn't *guarantee* that we cap recursion (because we sometimes
>> need progress before returning), but it might help for the cases that
>> are described in this mail (we're eagerly progressing when we don't
>> necessarily need to)...?
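>>
>> Something along these lines is what I have in mind (untested, and
>> the names here are invented stand-ins, not actual OPAL symbols):
>>
>> ----------- snip-------------
>> /* Sketch of a recursion cap for a progress function. */
>> #define MAX_PROGRESS_DEPTH 32
>>
>> static int progress_depth = 0;   /* would be per-thread for real */
>>
>> static void progress(void)
>> {
>>     /* Already nested too deep in progress calls?  If nothing
>>      * *requires* progress right now, just return instead of
>>      * recursing further. */
>>     if (progress_depth >= MAX_PROGRESS_DEPTH) {
>>         return;
>>     }
>>
>>     ++progress_depth;
>>     /* ... run the event loop / poll the BTLs here; callbacks may
>>      * re-enter progress(), which is where the depth counter
>>      * kicks in ... */
>>     --progress_depth;
>> }
>> ----------- snip-------------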
>>
>>
>> On Feb 22, 2008, at 9:48 AM, John Markus Bjørndalen wrote:
>>
>>> Hi,
>>>
>>> I ran into a bug when running a few microbenchmarks for OpenMPI. I
>>> had
>>> thrown in Reduce and Gather for sanity checking, but OpenMPI crashed
>>> when running those operations. Usually, this would happen when I
>>> reached
>>> around 12-16 nodes.
>>>
>>> My current crash-test code looks like this (I've removed a few lines
>>> that were commented out):
>>>
>>> ----------- snip-------------
>>> #include <mpi.h>
>>> #include <stdlib.h>
>>> #include <stdio.h>
>>> #include <assert.h>
>>> #include <unistd.h>
>>>
>>> int main(int argc, char *argv[])
>>> {
>>>     int rank, size, count = 1;
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>
>>>     if (argc > 1)
>>>         count = atoi(argv[1]);
>>>
>>>     int n, i;
>>>     // Just make sure we have plenty of buffer for any operation
>>>     int *sbuf = malloc(sizeof(int) * 2 * count);
>>>     int *rbuf = malloc(sizeof(int) * 2 * count);
>>>     assert(sbuf);
>>>     assert(rbuf);
>>>
>>>     for (n = 1; n <= 10000; n += 100) {
>>>         printf("N = %d\n", n);
>>>         fflush(stdout);
>>>         for (i = 0; i < n; i++) {
>>>             MPI_Reduce(sbuf, rbuf, count, MPI_INT, MPI_SUM, 0,
>>>                        MPI_COMM_WORLD);
>>>         }
>>>         MPI_Barrier(MPI_COMM_WORLD);
>>>         printf(" -- DONE\n");
>>>         fflush(stdout);
>>>         MPI_Barrier(MPI_COMM_WORLD);
>>>     }
>>>     MPI_Finalize();
>>>     return 0;
>>> }
>>> ----------- snip-------------
>>>
>>>
>>> I can usually trigger a crash with count=1 and n=1000 using 16+
>>> nodes, but I can also trigger it with 44 nodes and larger packets
>>> (around 32k ints, I think). I can also crash it on a single host
>>> using 19 processes, but then it usually doesn't crash until I reach
>>> somewhere between 1200-3000 iterations. Gather seems to have the
>>> same problems as Reduce.
>>>
>>> The output from running gdb on the coredump looks like this:
>>>
>>> ----------- snip-------------
>>> Using host libthread_db library "/lib/tls/libthread_db.so.1".
>>> Core was generated by `./ompi-crash2'.
>>> Program terminated with signal 11, Segmentation fault.
>>> #0 0x00434184 in sysconf () from /lib/tls/libc.so.6
>>> #0 0x00434184 in sysconf () from /lib/tls/libc.so.6
>>> #1 0xb7e78b59 in _int_malloc () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #2 0xb7e799ce in malloc () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #3 0xb7f04852 in ompi_free_list_grow () from /home/johnm/local/ompi/lib/libmpi.so.0
>>> #4 0xb7d74e70 in mca_btl_tcp_endpoint_recv_handler () from /home/johnm/local/ompi/lib/openmpi/mca_btl_tcp.so
>>> #5 0xb7e62b44 in opal_event_base_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #6 0xb7e62cff in opal_event_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #7 0xb7e5d284 in opal_progress () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #8 0xb7d74f08 in mca_btl_tcp_endpoint_recv_handler () from /home/johnm/local/ompi/lib/openmpi/mca_btl_tcp.so
>>> #9 0xb7e62b44 in opal_event_base_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #10 0xb7e62cff in opal_event_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #11 0xb7e5d284 in opal_progress () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #12 0xb7d74f08 in mca_btl_tcp_endpoint_recv_handler () from /home/johnm/local/ompi/lib/openmpi/mca_btl_tcp.so
>>> #13 0xb7e62b44 in opal_event_base_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #14 0xb7e62cff in opal_event_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>>
>>> ... and then continues until...
>>>
>>> #1356848 0xb7e5d284 in opal_progress () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #1356849 0xb7d8f389 in mca_pml_ob1_recv_frag_match () from /home/johnm/local/ompi/lib/openmpi/mca_pml_ob1.so
>>> #1356850 0xb7d74a7d in mca_btl_tcp_endpoint_recv_handler () from /home/johnm/local/ompi/lib/openmpi/mca_btl_tcp.so
>>> #1356851 0xb7e62b44 in opal_event_base_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #1356852 0xb7e62cff in opal_event_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #1356853 0xb7e5d284 in opal_progress () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #1356854 0xb7d8f389 in mca_pml_ob1_recv_frag_match () from /home/johnm/local/ompi/lib/openmpi/mca_pml_ob1.so
>>> #1356855 0xb7d74a7d in mca_btl_tcp_endpoint_recv_handler () from /home/johnm/local/ompi/lib/openmpi/mca_btl_tcp.so
>>> #1356856 0xb7e62b44 in opal_event_base_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #1356857 0xb7e62cff in opal_event_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #1356858 0xb7e5d284 in opal_progress () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #1356859 0xb7d8f389 in mca_pml_ob1_recv_frag_match () from /home/johnm/local/ompi/lib/openmpi/mca_pml_ob1.so
>>> #1356860 0xb7d74a7d in mca_btl_tcp_endpoint_recv_handler () from /home/johnm/local/ompi/lib/openmpi/mca_btl_tcp.so
>>> #1356861 0xb7e62b44 in opal_event_base_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #1356862 0xb7e62cff in opal_event_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #1356863 0xb7e5d284 in opal_progress () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>> #1356864 0xb7d8cb69 in mca_pml_ob1_recv () from /home/johnm/local/ompi/lib/openmpi/mca_pml_ob1.so
>>> #1356865 0xb7d5bb1c in ompi_coll_tuned_reduce_intra_basic_linear () from /home/johnm/local/ompi/lib/openmpi/mca_coll_tuned.so
>>> #1356866 0xb7d55913 in ompi_coll_tuned_reduce_intra_dec_fixed () from /home/johnm/local/ompi/lib/openmpi/mca_coll_tuned.so
>>> #1356867 0xb7f3db6c in PMPI_Reduce () from /home/johnm/local/ompi/lib/libmpi.so.0
>>> #1356868 0x0804899e in main (argc=1, argv=0xbfba8a84) at ompi-crash2.c:58
>>> ----------- snip-------------
>>>
>>> I poked around in the code, and it looks like the culprit might be
>>> in the macros that try to allocate fragments in
>>> mca_pml_ob1_recv_frag_match: MCA_PML_OB1_RECV_FRAG_ALLOC and
>>> MCA_PML_OB1_RECV_FRAG_INIT use OMPI_FREE_LIST_WAIT, which in turn
>>> can end up calling opal_condition_wait(). opal_condition_wait()
>>> calls opal_progress() to "block", which looks like it leads to
>>> infinite recursion in this case.
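>>>
>>> In other words, the cycle seems to be roughly the one sketched
>>> below (illustrative stand-in code only -- the real logic lives in
>>> opal_progress(), mca_btl_tcp_endpoint_recv_handler() and the
>>> OMPI_FREE_LIST_WAIT macro):
>>>
>>> ----------- snip-------------
>>> /* Sketch of the recursion: each frame calls back into the next. */
>>> static int free_list_empty = 1;   /* stays "empty" under load       */
>>>
>>> static void progress(void);       /* stand-in for opal_progress()   */
>>> static void recv_handler(void);   /* ...tcp_endpoint_recv_handler() */
>>>
>>> static void alloc_fragment(void)  /* ...OMPI_FREE_LIST_WAIT         */
>>> {
>>>     while (free_list_empty) {
>>>         progress();               /* "block" by progressing...      */
>>>     }
>>> }
>>>
>>> static void recv_handler(void)
>>> {
>>>     alloc_fragment();             /* needs a fragment to match      */
>>> }
>>>
>>> static void progress(void)
>>> {
>>>     recv_handler();               /* more data pending -> recurse   */
>>> }
>>> ----------- snip-------------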
>>>
>>> I guess the problem is a race condition when one node is hammered
>>> with
>>> incoming packets.
>>>
>>> The stack trace contains about 1.35 million lines, so I won't
>>> include all of it here, but here are some statistics to verify that
>>> not much else is happening in that stack (I can make the full trace
>>> available if anybody needs it):
>>>
>>> ----------- snip-------------
>>> Number of callframes: 1356870
>>> Called function statistics (how often in stackdump):
>>> PMPI_Reduce 1
>>> _int_malloc 1
>>> main 1
>>> malloc 1
>>> mca_btl_tcp_endpoint_recv_handler 339197
>>> mca_pml_ob1_recv 1
>>> mca_pml_ob1_recv_frag_match 72
>>> ompi_coll_tuned_reduce_intra_basic_linear 1
>>> ompi_coll_tuned_reduce_intra_dec_fixed 1
>>> ompi_free_list_grow 1
>>> opal_event_base_loop 339197
>>> opal_event_loop 339197
>>> opal_progress 339197
>>> sysconf 2
>>> Address statistics (how often in stackdump), plus functions with
>>> that addr (sanity check):
>>>   0x00434184       2  set(['sysconf'])
>>>   0x0804899e       1  set(['main'])
>>>   0xb7d55913       1  set(['ompi_coll_tuned_reduce_intra_dec_fixed'])
>>>   0xb7d5bb1c       1  set(['ompi_coll_tuned_reduce_intra_basic_linear'])
>>>   0xb7d74a7d      72  set(['mca_btl_tcp_endpoint_recv_handler'])
>>>   0xb7d74e70       1  set(['mca_btl_tcp_endpoint_recv_handler'])
>>>   0xb7d74f08  339124  set(['mca_btl_tcp_endpoint_recv_handler'])
>>>   0xb7d8cb69       1  set(['mca_pml_ob1_recv'])
>>>   0xb7d8f389      72  set(['mca_pml_ob1_recv_frag_match'])
>>>   0xb7e5d284  339197  set(['opal_progress'])
>>>   0xb7e62b44  339197  set(['opal_event_base_loop'])
>>>   0xb7e62cff  339197  set(['opal_event_loop'])
>>>   0xb7e78b59       1  set(['_int_malloc'])
>>>   0xb7e799ce       1  set(['malloc'])
>>>   0xb7f04852       1  set(['ompi_free_list_grow'])
>>>   0xb7f3db6c       1  set(['PMPI_Reduce'])
>>> ----------- snip-------------
>>>
>>> I don't have any suggestions for a fix though, since this is the
>>> first
>>> time I've looked into the OpenMPI code.
>>>
>>> Btw., in case it makes a difference for triggering the bug: I'm
>>> running this on a cluster with 1 frontend and 44 nodes. The cluster
>>> runs Rocks 4.1, and each node is a 3.2GHz P4 Prescott machine with
>>> 2GB RAM, connected with gigabit Ethernet.
>>>
>>>
>>> Regards,
>>>
>>> --
>>> // John Markus Bjørndalen
>>> // http://www.cs.uit.no/~johnm/
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
Cisco Systems