Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI 1.2.5 race condition / core dump with MPI_Reduce and MPI_Gather
From: Brian W. Barrett (brbarret_at_[hidden])
Date: 2008-02-27 11:01:06


I played with this to fix some things in ORTE at one point, and it's a
very dangerous slope -- you're essentially guaranteeing that you have a
deadlock case. Now, instead of running off the stack, you'll deadlock.
The issue is that we call opal_progress to wait for something to happen
deep in the bowels of Open MPI. If you just don't call opal_progress,
that state change will never occur and you're really stuck. Instead of a
quick error, you get an infinite spin -- definitely a worse situation.
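
To make that wait pattern concrete, here is a minimal sketch of the
"block by progressing" idiom -- not the actual Open MPI source; the
completion flag is a hypothetical stand-in for the internal state the
library is waiting on:

/* Minimal sketch of the "block by progressing" idiom -- not the real
 * Open MPI code.  The volatile flag stands in for whatever internal
 * state the library is waiting on; the opal_progress() prototype is
 * simplified for this sketch.                                         */
extern int opal_progress(void);

static void wait_for_completion(volatile int *complete)
{
    while (!*complete) {
        /* The flag can only flip as a side effect of the progress
         * engine handling an event.  Turn opal_progress() into a
         * no-op and this loop spins forever.                          */
        opal_progress();
    }
}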

The only solution to this problem is to suck it up and audit all the code
to eliminate calls to opal_progress() in situations where infinite
recursion can result. It's going to be long and painful, but there's no
quick fix (IMHO).
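
For concreteness, the kind of depth counter Jeff floats below would look
roughly like this hypothetical sketch (MAX_PROGRESS_DEPTH and
run_event_loop_once() are made-up names, not the real opal_progress()
internals); the early return is exactly where a waiter higher up the
stack starts spinning forever:

/* Hypothetical sketch of a recursion cap in the progress engine -- not
 * a patch against the real opal_progress().                           */
#define MAX_PROGRESS_DEPTH 32

extern int run_event_loop_once(void);   /* stand-in for the event loop */

static int progress_depth = 0;

int capped_progress(void)
{
    if (progress_depth >= MAX_PROGRESS_DEPTH) {
        /* Return "success" without doing anything.  Any caller that is
         * waiting on a state change driven by the event loop now waits
         * forever -- the deadlock described above.                     */
        return 0;
    }

    progress_depth++;
    int events = run_event_loop_once();
    progress_depth--;
    return events;
}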

Brian

On Wed, 27 Feb 2008, Jeff Squyres wrote:

> Gleb / George --
>
> Is there an easy way for us to put a cap on max recursion down in
> opal_progress? Just put a counter in opal_progress() such that if
> it exceeds some max value, return success without doing anything (if
> opal_progress_event_flag indicates that nothing *needs* to be done)?
>
> This wouldn't *guarantee* that we cap recursion (because we sometimes
> need progress before returning), but it might help for the cases that
> are described in this mail (we're eagerly progressing when we don't
> necessarily need to)...?
>
>
> On Feb 22, 2008, at 9:48 AM, John Markus Bjørndalen wrote:
>
>> Hi,
>>
>> I ran into a bug when running a few microbenchmarks for OpenMPI. I had
>> thrown in Reduce and Gather for sanity checking, but OpenMPI crashed
>> when running those operations. Usually, this would happen when I
>> reached
>> around 12-16 nodes.
>>
>> My current crash-test code looks like this (I've removed a few lines
>> that were commented out):
>>
>> ----------- snip-------------
>> #include <mpi.h>
>> #include <stdlib.h>
>> #include <stdio.h>
>> #include <assert.h>
>> #include <unistd.h>
>>
>> int main(int argc, char *argv[])
>> {
>>     int rank, size, count = 1;
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>
>>     if (argc > 1)
>>         count = atoi(argv[1]);
>>
>>     int n, i;
>>     // Just make sure we have plenty of buffer for any operation
>>     int *sbuf = malloc(sizeof(int) * 2 * count);
>>     int *rbuf = malloc(sizeof(int) * 2 * count);
>>     assert(sbuf);
>>     assert(rbuf);
>>
>>     for (n = 1; n <= 10000; n += 100) {
>>         printf("N = %d\n", n);
>>         fflush(stdout);
>>         for (i = 0; i < n; i++) {
>>             MPI_Reduce(sbuf, rbuf, count, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
>>         }
>>         MPI_Barrier(MPI_COMM_WORLD);
>>         printf(" -- DONE\n");
>>         fflush(stdout);
>>         MPI_Barrier(MPI_COMM_WORLD);
>>     }
>>     MPI_Finalize();
>>     return 0;
>> }
>> ----------- snip-------------
>>
>>
>> I can usually trigger a crash with count=1 and n=1000 using 16+ nodes,
>> but I can also trigger it with 44 nodes and larger packets (around 32k
>> ints, I think). I can also crash it on a single host using 19
>> processes, but then it usually doesn't crash until I reach somewhere
>> between 1200-3000 iterations. Gather seems to have the same problems
>> as Reduce.
>>
>> The output from running gdb on the coredump looks like this:
>>
>> ----------- snip-------------
>> Using host libthread_db library "/lib/tls/libthread_db.so.1".
>> Core was generated by `./ompi-crash2'.
>> Program terminated with signal 11, Segmentation fault.
>> #0 0x00434184 in sysconf () from /lib/tls/libc.so.6
>> #0 0x00434184 in sysconf () from /lib/tls/libc.so.6
>> #1 0xb7e78b59 in _int_malloc () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>> #2 0xb7e799ce in malloc () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>> #3 0xb7f04852 in ompi_free_list_grow () from /home/johnm/local/ompi/lib/libmpi.so.0
>> #4 0xb7d74e70 in mca_btl_tcp_endpoint_recv_handler () from /home/johnm/local/ompi/lib/openmpi/mca_btl_tcp.so
>> #5 0xb7e62b44 in opal_event_base_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>> #6 0xb7e62cff in opal_event_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>> #7 0xb7e5d284 in opal_progress () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>> #8 0xb7d74f08 in mca_btl_tcp_endpoint_recv_handler () from /home/johnm/local/ompi/lib/openmpi/mca_btl_tcp.so
>> #9 0xb7e62b44 in opal_event_base_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>> #10 0xb7e62cff in opal_event_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>> #11 0xb7e5d284 in opal_progress () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>> #12 0xb7d74f08 in mca_btl_tcp_endpoint_recv_handler () from /home/johnm/local/ompi/lib/openmpi/mca_btl_tcp.so
>> #13 0xb7e62b44 in opal_event_base_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>> #14 0xb7e62cff in opal_event_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>>
>> ... and then continues until...
>>
>> #1356848 0xb7e5d284 in opal_progress () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>> #1356849 0xb7d8f389 in mca_pml_ob1_recv_frag_match () from /home/johnm/local/ompi/lib/openmpi/mca_pml_ob1.so
>> #1356850 0xb7d74a7d in mca_btl_tcp_endpoint_recv_handler () from /home/johnm/local/ompi/lib/openmpi/mca_btl_tcp.so
>> #1356851 0xb7e62b44 in opal_event_base_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>> #1356852 0xb7e62cff in opal_event_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>> #1356853 0xb7e5d284 in opal_progress () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>> #1356854 0xb7d8f389 in mca_pml_ob1_recv_frag_match () from /home/johnm/local/ompi/lib/openmpi/mca_pml_ob1.so
>> #1356855 0xb7d74a7d in mca_btl_tcp_endpoint_recv_handler () from /home/johnm/local/ompi/lib/openmpi/mca_btl_tcp.so
>> #1356856 0xb7e62b44 in opal_event_base_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>> #1356857 0xb7e62cff in opal_event_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>> #1356858 0xb7e5d284 in opal_progress () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>> #1356859 0xb7d8f389 in mca_pml_ob1_recv_frag_match () from /home/johnm/local/ompi/lib/openmpi/mca_pml_ob1.so
>> #1356860 0xb7d74a7d in mca_btl_tcp_endpoint_recv_handler () from /home/johnm/local/ompi/lib/openmpi/mca_btl_tcp.so
>> #1356861 0xb7e62b44 in opal_event_base_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>> #1356862 0xb7e62cff in opal_event_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>> #1356863 0xb7e5d284 in opal_progress () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>> #1356864 0xb7d8cb69 in mca_pml_ob1_recv () from /home/johnm/local/ompi/lib/openmpi/mca_pml_ob1.so
>> #1356865 0xb7d5bb1c in ompi_coll_tuned_reduce_intra_basic_linear () from /home/johnm/local/ompi/lib/openmpi/mca_coll_tuned.so
>> #1356866 0xb7d55913 in ompi_coll_tuned_reduce_intra_dec_fixed () from /home/johnm/local/ompi/lib/openmpi/mca_coll_tuned.so
>> #1356867 0xb7f3db6c in PMPI_Reduce () from /home/johnm/local/ompi/lib/libmpi.so.0
>> #1356868 0x0804899e in main (argc=1, argv=0xbfba8a84) at ompi-crash2.c:58
>> ----------- snip-------------
>>
>> I poked around in the code, and it looks like the culprit might be in
>> the macros that try to allocate fragments in mca_pml_ob1_recv_frag_match:
>> MCA_PML_OB1_RECV_FRAG_ALLOC and MCA_PML_OB1_RECV_FRAG_INIT use
>> OMPI_FREE_LIST_WAIT, which in turn can end up calling
>> opal_condition_wait(). opal_condition_wait() calls opal_progress() to
>> "block", which looks like it leads to infinite recursion in this case.
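
To make the cycle John traces easier to follow, here is a heavily
simplified sketch; the function names below are stand-ins, not the real
Open MPI signatures, and the bodies only show the call cycle:

/* Heavily simplified sketch of the cycle described above -- the real
 * Open MPI functions take more arguments and do much more work.       */
void progress_sketch(void);

void tcp_recv_handler_sketch(void)
{
    /* Needs a fragment from the free list; if none is available, the
     * WAIT path "blocks" by driving the progress engine ...           */
    progress_sketch();
}

void progress_sketch(void)
{
    /* ... which runs the event loop, which can fire the very same TCP
     * receive handler again for the next incoming fragment ...        */
    tcp_recv_handler_sketch();
    /* ... so a node flooded with fragments recurses until the stack
     * is exhausted, as in the ~1.35-million-frame trace above.        */
}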
>>
>> I guess the problem is a race condition when one node is hammered with
>> incoming packets.
>>
>> The stack trace contains about 1.35 million lines, so I won't include
>> all of it here, but here are some statistics to verify that not much
>> else is happening in that stack (I can make the full trace available
>> if anybody needs it):
>>
>> ----------- snip-------------
>> Number of callframes: 1356870
>> Called function statistics (how often in stackdump):
>> PMPI_Reduce 1
>> _int_malloc 1
>> main 1
>> malloc 1
>> mca_btl_tcp_endpoint_recv_handler 339197
>> mca_pml_ob1_recv 1
>> mca_pml_ob1_recv_frag_match 72
>> ompi_coll_tuned_reduce_intra_basic_linear 1
>> ompi_coll_tuned_reduce_intra_dec_fixed 1
>> ompi_free_list_grow 1
>> opal_event_base_loop 339197
>> opal_event_loop 339197
>> opal_progress 339197
>> sysconf 2
>> Address statistics (how often in stackdump), plus functions with that addr (sanity check):
>> 0x00434184 2 set(['sysconf'])
>> 0x0804899e 1 set(['main'])
>> 0xb7d55913 1 set(['ompi_coll_tuned_reduce_intra_dec_fixed'])
>> 0xb7d5bb1c 1 set(['ompi_coll_tuned_reduce_intra_basic_linear'])
>> 0xb7d74a7d 72 set(['mca_btl_tcp_endpoint_recv_handler'])
>> 0xb7d74e70 1 set(['mca_btl_tcp_endpoint_recv_handler'])
>> 0xb7d74f08 339124 set(['mca_btl_tcp_endpoint_recv_handler'])
>> 0xb7d8cb69 1 set(['mca_pml_ob1_recv'])
>> 0xb7d8f389 72 set(['mca_pml_ob1_recv_frag_match'])
>> 0xb7e5d284 339197 set(['opal_progress'])
>> 0xb7e62b44 339197 set(['opal_event_base_loop'])
>> 0xb7e62cff 339197 set(['opal_event_loop'])
>> 0xb7e78b59 1 set(['_int_malloc'])
>> 0xb7e799ce 1 set(['malloc'])
>> 0xb7f04852 1 set(['ompi_free_list_grow'])
>> 0xb7f3db6c 1 set(['PMPI_Reduce'])
>> ----------- snip-------------
>>
>> I don't have any suggestions for a fix though, since this is the first
>> time I've looked into the OpenMPI code.
>>
>> Btw. In case it makes a difference for triggering the bug: I'm running
>> this on a cluster with 1 frontend and 44 nodes. The cluster runs Rocks
>> 4.1, and each of the nodes is a 3.2GHz P4 Prescott machine with 2GB
>> RAM, connected with gigabit Ethernet.
>>
>>
>> Regards,
>>
>> --
>> // John Markus Bjørndalen
>> // http://www.cs.uit.no/~johnm/
>>
>>