
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] OpenMPI 1.2.5 race condition / core dump with MPI_Reduce and MPI_Gather
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-02-27 10:29:03


Gleb / George --

Is there an easy way for us to put a cap on the max recursion depth down
in opal_progress? Just put a counter in opal_progress() such that if it
exceeds some max value, we return success without doing anything (if
opal_progress_event_flag indicates that nothing *needs* to be done)?

This wouldn't *guarantee* that we cap recursion (because we sometimes
need progress before returning), but it might help for the cases that
are described in this mail (we're eagerly progressing when we don't
necessarily need to)...?

On Feb 22, 2008, at 9:48 AM, John Markus Bjørndalen wrote:

> Hi,
>
> I ran into a bug when running a few microbenchmarks for OpenMPI. I had
> thrown in Reduce and Gather for sanity checking, but OpenMPI crashed
> when running those operations. Usually, this would happen when I
> reached around 12-16 nodes.
>
> My current crash-test code looks like this (I've removed a few lines
> that were commented out):
>
> ----------- snip-------------
> #include <mpi.h>
> #include <stdlib.h>
> #include <stdio.h>
> #include <assert.h>
> #include <unistd.h>
>
> int main(int argc, char *argv[])
> {
>     int rank, size, count = 1;
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>     if (argc > 1)
>         count = atoi(argv[1]);
>
>     int n, i;
>     // Just make sure we have plenty of buffer for any operation
>     int *sbuf = malloc(sizeof(int) * 2 * count);
>     int *rbuf = malloc(sizeof(int) * 2 * count);
>     assert(sbuf);
>     assert(rbuf);
>
>     for (n = 1; n <= 10000; n += 100) {
>         printf("N = %d\n", n);
>         fflush(stdout);
>         for (i = 0; i < n; i++) {
>             MPI_Reduce(sbuf, rbuf, count, MPI_INT, MPI_SUM, 0,
>                        MPI_COMM_WORLD);
>         }
>         MPI_Barrier(MPI_COMM_WORLD);
>         printf(" -- DONE\n");
>         fflush(stdout);
>         MPI_Barrier(MPI_COMM_WORLD);
>     }
>     MPI_Finalize();
>     return 0;
> }
> ----------- snip-------------
>
>
> I can usually trigger a crash with count=1 and n=1000 using 16+ nodes,
> but I can also trigger it with 44 nodes and larger packets (around 32k
> ints, I think). I can also crash it on a single host using 19
> processes, but then it usually doesn't crash until I reach somewhere
> between 1200-3000 iterations. Gather seems to have the same problems
> as Reduce.
>
> The output from running gdb on the coredump looks like this:
>
> ----------- snip-------------
> Using host libthread_db library "/lib/tls/libthread_db.so.1".
> Core was generated by `./ompi-crash2'.
> Program terminated with signal 11, Segmentation fault.
> #0 0x00434184 in sysconf () from /lib/tls/libc.so.6
> #0 0x00434184 in sysconf () from /lib/tls/libc.so.6
> #1 0xb7e78b59 in _int_malloc () from /home/johnm/local/ompi/lib/libopen-pal.so.0
> #2 0xb7e799ce in malloc () from /home/johnm/local/ompi/lib/libopen-pal.so.0
> #3 0xb7f04852 in ompi_free_list_grow () from /home/johnm/local/ompi/lib/libmpi.so.0
> #4 0xb7d74e70 in mca_btl_tcp_endpoint_recv_handler () from /home/johnm/local/ompi/lib/openmpi/mca_btl_tcp.so
> #5 0xb7e62b44 in opal_event_base_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
> #6 0xb7e62cff in opal_event_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
> #7 0xb7e5d284 in opal_progress () from /home/johnm/local/ompi/lib/libopen-pal.so.0
> #8 0xb7d74f08 in mca_btl_tcp_endpoint_recv_handler () from /home/johnm/local/ompi/lib/openmpi/mca_btl_tcp.so
> #9 0xb7e62b44 in opal_event_base_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
> #10 0xb7e62cff in opal_event_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
> #11 0xb7e5d284 in opal_progress () from /home/johnm/local/ompi/lib/libopen-pal.so.0
> #12 0xb7d74f08 in mca_btl_tcp_endpoint_recv_handler () from /home/johnm/local/ompi/lib/openmpi/mca_btl_tcp.so
> #13 0xb7e62b44 in opal_event_base_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
> #14 0xb7e62cff in opal_event_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
>
> ... and then continues until...
>
> #1356848 0xb7e5d284 in opal_progress () from /home/johnm/local/ompi/lib/libopen-pal.so.0
> #1356849 0xb7d8f389 in mca_pml_ob1_recv_frag_match () from /home/johnm/local/ompi/lib/openmpi/mca_pml_ob1.so
> #1356850 0xb7d74a7d in mca_btl_tcp_endpoint_recv_handler () from /home/johnm/local/ompi/lib/openmpi/mca_btl_tcp.so
> #1356851 0xb7e62b44 in opal_event_base_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
> #1356852 0xb7e62cff in opal_event_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
> #1356853 0xb7e5d284 in opal_progress () from /home/johnm/local/ompi/lib/libopen-pal.so.0
> #1356854 0xb7d8f389 in mca_pml_ob1_recv_frag_match () from /home/johnm/local/ompi/lib/openmpi/mca_pml_ob1.so
> #1356855 0xb7d74a7d in mca_btl_tcp_endpoint_recv_handler () from /home/johnm/local/ompi/lib/openmpi/mca_btl_tcp.so
> #1356856 0xb7e62b44 in opal_event_base_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
> #1356857 0xb7e62cff in opal_event_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
> #1356858 0xb7e5d284 in opal_progress () from /home/johnm/local/ompi/lib/libopen-pal.so.0
> #1356859 0xb7d8f389 in mca_pml_ob1_recv_frag_match () from /home/johnm/local/ompi/lib/openmpi/mca_pml_ob1.so
> #1356860 0xb7d74a7d in mca_btl_tcp_endpoint_recv_handler () from /home/johnm/local/ompi/lib/openmpi/mca_btl_tcp.so
> #1356861 0xb7e62b44 in opal_event_base_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
> #1356862 0xb7e62cff in opal_event_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
> #1356863 0xb7e5d284 in opal_progress () from /home/johnm/local/ompi/lib/libopen-pal.so.0
> #1356864 0xb7d8cb69 in mca_pml_ob1_recv () from /home/johnm/local/ompi/lib/openmpi/mca_pml_ob1.so
> #1356865 0xb7d5bb1c in ompi_coll_tuned_reduce_intra_basic_linear () from /home/johnm/local/ompi/lib/openmpi/mca_coll_tuned.so
> #1356866 0xb7d55913 in ompi_coll_tuned_reduce_intra_dec_fixed () from /home/johnm/local/ompi/lib/openmpi/mca_coll_tuned.so
> #1356867 0xb7f3db6c in PMPI_Reduce () from /home/johnm/local/ompi/lib/libmpi.so.0
> #1356868 0x0804899e in main (argc=1, argv=0xbfba8a84) at ompi-crash2.c:58
> ----------- snip-------------
>
> I poked around in the code, and it looks like the culprit might be in
> the macros that try to allocate fragments in
> mca_pml_ob1_recv_frag_match: MCA_PML_OB1_RECV_FRAG_ALLOC and
> MCA_PML_OB1_RECV_FRAG_INIT use OMPI_FREE_LIST_WAIT, which in turn can
> end up calling opal_condition_wait(). opal_condition_wait() calls
> opal_progress() to "block", which looks like it leads to unbounded
> recursion in this case.
>
> I guess the problem is a race condition when one node is hammered with
> incoming packets.
>
> The stack trace contains about 1.35 million lines, so I won't include
> all of it here, but here are some statistics to verify that not much
> else is happening in that stack (I can make the full trace available
> if anybody needs it):
>
> ----------- snip-------------
> Number of callframes: 1356870
> Called function statistics (how often in stackdump):
> PMPI_Reduce 1
> _int_malloc 1
> main 1
> malloc 1
> mca_btl_tcp_endpoint_recv_handler 339197
> mca_pml_ob1_recv 1
> mca_pml_ob1_recv_frag_match 72
> ompi_coll_tuned_reduce_intra_basic_linear 1
> ompi_coll_tuned_reduce_intra_dec_fixed 1
> ompi_free_list_grow 1
> opal_event_base_loop 339197
> opal_event_loop 339197
> opal_progress 339197
> sysconf 2
> Address statistics (how often in stackdump), plus functions with that
> addr (sanity check):
> 0x00434184      2  set(['sysconf'])
> 0x0804899e      1  set(['main'])
> 0xb7d55913      1  set(['ompi_coll_tuned_reduce_intra_dec_fixed'])
> 0xb7d5bb1c      1  set(['ompi_coll_tuned_reduce_intra_basic_linear'])
> 0xb7d74a7d     72  set(['mca_btl_tcp_endpoint_recv_handler'])
> 0xb7d74e70      1  set(['mca_btl_tcp_endpoint_recv_handler'])
> 0xb7d74f08 339124  set(['mca_btl_tcp_endpoint_recv_handler'])
> 0xb7d8cb69      1  set(['mca_pml_ob1_recv'])
> 0xb7d8f389     72  set(['mca_pml_ob1_recv_frag_match'])
> 0xb7e5d284 339197  set(['opal_progress'])
> 0xb7e62b44 339197  set(['opal_event_base_loop'])
> 0xb7e62cff 339197  set(['opal_event_loop'])
> 0xb7e78b59      1  set(['_int_malloc'])
> 0xb7e799ce      1  set(['malloc'])
> 0xb7f04852      1  set(['ompi_free_list_grow'])
> 0xb7f3db6c      1  set(['PMPI_Reduce'])
>
> I don't have any suggestions for a fix though, since this is the first
> time I've looked into the OpenMPI code.
>
> Btw. in case it makes a difference for triggering the bug: I'm running
> this on a cluster with 1 frontend and 44 nodes. The cluster runs Rocks
> 4.1, and each of the nodes is a 3.2GHz P4 Prescott machine with 2GB
> RAM, connected with gigabit Ethernet.
>
>
> Regards,
>
> --
> // John Markus Bjørndalen
> // http://www.cs.uit.no/~johnm/
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
Cisco Systems