
Subject: Re: [OMPI users] deadlock in openmpi 1.5rc5
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-08-09 17:42:26


I've opened a ticket about this -- if it's an actual problem, it's a 1.5 blocker:

    https://svn.open-mpi.org/trac/ompi/ticket/2530

What version of knem and Linux are you using?
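
(For reference, one quick way to gather that -- assuming knem was installed as the usual out-of-tree kernel module -- is something like:

$ uname -r
$ modinfo knem | grep -i version

The first reports the running kernel; the second reports the knem module version, if the module is present.)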

On Aug 9, 2010, at 4:50 PM, John Hsu wrote:

> problem "fixed" by adding the --mca btl_sm_use_knem 0 option (with -npernode 11), so I proceeded to bump up -npernode to 12:
>
> $ ../openmpi_devel/bin/mpirun -hostfile hostfiles/hostfile.wgsgX -npernode 12 --mca btl_sm_use_knem 0 ./bin/mpi_test
>
> and the same hang occurs:
>
> (gdb) bt
> #0 0x00007fcca6ae5cf3 in epoll_wait () from /lib/libc.so.6
> #1 0x00007fcca7e5ea4b in epoll_dispatch ()
> from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> #2 0x00007fcca7e665fa in opal_event_base_loop ()
> from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> #3 0x00007fcca7e37e69 in opal_progress ()
> from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> #4 0x00007fcca15b6e95 in mca_pml_ob1_recv ()
> from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/openmpi/mca_pml_ob1.so
> #5 0x00007fcca7dd635c in PMPI_Recv ()
> from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> #6 0x000000000040ae48 in MPI::Comm::Recv (this=0x612800, buf=0x7fff2a0d7e00,
> count=1, datatype=..., source=23, tag=100, status=...)
> at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
> #7 0x0000000000409a57 in main (argc=1, argv=0x7fff2a0d8028)
> at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:30
> (gdb)
>
>
> (gdb) bt
> #0 0x00007f5dc31d2cf3 in epoll_wait () from /lib/libc.so.6
> #1 0x00007f5dc454ba4b in epoll_dispatch ()
> from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> #2 0x00007f5dc45535fa in opal_event_base_loop ()
> from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> #3 0x00007f5dc4524e69 in opal_progress ()
> from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> #4 0x00007f5dbdca4b1d in mca_pml_ob1_send ()
> from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/openmpi/mca_pml_ob1.so
> #5 0x00007f5dc44c574f in PMPI_Send ()
> from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> #6 0x000000000040adda in MPI::Comm::Send (this=0x612800, buf=0x7fff6e0c0790,
> count=1, datatype=..., dest=0, tag=100)
> at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
> #7 0x0000000000409b72 in main (argc=1, argv=0x7fff6e0c09b8)
> at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:38
> (gdb)
>
>
>
>
> On Mon, Aug 9, 2010 at 6:39 AM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> In your first mail, you mentioned that you are testing the new knem support.
>
> Can you try disabling knem and see if that fixes the problem? (i.e., run with "--mca btl_sm_use_knem 0") If it fixes the issue, that might mean we have a knem-based bug.
>
>
>
> On Aug 6, 2010, at 1:42 PM, John Hsu wrote:
>
> > Hi,
> >
> > Sorry for the confusion; that was indeed the trunk version I was running.
> >
> > Here's the same problem using
> >
> > http://www.open-mpi.org/software/ompi/v1.5/downloads/openmpi-1.5rc5.tar.bz2
> >
> > command-line:
> >
> > ../openmpi_devel/bin/mpirun -hostfile hostfiles/hostfile.wgsgX -npernode 11 ./bin/mpi_test
> >
> > back trace on the receiver (rank 0):
> >
> > (gdb) bt
> > #0 0x00007fa003bcacf3 in epoll_wait () from /lib/libc.so.6
> > #1 0x00007fa004f43a4b in epoll_dispatch ()
> > from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #2 0x00007fa004f4b5fa in opal_event_base_loop ()
> > from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #3 0x00007fa004f1ce69 in opal_progress ()
> > from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #4 0x00007f9ffe69be95 in mca_pml_ob1_recv ()
> > from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/openmpi/mca_pml_ob1.so
> > #5 0x00007fa004ebb35c in PMPI_Recv ()
> > from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #6 0x000000000040ae48 in MPI::Comm::Recv (this=0x612800, buf=0x7fff8f5cbb50, count=1, datatype=..., source=29,
> > tag=100, status=...)
> > at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
> > #7 0x0000000000409a57 in main (argc=1, argv=0x7fff8f5cbd78)
> > at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:30
> > (gdb)
> >
> > back trace on the sender:
> >
> > (gdb) bt
> > #0 0x00007fcce1ba5cf3 in epoll_wait () from /lib/libc.so.6
> > #1 0x00007fcce2f1ea4b in epoll_dispatch ()
> > from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #2 0x00007fcce2f265fa in opal_event_base_loop ()
> > from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #3 0x00007fcce2ef7e69 in opal_progress ()
> > from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #4 0x00007fccdc677b1d in mca_pml_ob1_send ()
> > from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/openmpi/mca_pml_ob1.so
> > #5 0x00007fcce2e9874f in PMPI_Send ()
> > from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #6 0x000000000040adda in MPI::Comm::Send (this=0x612800, buf=0x7fff3f18ad20, count=1, datatype=..., dest=0, tag=100)
> > at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
> > #7 0x0000000000409b72 in main (argc=1, argv=0x7fff3f18af48)
> > at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:38
> > (gdb)
> >
> > and attached is my mpi_test file for reference.
> >
> > thanks,
> > John
> >
> >
> > On Fri, Aug 6, 2010 at 6:24 AM, Ralph Castain <rhc_at_[hidden]> wrote:
> > You clearly have an issue with version confusion. The file cited in your warning:
> >
> > > [wgsg0:29074] Warning -- mutex was double locked from errmgr_hnp.c:772
> >
> > does not exist in 1.5rc5. It only exists in the developer's trunk at this time. Check to ensure you have the right paths set, blow away the install area (in case you have multiple versions installed on top of each other), etc.
> >
> >
> >
> > On Aug 5, 2010, at 5:16 PM, John Hsu wrote:
> >
> > > Hi All,
> > > I am new to Open MPI and have encountered an issue using pre-release 1.5rc5 with a simple MPI test code (see attached). In this test, ranks 1 to n each send a random number to rank 0, and rank 0 sums all the numbers received.
> > >
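> > > In outline, the test does roughly the following (this is just a sketch to make the report self-contained; the exact counts, datatype, and tag are in the attachment):
> > >
> > >   #include <mpi.h>
> > >   #include <cstdlib>
> > >
> > >   int main(int argc, char **argv)
> > >   {
> > >     MPI::Init(argc, argv);
> > >     int rank = MPI::COMM_WORLD.Get_rank();
> > >     int size = MPI::COMM_WORLD.Get_size();
> > >     if (rank == 0) {
> > >       // rank 0 collects one number from every other rank and sums them
> > >       double sum = 0;
> > >       for (int src = 1; src < size; ++src) {
> > >         double val;
> > >         MPI::COMM_WORLD.Recv(&val, 1, MPI::DOUBLE, src, 100);
> > >         sum += val;
> > >       }
> > >     } else {
> > >       // every other rank sends a single random number to rank 0
> > >       double val = std::rand();
> > >       MPI::COMM_WORLD.Send(&val, 1, MPI::DOUBLE, 0, 100);
> > >     }
> > >     MPI::Finalize();
> > >     return 0;
> > >   }
> > >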
> > > This code works fine on one machine with any number of processes, and on 3 machines running 10 processes per machine, but when we try to run 11 processes per machine this warning appears:
> > >
> > > [wgsg0:29074] Warning -- mutex was double locked from errmgr_hnp.c:772
> > >
> > > And rank 0 (the master summing process) hangs in its receive, while another, seemingly random rank hangs in its send, indefinitely. Below are the back traces:
> > >
> > > (gdb) bt
> > > #0 0x00007f0c5f109cd3 in epoll_wait () from /lib/libc.so.6
> > > #1 0x00007f0c6052db53 in epoll_dispatch (base=0x2310bf0, arg=0x22f91f0, tv=0x7fff90f623e0) at epoll.c:215
> > > #2 0x00007f0c6053ae58 in opal_event_base_loop (base=0x2310bf0, flags=2) at event.c:838
> > > #3 0x00007f0c6053ac27 in opal_event_loop (flags=2) at event.c:766
> > > #4 0x00007f0c604ebb5a in opal_progress () at runtime/opal_progress.c:189
> > > #5 0x00007f0c59b79cb1 in opal_condition_wait (c=0x7f0c608003a0, m=0x7f0c60800400) at ../../../../opal/threads/
> > > condition.h:99
> > > #6 0x00007f0c59b79dff in ompi_request_wait_completion (req=0x2538d80) at ../../../../ompi/request/request.h:377
> > > #7 0x00007f0c59b7a8d7 in mca_pml_ob1_recv (addr=0x7fff90f626a0, count=1, datatype=0x612600, src=45, tag=100, comm=0x7f0c607f2b40,
> > > status=0x7fff90f62668) at pml_ob1_irecv.c:104
> > > #8 0x00007f0c60425dbc in PMPI_Recv (buf=0x7fff90f626a0, count=1, type=0x612600, source=45, tag=100, comm=0x7f0c607f2b40, status=0x7fff90f62668)
> > > at precv.c:78
> > > #9 0x000000000040ae14 in MPI::Comm::Recv (this=0x612800, buf=0x7fff90f626a0, count=1, datatype=..., source=45, tag=100, status=...)
> > > at /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
> > > #10 0x0000000000409a27 in main (argc=1, argv=0x7fff90f628c8)
> > > at /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:30
> > > (gdb)
> > >
> > > and for the sender:
> > >
> > > (gdb) bt
> > > #0 0x00007fedb919fcd3 in epoll_wait () from /lib/libc.so.6
> > > #1 0x00007fedba5e0a93 in epoll_dispatch (base=0x2171880, arg=0x216c6e0, tv=0x7ffffa8a4130) at epoll.c:215
> > > #2 0x00007fedba5edde0 in opal_event_base_loop (base=0x2171880, flags=2) at event.c:838
> > > #3 0x00007fedba5edbaf in opal_event_loop (flags=2) at event.c:766
> > > #4 0x00007fedba59c43a in opal_progress () at runtime/opal_progress.c:189
> > > #5 0x00007fedb2796f97 in opal_condition_wait (c=0x7fedba8ba6e0, m=0x7fedba8ba740)
> > > at ../../../../opal/threads/condition.h:99
> > > #6 0x00007fedb279742e in ompi_request_wait_completion (req=0x2392d80) at ../../../../ompi/request/request.h:377
> > > #7 0x00007fedb2798e0c in mca_pml_ob1_send (buf=0x23b6210, count=100, datatype=0x612600, dst=0, tag=100,
> > > sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x7fedba8ace80) at pml_ob1_isend.c:125
> > > #8 0x00007fedba4c9a08 in PMPI_Send (buf=0x23b6210, count=100, type=0x612600, dest=0, tag=100, comm=0x7fedba8ace80)
> > > at psend.c:75
> > > #9 0x000000000040ae52 in MPI::Comm::Send (this=0x612800, buf=0x23b6210, count=100, datatype=..., dest=0, tag=100)
> > > at /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
> > > #10 0x0000000000409bec in main (argc=1, argv=0x7ffffa8a4658)
> > > at /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:42
> > > (gdb)
> > >
> > > The "deadlock" appears to be a machine dependent race condition, different machines fails with different combinations of nodes / machine.
> > >
> > > Below is my command line for reference:
> > >
> > > $ ../openmpi_devel/bin/mpirun -x PATH -hostfile hostfiles/hostfile.wgsgX -npernode 11 -mca btl tcp,sm,self -mca orte_base_help_aggregate 0 -mca opal_debug_locks 1 ./bin/mpi_test
> > >
> > > The problem does not exist in release 1.4.2 or earlier. We are testing the pre-release code for its potential knem benefits, but can fall back to 1.4.2 if necessary.
> > >
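> > > For reference, knem support gets compiled into Open MPI by pointing configure at the knem install tree, roughly like this (the knem path is illustrative, not our exact one):
> > >
> > >   $ ./configure --with-knem=/opt/knem-0.9.2 ...
> > >   $ make all install
> > >
> > > and its use in the sm BTL can then be toggled at run time with the btl_sm_use_knem MCA parameter.
> > >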
> > > My apologies in advance if I've missed something basic; thanks for any help :)
> > >
> > > regards,
> > > John
> > > <test.cpp>
> >
> >
> > <mpi_test.cpp>
>

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/