Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Checkpointing mpi4py program
From: Joshua Hursey (jjhursey_at_[hidden])
Date: 2010-08-12 09:12:26


Can you try this with the current trunk (r23587 or later)?

I just added a number of new features and bug fixes, and I would be interested to see if it fixes the problem. In particular I suspect that this might be related to the Init/Finalize bounding of the checkpoint region.

-- Josh

On Aug 10, 2010, at 2:18 PM, <ananda.mudar_at_[hidden]> wrote:

> Josh
>
> Please find attached the Python program that reproduces the hang I
> described. The initial part of the file describes the prerequisite
> modules and the steps to reproduce the problem. Please let me know if
> you have any questions about reproducing the hang.
>
> Please note that if I add the following lines at the end of the
> program (in the case where sleep_time is True), the problem
> disappears, i.e., the program resumes successfully after a successful
> checkpoint.
> # Add following lines at the end for sleep_time is True
> else:
>     time.sleep(0.1)
> # End of added lines
>
>
> Thanks a lot for your time in looking into this issue.
>
> Regards
> Ananda
>
> Ananda B Mudar, PMP
> Senior Technical Architect
> Wipro Technologies
> Ph: 972 765 8093
> ananda.mudar_at_[hidden]
>
>
> -----Original Message-----
> Date: Mon, 9 Aug 2010 16:37:58 -0400
> From: Joshua Hursey <jjhursey_at_[hidden]>
> Subject: Re: [OMPI users] Checkpointing mpi4py program
> To: Open MPI Users <users_at_[hidden]>
> Message-ID: <270BD450-743A-4662-9568-1FEDFCC6F9C6_at_[hidden]>
> Content-Type: text/plain; charset=windows-1252
>
> I have not tried to checkpoint an mpi4py application, so I cannot say
> for sure if it works or not. You might be hitting something with the
> Python runtime interacting in an odd way with either Open MPI or BLCR.
>
> Can you attach a debugger and get a backtrace on a stuck checkpoint?
> That might show us where things are held up.
>
> -- Josh
>
>
> On Aug 9, 2010, at 4:04 PM, <ananda.mudar_at_[hidden]> wrote:
>
>> Hi
>>
>> I have integrated mpi4py with Open MPI 1.4.2 built with BLCR 0.8.2.
>> When I run ompi-checkpoint on a program written using mpi4py, I see
>> that the program sometimes doesn't resume after successful checkpoint
>> creation. This doesn't happen every time: most of the time the
>> program resumes after a successful checkpoint and completes
>> successfully. Has anyone tested the checkpoint/restart functionality
>> with mpi4py programs? Are there any best practices that I should keep
>> in mind while checkpointing mpi4py programs?
>>
>> Thanks for your time
>> - Ananda
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
>
> ------------------------------
>
> Message: 8
> Date: Mon, 9 Aug 2010 13:50:03 -0700
> From: John Hsu <johnhsu_at_[hidden]>
> Subject: Re: [OMPI users] deadlock in openmpi 1.5rc5
> To: Open MPI Users <users_at_[hidden]>
> Message-ID:
> <AANLkTim63t=wQMeWfHWNnvnVKajOe92e7NG3X=Warwr0_at_[hidden]>
> Content-Type: text/plain; charset="iso-8859-1"
>
> The problem is "fixed" by adding the --mca btl_sm_use_knem 0 option
> (with -npernode 11), so I proceeded to bump -npernode up to 12:
>
> $ ../openmpi_devel/bin/mpirun -hostfile hostfiles/hostfile.wgsgX
> -npernode
> 12 --mca btl_sm_use_knem 0 ./bin/mpi_test
>
> and the same error occurs,
>
> (gdb) bt
> #0  0x00007fcca6ae5cf3 in epoll_wait () from /lib/libc.so.6
> #1  0x00007fcca7e5ea4b in epoll_dispatch ()
>     from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> #2  0x00007fcca7e665fa in opal_event_base_loop ()
>     from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> #3  0x00007fcca7e37e69 in opal_progress ()
>     from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> #4  0x00007fcca15b6e95 in mca_pml_ob1_recv ()
>     from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/openmpi/mca_pml_ob1.so
> #5  0x00007fcca7dd635c in PMPI_Recv ()
>     from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> #6  0x000000000040ae48 in MPI::Comm::Recv (this=0x612800, buf=0x7fff2a0d7e00,
>     count=1, datatype=..., source=23, tag=100, status=...)
>     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
> #7  0x0000000000409a57 in main (argc=1, argv=0x7fff2a0d8028)
>     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:30
> (gdb)
>
>
> (gdb) bt
> #0  0x00007f5dc31d2cf3 in epoll_wait () from /lib/libc.so.6
> #1  0x00007f5dc454ba4b in epoll_dispatch ()
>     from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> #2  0x00007f5dc45535fa in opal_event_base_loop ()
>     from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> #3  0x00007f5dc4524e69 in opal_progress ()
>     from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> #4  0x00007f5dbdca4b1d in mca_pml_ob1_send ()
>     from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/openmpi/mca_pml_ob1.so
> #5  0x00007f5dc44c574f in PMPI_Send ()
>     from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> #6  0x000000000040adda in MPI::Comm::Send (this=0x612800, buf=0x7fff6e0c0790,
>     count=1, datatype=..., dest=0, tag=100)
>     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
> #7  0x0000000000409b72 in main (argc=1, argv=0x7fff6e0c09b8)
>     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:38
> (gdb)
>
>
>
>
> On Mon, Aug 9, 2010 at 6:39 AM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>
>> In your first mail, you mentioned that you are testing the new knem
>> support.
>>
>> Can you try disabling knem and see if that fixes the problem? (i.e.,
>> run with "--mca btl_sm_use_knem 0".) If it fixes the issue, that
>> might mean we have a knem-based bug.
>>
>>
>>
>> On Aug 6, 2010, at 1:42 PM, John Hsu wrote:
>>
>>> Hi,
>>>
>>> sorry for the confusion, that was indeed the trunk version of things
> I
>> was running.
>>>
>>> Here's the same problem using
>>>
>>>
>>
> http://www.open-mpi.org/software/ompi/v1.5/downloads/openmpi-1.5rc5.tar.bz2
>>>
>>> command-line:
>>>
>>> ../openmpi_devel/bin/mpirun -hostfile hostfiles/hostfile.wgsgX
> -npernode
>> 11 ./bin/mpi_test
>>>
>>> back trace on sender:
>>>
>>> (gdb) bt
>>> #0 0x00007fa003bcacf3 in epoll_wait () from /lib/libc.so.6
>>> #1 0x00007fa004f43a4b in epoll_dispatch ()
>>> from
>>
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #2 0x00007fa004f4b5fa in opal_event_base_loop ()
>>> from
>>
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #3 0x00007fa004f1ce69 in opal_progress ()
>>> from
>>
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #4 0x00007f9ffe69be95 in mca_pml_ob1_recv ()
>>> from
>>
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/openmpi/mca_pml_ob1.so
>>> #5 0x00007fa004ebb35c in PMPI_Recv ()
>>> from
>>
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #6 0x000000000040ae48 in MPI::Comm::Recv (this=0x612800,
>> buf=0x7fff8f5cbb50, count=1, datatype=..., source=29,
>>> tag=100, status=...)
>>> at
>>
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
>>> #7 0x0000000000409a57 in main (argc=1, argv=0x7fff8f5cbd78)
>>> at
>>
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/mpi_test/src/mpi_test.cpp:30
>>> (gdb)
>>>
>>> back trace on receiver:
>>>
>>> (gdb) bt
>>> #0 0x00007fcce1ba5cf3 in epoll_wait () from /lib/libc.so.6
>>> #1 0x00007fcce2f1ea4b in epoll_dispatch ()
>>> from
>>
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #2 0x00007fcce2f265fa in opal_event_base_loop ()
>>> from
>>
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #3 0x00007fcce2ef7e69 in opal_progress ()
>>> from
>>
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #4 0x00007fccdc677b1d in mca_pml_ob1_send ()
>>> from
>>
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/openmpi/mca_pml_ob1.so
>>> #5 0x00007fcce2e9874f in PMPI_Send ()
>>> from
>>
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #6 0x000000000040adda in MPI::Comm::Send (this=0x612800,
>> buf=0x7fff3f18ad20, count=1, datatype=..., dest=0, tag=100)
>>> at
>>
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
>>> #7 0x0000000000409b72 in main (argc=1, argv=0x7fff3f18af48)
>>> at
>>
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/mpi_test/src/mpi_test.cpp:38
>>> (gdb)
>>>
>>> and attached is my mpi_test file for reference.
>>>
>>> thanks,
>>> John
>>>
>>>
>>> On Fri, Aug 6, 2010 at 6:24 AM, Ralph Castain <rhc_at_[hidden]>
> wrote:
>>> You clearly have an issue with version confusion. The file cited in
> your
>> warning:
>>>
>>>> [wgsg0:29074] Warning -- mutex was double locked from
> errmgr_hnp.c:772
>>>
>>> does not exist in 1.5rc5. It only exists in the developer's trunk at
> this
>> time. Check to ensure you have the right paths set, blow away the
> install
>> area (in case you have multiple versions installed on top of each
> other),
>> etc.
>>>
>>>
>>>
>>> On Aug 5, 2010, at 5:16 PM, John Hsu wrote:
>>>
>>>> Hi All,
>>>> I am new to Open MPI and have encountered an issue using pre-release
>>>> 1.5rc5 with a simple MPI code (see attached). In this test, nodes 1
>>>> to n send out a random number to node 0, and node 0 sums all numbers
>>>> received.
>>>>
>>>> This code works fine on 1 machine with any number of nodes, and on 3
>>>> machines running 10 nodes per machine, but when we try to run 11
>>>> nodes per machine this warning appears:
>>>>
>>>> [wgsg0:29074] Warning -- mutex was double locked from
> errmgr_hnp.c:772
>>>>
>>>> And node 0 (the master summing node) hangs on receiving while
>>>> another random node hangs on sending, indefinitely. Below are the
>>>> back traces:
>>>>
>>>> (gdb) bt
>>>> #0  0x00007f0c5f109cd3 in epoll_wait () from /lib/libc.so.6
>>>> #1  0x00007f0c6052db53 in epoll_dispatch (base=0x2310bf0, arg=0x22f91f0,
>>>>     tv=0x7fff90f623e0) at epoll.c:215
>>>> #2  0x00007f0c6053ae58 in opal_event_base_loop (base=0x2310bf0, flags=2)
>>>>     at event.c:838
>>>> #3  0x00007f0c6053ac27 in opal_event_loop (flags=2) at event.c:766
>>>> #4  0x00007f0c604ebb5a in opal_progress () at runtime/opal_progress.c:189
>>>> #5  0x00007f0c59b79cb1 in opal_condition_wait (c=0x7f0c608003a0,
>>>>     m=0x7f0c60800400) at ../../../../opal/threads/condition.h:99
>>>> #6  0x00007f0c59b79dff in ompi_request_wait_completion (req=0x2538d80)
>>>>     at ../../../../ompi/request/request.h:377
>>>> #7  0x00007f0c59b7a8d7 in mca_pml_ob1_recv (addr=0x7fff90f626a0, count=1,
>>>>     datatype=0x612600, src=45, tag=100, comm=0x7f0c607f2b40,
>>>>     status=0x7fff90f62668) at pml_ob1_irecv.c:104
>>>> #8  0x00007f0c60425dbc in PMPI_Recv (buf=0x7fff90f626a0, count=1,
>>>>     type=0x612600, source=45, tag=100, comm=0x7f0c607f2b40,
>>>>     status=0x7fff90f62668) at precv.c:78
>>>> #9  0x000000000040ae14 in MPI::Comm::Recv (this=0x612800,
>>>>     buf=0x7fff90f626a0, count=1, datatype=..., source=45, tag=100,
>>>>     status=...)
>>>>     at /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
>>>> #10 0x0000000000409a27 in main (argc=1, argv=0x7fff90f628c8)
>>>>     at /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:30
>>>> (gdb)
>>>>
>>>> and for sender is:
>>>>
>>>> (gdb) bt
>>>> #0  0x00007fedb919fcd3 in epoll_wait () from /lib/libc.so.6
>>>> #1  0x00007fedba5e0a93 in epoll_dispatch (base=0x2171880, arg=0x216c6e0,
>>>>     tv=0x7ffffa8a4130) at epoll.c:215
>>>> #2  0x00007fedba5edde0 in opal_event_base_loop (base=0x2171880, flags=2)
>>>>     at event.c:838
>>>> #3  0x00007fedba5edbaf in opal_event_loop (flags=2) at event.c:766
>>>> #4  0x00007fedba59c43a in opal_progress () at runtime/opal_progress.c:189
>>>> #5  0x00007fedb2796f97 in opal_condition_wait (c=0x7fedba8ba6e0,
>>>>     m=0x7fedba8ba740) at ../../../../opal/threads/condition.h:99
>>>> #6  0x00007fedb279742e in ompi_request_wait_completion (req=0x2392d80)
>>>>     at ../../../../ompi/request/request.h:377
>>>> #7  0x00007fedb2798e0c in mca_pml_ob1_send (buf=0x23b6210, count=100,
>>>>     datatype=0x612600, dst=0, tag=100,
>>>>     sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x7fedba8ace80)
>>>>     at pml_ob1_isend.c:125
>>>> #8  0x00007fedba4c9a08 in PMPI_Send (buf=0x23b6210, count=100,
>>>>     type=0x612600, dest=0, tag=100, comm=0x7fedba8ace80) at psend.c:75
>>>> #9  0x000000000040ae52 in MPI::Comm::Send (this=0x612800, buf=0x23b6210,
>>>>     count=100, datatype=..., dest=0, tag=100)
>>>>     at /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
>>>> #10 0x0000000000409bec in main (argc=1, argv=0x7ffffa8a4658)
>>>>     at /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:42
>>>> (gdb)
>>>>
>>>> The "deadlock" appears to be a machine-dependent race condition:
>>>> different machines fail with different combinations of nodes per
>>>> machine.
>>>>
>>>> Below is my command line for reference:
>>>>
>>>> $ ../openmpi_devel/bin/mpirun -x PATH -hostfile
>> hostfiles/hostfile.wgsgX -npernode 11 -mca btl tcp,sm,self -mca
>> orte_base_help_aggregate 0 -mca opal_debug_locks 1 ./bin/mpi_test
>>>>
>>>> The problem does not exist in release 1.4.2 or earlier. We are
>>>> testing unreleased code for potential knem benefits, but can fall
>>>> back to 1.4.2 if necessary.
>>>>
>>>> My apologies in advance if I've missed something basic, thanks for
> any
>> help :)
>>>>
>>>> regards,
>>>> John
>>>> <test.cpp>_______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>> <mpi_test.cpp>_______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> --
>> Jeff Squyres
>> jsquyres_at_[hidden]
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
> -------------- next part --------------
> HTML attachment scrubbed and removed
>
> ------------------------------
>
> Message: 9
> Date: Mon, 9 Aug 2010 23:02:51 +0200
> From: Riccardo Murri <riccardo.murri_at_[hidden]>
> Subject: Re: [OMPI users] MPI Template Datatype?
> To: Open MPI Users <users_at_[hidden]>
> Message-ID:
> <AANLkTi=Peq+CQ6t+EXaKhwOT=wd0B8VjWc88sHXqrdYw_at_[hidden]>
> Content-Type: text/plain; charset=UTF-8
>
> Hi Alexandru,
>
> you can read all about Boost.MPI at:
>
> http://www.boost.org/doc/libs/1_43_0/doc/html/mpi.html
>
>
> On Mon, Aug 9, 2010 at 10:27 PM, Alexandru Blidaru <alexsb92_at_[hidden]>
> wrote:
>> I basically have to implement a 4D vector. An additional goal of my
> project
>> is to support char, int, float and double datatypes in the vector.
>
> If your "vector" is fixed-size (i.e., all vectors are comprised of
> 4 elements), then you can likely dispense with std::vector and use
> C-style arrays with templated send/receive calls (thin wrappers
> around MPI_Send/MPI_Recv):
>
> // BEWARE: untested code!!!
>
> template <typename T>
> int send(T* vector, int dest, int tag, MPI_Comm comm) {
>     throw std::logic_error("called generic MyVector::send");
> }
>
> template <typename T>
> int recv(T* vector, int source, int tag, MPI_Comm comm) {
>     throw std::logic_error("called generic MyVector::recv");
> }
>
> and then you specialize the template for the types you actually use:
>
> template <>
> int send<double>(double* vector, int dest, int tag, MPI_Comm comm)
> {
>     return MPI_Send(vector, 4, MPI_DOUBLE, dest, tag, comm);
> }
>
> template <>
> int recv<double>(double* vector, int src, int tag, MPI_Comm comm)
> {
>     return MPI_Recv(vector, 4, MPI_DOUBLE, src, tag, comm,
>                     MPI_STATUS_IGNORE);
> }
>
> // etc.
>
> However, let me warn you that it would likely take more time and
> effort to write all the template specializations and get them working
> than to simply use Boost.MPI.
>
> Best regards,
> Riccardo
>
>
> ------------------------------
>
> Message: 10
> Date: Mon, 9 Aug 2010 17:42:26 -0400
> From: Jeff Squyres <jsquyres_at_[hidden]>
> Subject: Re: [OMPI users] deadlock in openmpi 1.5rc5
> To: "Open MPI Users" <users_at_[hidden]>
> Cc: Brice Goglin <Brice.Goglin_at_[hidden]>
> Message-ID: <7283451E-8C4A-4F15-B8E5-649349ABBE0C_at_[hidden]>
> Content-Type: text/plain; charset=us-ascii
>
> I've opened a ticket about this -- if it's an actual problem, it's a 1.5
> blocker:
>
> https://svn.open-mpi.org/trac/ompi/ticket/2530
>
> What version of knem and Linux are you using?
>
>
>
> On Aug 9, 2010, at 4:50 PM, John Hsu wrote:
>
>> problem "fixed" by adding the --mca btl_sm_use_knem 0 option (with
> -npernode 11), so I proceeded to bump up -npernode to 12:
>>
>> $ ../openmpi_devel/bin/mpirun -hostfile hostfiles/hostfile.wgsgX
> -npernode 12 --mca btl_sm_use_knem 0 ./bin/mpi_test
>>
>> and the same error occurs,
>>
>> (gdb) bt
>> #0 0x00007fcca6ae5cf3 in epoll_wait () from /lib/libc.so.6
>> #1 0x00007fcca7e5ea4b in epoll_dispatch ()
>> from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>> #2 0x00007fcca7e665fa in opal_event_base_loop ()
>> from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>> #3 0x00007fcca7e37e69 in opal_progress ()
>> from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>> #4 0x00007fcca15b6e95 in mca_pml_ob1_recv ()
>> from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/openmpi/mca_pml_ob1.so
>> #5 0x00007fcca7dd635c in PMPI_Recv ()
>> from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>> #6 0x000000000040ae48 in MPI::Comm::Recv (this=0x612800,
> buf=0x7fff2a0d7e00,
>> count=1, datatype=..., source=23, tag=100, status=...)
>> at
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
>> #7 0x0000000000409a57 in main (argc=1, argv=0x7fff2a0d8028)
>> at
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/mpi_test/src/mpi_test.cpp:30
>> (gdb)
>>
>>
>> (gdb) bt
>> #0 0x00007f5dc31d2cf3 in epoll_wait () from /lib/libc.so.6
>> #1 0x00007f5dc454ba4b in epoll_dispatch ()
>> from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>> #2 0x00007f5dc45535fa in opal_event_base_loop ()
>> from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>> #3 0x00007f5dc4524e69 in opal_progress ()
>> from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>> #4 0x00007f5dbdca4b1d in mca_pml_ob1_send ()
>> from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/openmpi/mca_pml_ob1.so
>> #5 0x00007f5dc44c574f in PMPI_Send ()
>> from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>> #6 0x000000000040adda in MPI::Comm::Send (this=0x612800,
> buf=0x7fff6e0c0790,
>> count=1, datatype=..., dest=0, tag=100)
>> at
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
>> #7 0x0000000000409b72 in main (argc=1, argv=0x7fff6e0c09b8)
>> at
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/mpi_test/src/mpi_test.cpp:38
>> (gdb)
>>
>>
>>
>>
>> On Mon, Aug 9, 2010 at 6:39 AM, Jeff Squyres <jsquyres_at_[hidden]>
> wrote:
>> In your first mail, you mentioned that you are testing the new knem
> support.
>>
>> Can you try disabling knem and see if that fixes the problem? (i.e.,
>> run with "--mca btl_sm_use_knem 0".) If it fixes the issue, that
>> might mean we have a knem-based bug.
>>
>>
>>
>> On Aug 6, 2010, at 1:42 PM, John Hsu wrote:
>>
>>> Hi,
>>>
>>> sorry for the confusion, that was indeed the trunk version of things
> I was running.
>>>
>>> Here's the same problem using
>>>
>>>
> http://www.open-mpi.org/software/ompi/v1.5/downloads/openmpi-1.5rc5.tar.bz2
>>>
>>> command-line:
>>>
>>> ../openmpi_devel/bin/mpirun -hostfile hostfiles/hostfile.wgsgX
> -npernode 11 ./bin/mpi_test
>>>
>>> back trace on sender:
>>>
>>> (gdb) bt
>>> #0 0x00007fa003bcacf3 in epoll_wait () from /lib/libc.so.6
>>> #1 0x00007fa004f43a4b in epoll_dispatch ()
>>> from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #2 0x00007fa004f4b5fa in opal_event_base_loop ()
>>> from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #3 0x00007fa004f1ce69 in opal_progress ()
>>> from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #4 0x00007f9ffe69be95 in mca_pml_ob1_recv ()
>>> from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/openmpi/mca_pml_ob1.so
>>> #5 0x00007fa004ebb35c in PMPI_Recv ()
>>> from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #6 0x000000000040ae48 in MPI::Comm::Recv (this=0x612800,
> buf=0x7fff8f5cbb50, count=1, datatype=..., source=29,
>>> tag=100, status=...)
>>> at
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
>>> #7 0x0000000000409a57 in main (argc=1, argv=0x7fff8f5cbd78)
>>> at
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/mpi_test/src/mpi_test.cpp:30
>>> (gdb)
>>>
>>> back trace on receiver:
>>>
>>> (gdb) bt
>>> #0 0x00007fcce1ba5cf3 in epoll_wait () from /lib/libc.so.6
>>> #1 0x00007fcce2f1ea4b in epoll_dispatch ()
>>> from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #2 0x00007fcce2f265fa in opal_event_base_loop ()
>>> from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #3 0x00007fcce2ef7e69 in opal_progress ()
>>> from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #4 0x00007fccdc677b1d in mca_pml_ob1_send ()
>>> from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/openmpi/mca_pml_ob1.so
>>> #5 0x00007fcce2e9874f in PMPI_Send ()
>>> from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #6 0x000000000040adda in MPI::Comm::Send (this=0x612800,
> buf=0x7fff3f18ad20, count=1, datatype=..., dest=0, tag=100)
>>> at
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
>>> #7 0x0000000000409b72 in main (argc=1, argv=0x7fff3f18af48)
>>> at
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/mpi_test/src/mpi_test.cpp:38
>>> (gdb)
>>>
>>> and attached is my mpi_test file for reference.
>>>
>>> thanks,
>>> John
>>>
>>>
>>> On Fri, Aug 6, 2010 at 6:24 AM, Ralph Castain <rhc_at_[hidden]>
> wrote:
>>> You clearly have an issue with version confusion. The file cited in
> your warning:
>>>
>>>> [wgsg0:29074] Warning -- mutex was double locked from
> errmgr_hnp.c:772
>>>
>>> does not exist in 1.5rc5. It only exists in the developer's trunk at
> this time. Check to ensure you have the right paths set, blow away the
> install area (in case you have multiple versions installed on top of
> each other), etc.
>>>
>>>
>>>
>>> On Aug 5, 2010, at 5:16 PM, John Hsu wrote:
>>>
>>>> Hi All,
>>>> I am new to openmpi and have encountered an issue using
> pre-release 1.5rc5, for a simple mpi code (see attached). In this test,
> nodes 1 to n sends out a random number to node 0, node 0 sums all
> numbers received.
>>>>
>>>> This code works fine on 1 machine with any number of nodes, and on
> 3 machines running 10 nodes per machine, but when we try to run 11 nodes
> per machine this warning appears:
>>>>
>>>> [wgsg0:29074] Warning -- mutex was double locked from
> errmgr_hnp.c:772
>>>>
>>>> And node 0 (master summing node) hangs on receiving plus another
> random node hangs on sending indefinitely. Below are the back traces:
>>>>
>>>> (gdb) bt
>>>> #0 0x00007f0c5f109cd3 in epoll_wait () from /lib/libc.so.6
>>>> #1 0x00007f0c6052db53 in epoll_dispatch (base=0x2310bf0,
> arg=0x22f91f0, tv=0x7fff90f623e0) at epoll.c:215
>>>> #2 0x00007f0c6053ae58 in opal_event_base_loop (base=0x2310bf0,
> flags=2) at event.c:838
>>>> #3 0x00007f0c6053ac27 in opal_event_loop (flags=2) at event.c:766
>>>> #4 0x00007f0c604ebb5a in opal_progress () at
> runtime/opal_progress.c:189
>>>> #5 0x00007f0c59b79cb1 in opal_condition_wait (c=0x7f0c608003a0,
> m=0x7f0c60800400) at ../../../../opal/threads/
>>>> condition.h:99
>>>> #6 0x00007f0c59b79dff in ompi_request_wait_completion
> (req=0x2538d80) at ../../../../ompi/request/request.h:377
>>>> #7 0x00007f0c59b7a8d7 in mca_pml_ob1_recv (addr=0x7fff90f626a0,
> count=1, datatype=0x612600, src=45, tag=100, comm=0x7f0c607f2b40,
>>>> status=0x7fff90f62668) at pml_ob1_irecv.c:104
>>>> #8 0x00007f0c60425dbc in PMPI_Recv (buf=0x7fff90f626a0, count=1,
> type=0x612600, source=45, tag=100, comm=0x7f0c607f2b40,
> status=0x7fff90f62668)
>>>> at precv.c:78
>>>> #9 0x000000000040ae14 in MPI::Comm::Recv (this=0x612800,
> buf=0x7fff90f626a0, count=1, datatype=..., source=45, tag=100,
> status=...)
>>>> at
> /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/op
> enmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
>>>> #10 0x0000000000409a27 in main (argc=1, argv=0x7fff90f628c8)
>>>> at
> /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/mp
> i_test/src/mpi_test.cpp:30
>>>> (gdb)
>>>>
>>>> and for sender is:
>>>>
>>>> (gdb) bt
>>>> #0 0x00007fedb919fcd3 in epoll_wait () from /lib/libc.so.6
>>>> #1 0x00007fedba5e0a93 in epoll_dispatch (base=0x2171880,
> arg=0x216c6e0, tv=0x7ffffa8a4130) at epoll.c:215
>>>> #2 0x00007fedba5edde0 in opal_event_base_loop (base=0x2171880,
> flags=2) at event.c:838
>>>> #3 0x00007fedba5edbaf in opal_event_loop (flags=2) at event.c:766
>>>> #4 0x00007fedba59c43a in opal_progress () at
> runtime/opal_progress.c:189
>>>> #5 0x00007fedb2796f97 in opal_condition_wait (c=0x7fedba8ba6e0,
> m=0x7fedba8ba740)
>>>> at ../../../../opal/threads/condition.h:99
>>>> #6 0x00007fedb279742e in ompi_request_wait_completion
> (req=0x2392d80) at ../../../../ompi/request/request.h:377
>>>> #7 0x00007fedb2798e0c in mca_pml_ob1_send (buf=0x23b6210,
> count=100, datatype=0x612600, dst=0, tag=100,
>>>> sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x7fedba8ace80) at
> pml_ob1_isend.c:125
>>>> #8 0x00007fedba4c9a08 in PMPI_Send (buf=0x23b6210, count=100,
> type=0x612600, dest=0, tag=100, comm=0x7fedba8ace80)
>>>> at psend.c:75
>>>> #9 0x000000000040ae52 in MPI::Comm::Send (this=0x612800,
> buf=0x23b6210, count=100, datatype=..., dest=0, tag=100)
>>>> at
> /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/op
> enmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
>>>> #10 0x0000000000409bec in main (argc=1, argv=0x7ffffa8a4658)
>>>> at
> /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/mp
> i_test/src/mpi_test.cpp:42
>>>> (gdb)
>>>>
>>>> The "deadlock" appears to be a machine-dependent race condition:
> different machines fail with different combinations of nodes per machine.
>>>>
>>>> Below is my command line for reference:
>>>>
>>>> $ ../openmpi_devel/bin/mpirun -x PATH -hostfile
> hostfiles/hostfile.wgsgX -npernode 11 -mca btl tcp,sm,self -mca
> orte_base_help_aggregate 0 -mca opal_debug_locks 1 ./bin/mpi_test
>>>>
>>>> The problem does not exist in release 1.4.2 or earlier. We are
> testing unreleased codes for potential knem benefits, but can fall back
> to 1.4.2 if necessary.
>>>>
>>>> My apologies in advance if I've missed something basic, thanks for
> any help :)
>>>>
>>>> regards,
>>>> John
>>>> <test.cpp>_______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>> <mpi_test.cpp>_______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> --
>> Jeff Squyres
>> jsquyres_at_[hidden]
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
> ------------------------------
>
> Message: 11
> Date: Mon, 9 Aug 2010 14:48:04 -0700
> From: John Hsu <johnhsu_at_[hidden]>
> Subject: Re: [OMPI users] deadlock in openmpi 1.5rc5
> To: Open MPI Users <users_at_[hidden]>
> Cc: Brice Goglin <Brice.Goglin_at_[hidden]>
> Message-ID:
> <AANLkTimpmgtuZMSdmGAfReoNzzdX9KRPz+wtxRgaHuqE_at_[hidden]>
> Content-Type: text/plain; charset="iso-8859-1"
>
> I've replied in the ticket.
> https://svn.open-mpi.org/trac/ompi/ticket/2530#comment:2
> thanks!
> John
>
> On Mon, Aug 9, 2010 at 2:42 PM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>
>> I've opened a ticket about this -- if it's an actual problem, it's a 1.5 blocker:
>>
>> https://svn.open-mpi.org/trac/ompi/ticket/2530
>>
>> What version of knem and Linux are you using?
>>
>>
>>
>> On Aug 9, 2010, at 4:50 PM, John Hsu wrote:
>>
>>> problem "fixed" by adding the --mca btl_sm_use_knem 0 option (with -npernode 11), so I proceeded to bump up -npernode to 12:
>>>
>>> $ ../openmpi_devel/bin/mpirun -hostfile hostfiles/hostfile.wgsgX -npernode 12 --mca btl_sm_use_knem 0 ./bin/mpi_test
>>>
>>> and the same error occurs,
>>>
>>> (gdb) bt
>>> #0 0x00007fcca6ae5cf3 in epoll_wait () from /lib/libc.so.6
>>> #1 0x00007fcca7e5ea4b in epoll_dispatch ()
>>>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
>>> #2 0x00007fcca7e665fa in opal_event_base_loop ()
>>>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
>>> #3 0x00007fcca7e37e69 in opal_progress ()
>>>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
>>> #4 0x00007fcca15b6e95 in mca_pml_ob1_recv ()
>>>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/openmpi/mca_pml_ob1.so
>>> #5 0x00007fcca7dd635c in PMPI_Recv ()
>>>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
>>> #6 0x000000000040ae48 in MPI::Comm::Recv (this=0x612800, buf=0x7fff2a0d7e00, count=1, datatype=..., source=23, tag=100, status=...)
>>>    at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
>>> #7 0x0000000000409a57 in main (argc=1, argv=0x7fff2a0d8028)
>>>    at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:30
>>> (gdb)
>>>
>>>
>>> (gdb) bt
>>> #0 0x00007f5dc31d2cf3 in epoll_wait () from /lib/libc.so.6
>>> #1 0x00007f5dc454ba4b in epoll_dispatch ()
>>>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
>>> #2 0x00007f5dc45535fa in opal_event_base_loop ()
>>>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
>>> #3 0x00007f5dc4524e69 in opal_progress ()
>>>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
>>> #4 0x00007f5dbdca4b1d in mca_pml_ob1_send ()
>>>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/openmpi/mca_pml_ob1.so
>>> #5 0x00007f5dc44c574f in PMPI_Send ()
>>>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
>>> #6 0x000000000040adda in MPI::Comm::Send (this=0x612800, buf=0x7fff6e0c0790, count=1, datatype=..., dest=0, tag=100)
>>>    at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
>>> #7 0x0000000000409b72 in main (argc=1, argv=0x7fff6e0c09b8)
>>>    at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:38
>>> (gdb)
>>>
>>>
>>>
>>>
>>> On Mon, Aug 9, 2010 at 6:39 AM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>>> In your first mail, you mentioned that you are testing the new knem support.
>>>
>>> Can you try disabling knem and see if that fixes the problem? (i.e., run with "--mca btl_sm_use_knem 0".) If it fixes the issue, that might mean we have a knem-based bug.
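[Editor's note: the following is not from the thread, but an assumption based on Open MPI's standard MCA configuration mechanism. MCA parameters such as btl_sm_use_knem can also be set persistently in a parameter file rather than on every mpirun command line, which is convenient while bisecting a problem like this one:]

```
# $HOME/.openmpi/mca-params.conf -- hypothetical example; equivalent to
# passing "--mca btl_sm_use_knem 0" on the mpirun command line
btl_sm_use_knem = 0
```

Command-line -mca flags override values from the parameter file, so a one-off test run can still re-enable knem without editing the file.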
>>>
>>>
>>>
>>> On Aug 6, 2010, at 1:42 PM, John Hsu wrote:
>>>
>>>> Hi,
>>>>
>>>> sorry for the confusion, that was indeed the trunk version of things I was running.
>>>>
>>>> Here's the same problem using
>>>>
>>>>
>>>> http://www.open-mpi.org/software/ompi/v1.5/downloads/openmpi-1.5rc5.tar.bz2
>>>>
>>>> command-line:
>>>>
>>>> ../openmpi_devel/bin/mpirun -hostfile hostfiles/hostfile.wgsgX -npernode 11 ./bin/mpi_test
>>>>
>>>> back trace on receiver:
>>>>
>>>> (gdb) bt
>>>> #0 0x00007fa003bcacf3 in epoll_wait () from /lib/libc.so.6
>>>> #1 0x00007fa004f43a4b in epoll_dispatch ()
>>>>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
>>>> #2 0x00007fa004f4b5fa in opal_event_base_loop ()
>>>>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
>>>> #3 0x00007fa004f1ce69 in opal_progress ()
>>>>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
>>>> #4 0x00007f9ffe69be95 in mca_pml_ob1_recv ()
>>>>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/openmpi/mca_pml_ob1.so
>>>> #5 0x00007fa004ebb35c in PMPI_Recv ()
>>>>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
>>>> #6 0x000000000040ae48 in MPI::Comm::Recv (this=0x612800, buf=0x7fff8f5cbb50, count=1, datatype=..., source=29, tag=100, status=...)
>>>>    at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
>>>> #7 0x0000000000409a57 in main (argc=1, argv=0x7fff8f5cbd78)
>>>>    at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:30
>>>> (gdb)
>>>>
>>>> back trace on sender:
>>>>
>>>> (gdb) bt
>>>> #0 0x00007fcce1ba5cf3 in epoll_wait () from /lib/libc.so.6
>>>> #1 0x00007fcce2f1ea4b in epoll_dispatch ()
>>>>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
>>>> #2 0x00007fcce2f265fa in opal_event_base_loop ()
>>>>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
>>>> #3 0x00007fcce2ef7e69 in opal_progress ()
>>>>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
>>>> #4 0x00007fccdc677b1d in mca_pml_ob1_send ()
>>>>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/openmpi/mca_pml_ob1.so
>>>> #5 0x00007fcce2e9874f in PMPI_Send ()
>>>>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
>>>> #6 0x000000000040adda in MPI::Comm::Send (this=0x612800, buf=0x7fff3f18ad20, count=1, datatype=..., dest=0, tag=100)
>>>>    at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
>>>> #7 0x0000000000409b72 in main (argc=1, argv=0x7fff3f18af48)
>>>>    at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:38
>>>> (gdb)
>>>>
>>>> and attached is my mpi_test file for reference.
>>>>
>>>> thanks,
>>>> John
>>>>
>>>>
>>>> On Fri, Aug 6, 2010 at 6:24 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>> You clearly have an issue with version confusion. The file cited in your warning:
>>>>
>>>>> [wgsg0:29074] Warning -- mutex was double locked from errmgr_hnp.c:772
>>>>
>>>> does not exist in 1.5rc5. It only exists in the developer's trunk at this time. Check to ensure you have the right paths set, blow away the install area (in case you have multiple versions installed on top of each other), etc.
>>>>
>>>>
>>>>
>>>> On Aug 5, 2010, at 5:16 PM, John Hsu wrote:
>>>>
>>>>> Hi All,
>>>>> I am new to openmpi and have encountered an issue using pre-release 1.5rc5 with a simple MPI code (see attached). In this test, nodes 1 to n each send a random number to node 0, and node 0 sums all numbers received.
>>>>>
>>>>> This code works fine on 1 machine with any number of nodes, and on 3 machines running 10 nodes per machine, but when we try to run 11 nodes per machine this warning appears:
>>>>>
>>>>> [wgsg0:29074] Warning -- mutex was double locked from errmgr_hnp.c:772
>>>>>
>>>>> And node 0 (the master summing node) hangs on receiving, plus another random node hangs on sending indefinitely. Below are the back traces:
>>>>>
>>>>> (gdb) bt
>>>>> #0 0x00007f0c5f109cd3 in epoll_wait () from /lib/libc.so.6
>>>>> #1 0x00007f0c6052db53 in epoll_dispatch (base=0x2310bf0, arg=0x22f91f0, tv=0x7fff90f623e0) at epoll.c:215
>>>>> #2 0x00007f0c6053ae58 in opal_event_base_loop (base=0x2310bf0, flags=2) at event.c:838
>>>>> #3 0x00007f0c6053ac27 in opal_event_loop (flags=2) at event.c:766
>>>>> #4 0x00007f0c604ebb5a in opal_progress () at runtime/opal_progress.c:189
>>>>> #5 0x00007f0c59b79cb1 in opal_condition_wait (c=0x7f0c608003a0, m=0x7f0c60800400) at ../../../../opal/threads/condition.h:99
>>>>> #6 0x00007f0c59b79dff in ompi_request_wait_completion (req=0x2538d80) at ../../../../ompi/request/request.h:377
>>>>> #7 0x00007f0c59b7a8d7 in mca_pml_ob1_recv (addr=0x7fff90f626a0, count=1, datatype=0x612600, src=45, tag=100, comm=0x7f0c607f2b40, status=0x7fff90f62668) at pml_ob1_irecv.c:104
>>>>> #8 0x00007f0c60425dbc in PMPI_Recv (buf=0x7fff90f626a0, count=1, type=0x612600, source=45, tag=100, comm=0x7f0c607f2b40, status=0x7fff90f62668) at precv.c:78
>>>>> #9 0x000000000040ae14 in MPI::Comm::Recv (this=0x612800, buf=0x7fff90f626a0, count=1, datatype=..., source=45, tag=100, status=...)
>>>>>    at /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
>>>>> #10 0x0000000000409a27 in main (argc=1, argv=0x7fff90f628c8)
>>>>>    at /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:30
>>>>> (gdb)
>>>>>
>>>>> and for sender is:
>>>>>
>>>>> (gdb) bt
>>>>> #0 0x00007fedb919fcd3 in epoll_wait () from /lib/libc.so.6
>>>>> #1 0x00007fedba5e0a93 in epoll_dispatch (base=0x2171880, arg=0x216c6e0, tv=0x7ffffa8a4130) at epoll.c:215
>>>>> #2 0x00007fedba5edde0 in opal_event_base_loop (base=0x2171880, flags=2) at event.c:838
>>>>> #3 0x00007fedba5edbaf in opal_event_loop (flags=2) at event.c:766
>>>>> #4 0x00007fedba59c43a in opal_progress () at runtime/opal_progress.c:189
>>>>> #5 0x00007fedb2796f97 in opal_condition_wait (c=0x7fedba8ba6e0, m=0x7fedba8ba740) at ../../../../opal/threads/condition.h:99
>>>>> #6 0x00007fedb279742e in ompi_request_wait_completion (req=0x2392d80) at ../../../../ompi/request/request.h:377
>>>>> #7 0x00007fedb2798e0c in mca_pml_ob1_send (buf=0x23b6210, count=100, datatype=0x612600, dst=0, tag=100, sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x7fedba8ace80) at pml_ob1_isend.c:125
>>>>> #8 0x00007fedba4c9a08 in PMPI_Send (buf=0x23b6210, count=100, type=0x612600, dest=0, tag=100, comm=0x7fedba8ace80) at psend.c:75
>>>>> #9 0x000000000040ae52 in MPI::Comm::Send (this=0x612800, buf=0x23b6210, count=100, datatype=..., dest=0, tag=100)
>>>>>    at /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
>>>>> #10 0x0000000000409bec in main (argc=1, argv=0x7ffffa8a4658)
>>>>>    at /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:42
>>>>> (gdb)
>>>>>
>>>>> The "deadlock" appears to be a machine-dependent race condition; different machines fail with different combinations of nodes per machine.
>>>>>
>>>>> Below is my command line for reference:
>>>>>
>>>>> $ ../openmpi_devel/bin/mpirun -x PATH -hostfile hostfiles/hostfile.wgsgX -npernode 11 -mca btl tcp,sm,self -mca orte_base_help_aggregate 0 -mca opal_debug_locks 1 ./bin/mpi_test
>>>>>
>>>>> The problem does not exist in release 1.4.2 or earlier. We are testing unreleased code for potential knem benefits, but can fall back to 1.4.2 if necessary.
>>>>>
>>>>> My apologies in advance if I've missed something basic, thanks for any help :)
>>>>>
>>>>> regards,
>>>>> John
>>>>> <test.cpp>
>>>>
>>>> <mpi_test.cpp>
>>>
>>>
>>
>>
>>
> -------------- next part --------------
> HTML attachment scrubbed and removed
>
> ------------------------------
>
>
> End of users Digest, Vol 1655, Issue 3
> **************************************
>
> <mpi4py-ompi-bug.py>