Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Checkpointing mpi4py program
From: ananda.mudar_at_[hidden]
Date: 2010-08-10 14:18:35


Josh

Please find attached the Python program that reproduces the hang that
I described. The initial part of the file describes the prerequisite
modules and the steps to reproduce the problem. Please let me know if
you have any questions about reproducing the hang.

Please note that if I add the following lines at the end of the program
(for the case where sleep_time is True), the problem disappears, i.e.,
the program resumes successfully after the checkpoint completes.
# Add following lines at the end for sleep_time is True
else:
        time.sleep(0.1)
# End of added lines

Thanks a lot for your time in looking into this issue.

Regards
Ananda

Ananda B Mudar, PMP
Senior Technical Architect
Wipro Technologies
Ph: 972 765 8093
ananda.mudar_at_[hidden]

-----Original Message-----
Date: Mon, 9 Aug 2010 16:37:58 -0400
From: Joshua Hursey <jjhursey_at_[hidden]>
Subject: Re: [OMPI users] Checkpointing mpi4py program
To: Open MPI Users <users_at_[hidden]>
Message-ID: <270BD450-743A-4662-9568-1FEDFCC6F9C6_at_[hidden]>
Content-Type: text/plain; charset=windows-1252

I have not tried to checkpoint an mpi4py application, so I cannot say
for sure if it works or not. You might be hitting something with the
Python runtime interacting in an odd way with either Open MPI or BLCR.

Can you attach a debugger and get a backtrace on a stuck checkpoint?
That might show us where things are held up.

-- Josh

On Aug 9, 2010, at 4:04 PM, <ananda.mudar_at_[hidden]>
<ananda.mudar_at_[hidden]> wrote:

> Hi
>
> I have integrated mpi4py with openmpi 1.4.2 that was built with BLCR
> 0.8.2. When I run ompi-checkpoint on a program written using mpi4py,
> the program sometimes doesn't resume after the checkpoint has been
> created successfully. This doesn't always occur: most of the time the
> program resumes after the checkpoint and completes successfully. Has
> anyone tested the checkpoint/restart functionality with mpi4py programs?
> Are there any best practices that I should keep in mind while
> checkpointing mpi4py programs?
>
> Thanks for your time
> - Ananda
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

------------------------------

Message: 8
Date: Mon, 9 Aug 2010 13:50:03 -0700
From: John Hsu <johnhsu_at_[hidden]>
Subject: Re: [OMPI users] deadlock in openmpi 1.5rc5
To: Open MPI Users <users_at_[hidden]>
Message-ID:
        <AANLkTim63t=wQMeWfHWNnvnVKajOe92e7NG3X=Warwr0_at_[hidden]>
Content-Type: text/plain; charset="iso-8859-1"

problem "fixed" by adding the --mca btl_sm_use_knem 0 option (with
-npernode 11), so I proceeded to bump up -npernode to 12:

$ ../openmpi_devel/bin/mpirun -hostfile hostfiles/hostfile.wgsgX -npernode 12 --mca btl_sm_use_knem 0 ./bin/mpi_test

and the same error occurs,

(gdb) bt
#0 0x00007fcca6ae5cf3 in epoll_wait () from /lib/libc.so.6
#1 0x00007fcca7e5ea4b in epoll_dispatch ()
   from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
#2 0x00007fcca7e665fa in opal_event_base_loop ()
   from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
#3 0x00007fcca7e37e69 in opal_progress ()
   from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
#4 0x00007fcca15b6e95 in mca_pml_ob1_recv ()
   from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/openmpi/mca_pml_ob1.so
#5 0x00007fcca7dd635c in PMPI_Recv ()
   from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
#6 0x000000000040ae48 in MPI::Comm::Recv (this=0x612800, buf=0x7fff2a0d7e00,
    count=1, datatype=..., source=23, tag=100, status=...)
    at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
#7 0x0000000000409a57 in main (argc=1, argv=0x7fff2a0d8028)
    at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:30
(gdb)

(gdb) bt
#0 0x00007f5dc31d2cf3 in epoll_wait () from /lib/libc.so.6
#1 0x00007f5dc454ba4b in epoll_dispatch ()
   from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
#2 0x00007f5dc45535fa in opal_event_base_loop ()
   from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
#3 0x00007f5dc4524e69 in opal_progress ()
   from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
#4 0x00007f5dbdca4b1d in mca_pml_ob1_send ()
   from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/openmpi/mca_pml_ob1.so
#5 0x00007f5dc44c574f in PMPI_Send ()
   from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
#6 0x000000000040adda in MPI::Comm::Send (this=0x612800, buf=0x7fff6e0c0790,
    count=1, datatype=..., dest=0, tag=100)
    at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
#7 0x0000000000409b72 in main (argc=1, argv=0x7fff6e0c09b8)
    at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:38
(gdb)

On Mon, Aug 9, 2010 at 6:39 AM, Jeff Squyres <jsquyres_at_[hidden]> wrote:

> In your first mail, you mentioned that you are testing the new knem
> support.
>
> Can you try disabling knem and see if that fixes the problem? (i.e., run
> with "--mca btl_sm_use_knem 0") If it fixes the issue, that might mean we
> have a knem-based bug.
>
>
>
> On Aug 6, 2010, at 1:42 PM, John Hsu wrote:
>
> > Hi,
> >
> > sorry for the confusion, that was indeed the trunk version of things I
> > was running.
> >
> > Here's the same problem using
> >
> > http://www.open-mpi.org/software/ompi/v1.5/downloads/openmpi-1.5rc5.tar.bz2
> >
> > command-line:
> >
> > ../openmpi_devel/bin/mpirun -hostfile hostfiles/hostfile.wgsgX -npernode 11 ./bin/mpi_test
> >
> > back trace on receiver:
> >
> > (gdb) bt
> > #0 0x00007fa003bcacf3 in epoll_wait () from /lib/libc.so.6
> > #1 0x00007fa004f43a4b in epoll_dispatch ()
> >    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #2 0x00007fa004f4b5fa in opal_event_base_loop ()
> >    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #3 0x00007fa004f1ce69 in opal_progress ()
> >    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #4 0x00007f9ffe69be95 in mca_pml_ob1_recv ()
> >    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/openmpi/mca_pml_ob1.so
> > #5 0x00007fa004ebb35c in PMPI_Recv ()
> >    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #6 0x000000000040ae48 in MPI::Comm::Recv (this=0x612800, buf=0x7fff8f5cbb50, count=1, datatype=..., source=29,
> >     tag=100, status=...)
> >     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
> > #7 0x0000000000409a57 in main (argc=1, argv=0x7fff8f5cbd78)
> >     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:30
> > (gdb)
> >
> > back trace on sender:
> >
> > (gdb) bt
> > #0 0x00007fcce1ba5cf3 in epoll_wait () from /lib/libc.so.6
> > #1 0x00007fcce2f1ea4b in epoll_dispatch ()
> >    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #2 0x00007fcce2f265fa in opal_event_base_loop ()
> >    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #3 0x00007fcce2ef7e69 in opal_progress ()
> >    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #4 0x00007fccdc677b1d in mca_pml_ob1_send ()
> >    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/openmpi/mca_pml_ob1.so
> > #5 0x00007fcce2e9874f in PMPI_Send ()
> >    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #6 0x000000000040adda in MPI::Comm::Send (this=0x612800, buf=0x7fff3f18ad20, count=1, datatype=..., dest=0, tag=100)
> >     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
> > #7 0x0000000000409b72 in main (argc=1, argv=0x7fff3f18af48)
> >     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:38
> > (gdb)
> >
> > and attached is my mpi_test file for reference.
> >
> > thanks,
> > John
> >
> >
> > On Fri, Aug 6, 2010 at 6:24 AM, Ralph Castain <rhc_at_[hidden]> wrote:
> > You clearly have an issue with version confusion. The file cited in your
> > warning:
> >
> > > [wgsg0:29074] Warning -- mutex was double locked from errmgr_hnp.c:772
> >
> > does not exist in 1.5rc5. It only exists in the developer's trunk at this
> > time. Check to ensure you have the right paths set, blow away the install
> > area (in case you have multiple versions installed on top of each other),
> > etc.
> >
> >
> >
> > On Aug 5, 2010, at 5:16 PM, John Hsu wrote:
> >
> > > Hi All,
> > > I am new to openmpi and have encountered an issue using pre-release
> > > 1.5rc5, for a simple mpi code (see attached). In this test, nodes 1 to n
> > > each send a random number to node 0, and node 0 sums all numbers received.
> > >
> > > This code works fine on 1 machine with any number of nodes, and on 3
> > > machines running 10 nodes per machine, but when we try to run 11 nodes per
> > > machine this warning appears:
> > >
> > > [wgsg0:29074] Warning -- mutex was double locked from errmgr_hnp.c:772
> > >
> > > And node 0 (master summing node) hangs on receiving, while another random
> > > node hangs on sending indefinitely. Below are the back traces:
> > >
> > > (gdb) bt
> > > #0 0x00007f0c5f109cd3 in epoll_wait () from /lib/libc.so.6
> > > #1 0x00007f0c6052db53 in epoll_dispatch (base=0x2310bf0, arg=0x22f91f0, tv=0x7fff90f623e0) at epoll.c:215
> > > #2 0x00007f0c6053ae58 in opal_event_base_loop (base=0x2310bf0, flags=2) at event.c:838
> > > #3 0x00007f0c6053ac27 in opal_event_loop (flags=2) at event.c:766
> > > #4 0x00007f0c604ebb5a in opal_progress () at runtime/opal_progress.c:189
> > > #5 0x00007f0c59b79cb1 in opal_condition_wait (c=0x7f0c608003a0, m=0x7f0c60800400) at ../../../../opal/threads/condition.h:99
> > > #6 0x00007f0c59b79dff in ompi_request_wait_completion (req=0x2538d80) at ../../../../ompi/request/request.h:377
> > > #7 0x00007f0c59b7a8d7 in mca_pml_ob1_recv (addr=0x7fff90f626a0, count=1, datatype=0x612600, src=45, tag=100, comm=0x7f0c607f2b40,
> > >     status=0x7fff90f62668) at pml_ob1_irecv.c:104
> > > #8 0x00007f0c60425dbc in PMPI_Recv (buf=0x7fff90f626a0, count=1, type=0x612600, source=45, tag=100, comm=0x7f0c607f2b40, status=0x7fff90f62668)
> > >     at precv.c:78
> > > #9 0x000000000040ae14 in MPI::Comm::Recv (this=0x612800, buf=0x7fff90f626a0, count=1, datatype=..., source=45, tag=100, status=...)
> > >     at /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
> > > #10 0x0000000000409a27 in main (argc=1, argv=0x7fff90f628c8)
> > >     at /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:30
> > > (gdb)
> > >
> > > and for sender is:
> > >
> > > (gdb) bt
> > > #0 0x00007fedb919fcd3 in epoll_wait () from /lib/libc.so.6
> > > #1 0x00007fedba5e0a93 in epoll_dispatch (base=0x2171880, arg=0x216c6e0, tv=0x7ffffa8a4130) at epoll.c:215
> > > #2 0x00007fedba5edde0 in opal_event_base_loop (base=0x2171880, flags=2) at event.c:838
> > > #3 0x00007fedba5edbaf in opal_event_loop (flags=2) at event.c:766
> > > #4 0x00007fedba59c43a in opal_progress () at runtime/opal_progress.c:189
> > > #5 0x00007fedb2796f97 in opal_condition_wait (c=0x7fedba8ba6e0, m=0x7fedba8ba740)
> > >     at ../../../../opal/threads/condition.h:99
> > > #6 0x00007fedb279742e in ompi_request_wait_completion (req=0x2392d80) at ../../../../ompi/request/request.h:377
> > > #7 0x00007fedb2798e0c in mca_pml_ob1_send (buf=0x23b6210, count=100, datatype=0x612600, dst=0, tag=100,
> > >     sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x7fedba8ace80) at pml_ob1_isend.c:125
> > > #8 0x00007fedba4c9a08 in PMPI_Send (buf=0x23b6210, count=100, type=0x612600, dest=0, tag=100, comm=0x7fedba8ace80)
> > >     at psend.c:75
> > > #9 0x000000000040ae52 in MPI::Comm::Send (this=0x612800, buf=0x23b6210, count=100, datatype=..., dest=0, tag=100)
> > >     at /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
> > > #10 0x0000000000409bec in main (argc=1, argv=0x7ffffa8a4658)
> > >     at /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:42
> > > (gdb)
> > >
> > > The "deadlock" appears to be a machine-dependent race condition;
> > > different machines fail with different combinations of nodes / machine.
> > >
> > > Below is my command line for reference:
> > >
> > > $ ../openmpi_devel/bin/mpirun -x PATH -hostfile hostfiles/hostfile.wgsgX -npernode 11 -mca btl tcp,sm,self -mca orte_base_help_aggregate 0 -mca opal_debug_locks 1 ./bin/mpi_test
> > >
> > > The problem does not exist in release 1.4.2 or earlier. We are testing
> > > unreleased code for potential knem benefits, but can fall back to 1.4.2 if
> > > necessary.
> > >
> > > My apologies in advance if I've missed something basic, thanks for any
> > > help :)
> > >
> > > regards,
> > > John
> > > <test.cpp>
> >
> > <mpi_test.cpp>
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>

------------------------------

Message: 9
Date: Mon, 9 Aug 2010 23:02:51 +0200
From: Riccardo Murri <riccardo.murri_at_[hidden]>
Subject: Re: [OMPI users] MPI Template Datatype?
To: Open MPI Users <users_at_[hidden]>
Message-ID:
        <AANLkTi=Peq+CQ6t+EXaKhwOT=wd0B8VjWc88sHXqrdYw_at_[hidden]>
Content-Type: text/plain; charset=UTF-8

Hi Alexandru,

you can read all about Boost.MPI at:

  http://www.boost.org/doc/libs/1_43_0/doc/html/mpi.html

On Mon, Aug 9, 2010 at 10:27 PM, Alexandru Blidaru <alexsb92_at_[hidden]> wrote:
> I basically have to implement a 4D vector. An additional goal of my project
> is to support char, int, float and double datatypes in the vector.

If your "vector" is fixed-size (i.e., all vectors consist of
4 elements), then you can likely dispense with std::vector and use
C-style arrays with templated send/receive calls (these would
just be interfaces to MPI_Send/MPI_Recv):

   // BEWARE: untested code!!!

   #include <stdexcept>
   #include <mpi.h>

   // Generic versions: reached only for element types with no specialization
   template <typename T>
   int send(T* vector, int dest, int tag, MPI_Comm comm) {
       throw std::logic_error("called generic MyVector::send");
   }

   template <typename T>
   int recv(T* vector, int source, int tag, MPI_Comm comm) {
       throw std::logic_error("called generic MyVector::recv");
   }

and then you specialize the template for the types you actually use:

  template <>
  int send<double>(double* vector, int dest, int tag, MPI_Comm comm)
  {
    return MPI_Send(vector, 4, MPI_DOUBLE, dest, tag, comm);
  }

  template <>
  int recv<double>(double* vector, int src, int tag, MPI_Comm comm)
  {
    // MPI_Recv also takes a status argument; MPI_STATUS_IGNORE discards it
    return MPI_Recv(vector, 4, MPI_DOUBLE, src, tag, comm, MPI_STATUS_IGNORE);
  }

  // etc.

However, let me warn you that it would likely take more time and
effort to write all the template specializations and get them working
than to just use Boost.MPI.

Best regards,
Riccardo
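
One possible way to reduce the number of specializations mentioned above is
to specialize only a small traits struct that maps the element type to a
datatype, and keep a single generic send function. The following is a minimal
sketch of that idea, not code from this thread. So that it compiles without
<mpi.h>, the enum Datatype is a stand-in for MPI_Datatype, and the names
datatype_of, SendArgs, describe_send and kVectorLength are all hypothetical.
In real code the trait would presumably return MPI_CHAR, MPI_INT, MPI_FLOAT
or MPI_DOUBLE, and the generic function would call MPI_Send directly.

```cpp
#include <cassert>

// Stand-in for MPI_Datatype so this sketch builds without <mpi.h> (assumption).
// With MPI, the trait below would return MPI_CHAR, MPI_INT, MPI_FLOAT, MPI_DOUBLE.
enum class Datatype { Char, Int, Float, Double };

// Primary template is declared but never defined, so an unsupported element
// type fails at compile time instead of throwing at run time.
template <typename T> struct datatype_of;

template <> struct datatype_of<char> {
    static constexpr Datatype get() { return Datatype::Char; }
};
template <> struct datatype_of<int> {
    static constexpr Datatype get() { return Datatype::Int; }
};
template <> struct datatype_of<float> {
    static constexpr Datatype get() { return Datatype::Float; }
};
template <> struct datatype_of<double> {
    static constexpr Datatype get() { return Datatype::Double; }
};

constexpr int kVectorLength = 4;  // fixed-size 4D vector, as discussed above

// Hypothetical record of the arguments a real send() would hand to MPI_Send.
struct SendArgs {
    const void* buf;
    int count;
    Datatype type;
    int dest;
    int tag;
};

// One generic function covers every supported element type. Real code would
// instead do something like:
//   return MPI_Send(vector, kVectorLength, <mapped datatype>, dest, tag, comm);
template <typename T>
SendArgs describe_send(const T* vector, int dest, int tag) {
    return SendArgs{vector, kVectorLength, datatype_of<T>::get(), dest, tag};
}
```

With this layout, supporting another element type means adding one
datatype_of specialization rather than a send/recv pair. For example,
describe_send(v, 0, 100) on a double v[4] selects Datatype::Double, while
calling it on an unsupported type such as long* does not compile.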

------------------------------

Message: 10
Date: Mon, 9 Aug 2010 17:42:26 -0400
From: Jeff Squyres <jsquyres_at_[hidden]>
Subject: Re: [OMPI users] deadlock in openmpi 1.5rc5
To: "Open MPI Users" <users_at_[hidden]>
Cc: Brice Goglin <Brice.Goglin_at_[hidden]>
Message-ID: <7283451E-8C4A-4F15-B8E5-649349ABBE0C_at_[hidden]>
Content-Type: text/plain; charset=us-ascii

I've opened a ticket about this -- if it's an actual problem, it's a 1.5
blocker:

    https://svn.open-mpi.org/trac/ompi/ticket/2530

What version of knem and Linux are you using?

On Aug 9, 2010, at 4:50 PM, John Hsu wrote:

> problem "fixed" by adding the --mca btl_sm_use_knem 0 option (with
> -npernode 11), so I proceeded to bump up -npernode to 12:
>
> $ ../openmpi_devel/bin/mpirun -hostfile hostfiles/hostfile.wgsgX -npernode 12 --mca btl_sm_use_knem 0 ./bin/mpi_test
>
> and the same error occurs,
>

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

------------------------------

Message: 11
Date: Mon, 9 Aug 2010 14:48:04 -0700
From: John Hsu <johnhsu_at_[hidden]>
Subject: Re: [OMPI users] deadlock in openmpi 1.5rc5
To: Open MPI Users <users_at_[hidden]>
Cc: Brice Goglin <Brice.Goglin_at_[hidden]>
Message-ID:
        <AANLkTimpmgtuZMSdmGAfReoNzzdX9KRPz+wtxRgaHuqE_at_[hidden]>
Content-Type: text/plain; charset="iso-8859-1"

I've replied in the ticket.

https://svn.open-mpi.org/trac/ompi/ticket/2530#comment:2

thanks!
John

On Mon, Aug 9, 2010 at 2:42 PM, Jeff Squyres <jsquyres_at_[hidden]> wrote:

> I've opened a ticket about this -- if it's an actual problem, it's a 1.5
> blocker:
>
>    https://svn.open-mpi.org/trac/ompi/ticket/2530
>
> What version of knem and Linux are you using?
>
>
>
> On Aug 9, 2010, at 4:50 PM, John Hsu wrote:
>
> > problem "fixed" by adding the --mca btl_sm_use_knem 0 option (with
> -npernode 11), so I proceeded to bump up -npernode to 12:
> >
> > $ ../openmpi_devel/bin/mpirun -hostfile hostfiles/hostfile.wgsgX
> -npernode 12 --mca btl_sm_use_knem 0  ./bin/mpi_test
> >
> > and the same error occurs,
> >
> > (gdb) bt
> > #0  0x00007fcca6ae5cf3 in epoll_wait () from /lib/libc.so.6
> > #1  0x00007fcca7e5ea4b in epoll_dispatch ()
> >    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #2  0x00007fcca7e665fa in opal_event_base_loop ()
> >    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #3  0x00007fcca7e37e69 in opal_progress ()
> >    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #4  0x00007fcca15b6e95 in mca_pml_ob1_recv ()
> >    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/openmpi/mca_pml_ob1.so
> > #5  0x00007fcca7dd635c in PMPI_Recv ()
> >    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #6  0x000000000040ae48 in MPI::Comm::Recv (this=0x612800, buf=0x7fff2a0d7e00,
> >     count=1, datatype=..., source=23, tag=100, status=...)
> >     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
> > #7  0x0000000000409a57 in main (argc=1, argv=0x7fff2a0d8028)
> >     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:30
> > (gdb)
> >
> >
> > (gdb) bt
> > #0  0x00007f5dc31d2cf3 in epoll_wait () from /lib/libc.so.6
> > #1  0x00007f5dc454ba4b in epoll_dispatch ()
> >    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #2  0x00007f5dc45535fa in opal_event_base_loop ()
> >    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #3  0x00007f5dc4524e69 in opal_progress ()
> >    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #4  0x00007f5dbdca4b1d in mca_pml_ob1_send ()
> >    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/openmpi/mca_pml_ob1.so
> > #5  0x00007f5dc44c574f in PMPI_Send ()
> >    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #6  0x000000000040adda in MPI::Comm::Send (this=0x612800, buf=0x7fff6e0c0790,
> >     count=1, datatype=..., dest=0, tag=100)
> >     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
> > #7  0x0000000000409b72 in main (argc=1, argv=0x7fff6e0c09b8)
> >     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:38
> > (gdb)
> >
> >
> >
> >
> > On Mon, Aug 9, 2010 at 6:39 AM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> > In your first mail, you mentioned that you are testing the new knem support.
> >
> > Can you try disabling knem and see if that fixes the problem? (i.e., run with "--mca btl_sm_use_knem 0")  If it fixes the issue, that might mean we have a knem-based bug.
> >
> >
> >
> > On Aug 6, 2010, at 1:42 PM, John Hsu wrote:
> >
> > > Hi,
> > >
> > > sorry for the confusion, that was indeed the trunk version of things I was running.
> > >
> > > Here's the same problem using
> > >
> > > http://www.open-mpi.org/software/ompi/v1.5/downloads/openmpi-1.5rc5.tar.bz2
> > >
> > > command-line:
> > >
> > > ../openmpi_devel/bin/mpirun -hostfile hostfiles/hostfile.wgsgX -npernode 11 ./bin/mpi_test
> > >
> > > back trace on receiver:
> > >
> > > (gdb) bt
> > > #0  0x00007fa003bcacf3 in epoll_wait () from /lib/libc.so.6
> > > #1  0x00007fa004f43a4b in epoll_dispatch ()
> > >    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > > #2  0x00007fa004f4b5fa in opal_event_base_loop ()
> > >    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > > #3  0x00007fa004f1ce69 in opal_progress ()
> > >    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > > #4  0x00007f9ffe69be95 in mca_pml_ob1_recv ()
> > >    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/openmpi/mca_pml_ob1.so
> > > #5  0x00007fa004ebb35c in PMPI_Recv ()
> > >    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > > #6  0x000000000040ae48 in MPI::Comm::Recv (this=0x612800, buf=0x7fff8f5cbb50, count=1, datatype=..., source=29,
> > >     tag=100, status=...)
> > >     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
> > > #7  0x0000000000409a57 in main (argc=1, argv=0x7fff8f5cbd78)
> > >     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:30
> > > (gdb)
> > >
> > > back trace on sender:
> > >
> > > (gdb) bt
> > > #0  0x00007fcce1ba5cf3 in epoll_wait () from /lib/libc.so.6
> > > #1  0x00007fcce2f1ea4b in epoll_dispatch ()
> > >    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > > #2  0x00007fcce2f265fa in opal_event_base_loop ()
> > >    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > > #3  0x00007fcce2ef7e69 in opal_progress ()
> > >    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > > #4  0x00007fccdc677b1d in mca_pml_ob1_send ()
> > >    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/openmpi/mca_pml_ob1.so
> > > #5  0x00007fcce2e9874f in PMPI_Send ()
> > >    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > > #6  0x000000000040adda in MPI::Comm::Send (this=0x612800, buf=0x7fff3f18ad20, count=1, datatype=..., dest=0, tag=100)
> > >     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
> > > #7  0x0000000000409b72 in main (argc=1, argv=0x7fff3f18af48)
> > >     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:38
> > > (gdb)
> > >
> > > and attached is my mpi_test file for reference.
> > >
> > > thanks,
> > > John
> > >
> > >
> > > On Fri, Aug 6, 2010 at 6:24 AM, Ralph Castain <rhc_at_[hidden]> wrote:
> > > You clearly have an issue with version confusion. The file cited in your warning:
> > >
> > > > [wgsg0:29074] Warning -- mutex was double locked from errmgr_hnp.c:772
> > >
> > > does not exist in 1.5rc5. It only exists in the developer's trunk at this time. Check to ensure you have the right paths set, blow away the install area (in case you have multiple versions installed on top of each other), etc.
> > >
> > >
> > >
> > > On Aug 5, 2010, at 5:16 PM, John Hsu wrote:
> > >
> > > > Hi All,
> > > > I am new to openmpi and have encountered an issue using pre-release 1.5rc5, for a simple mpi code (see attached).  In this test, nodes 1 to n send out a random number to node 0, and node 0 sums all numbers received.
> > > >
> > > > This code works fine on 1 machine with any number of nodes, and on 3 machines running 10 nodes per machine, but when we try to run 11 nodes per machine this warning appears:
> > > >
> > > > [wgsg0:29074] Warning -- mutex was double locked from errmgr_hnp.c:772
> > > >
> > > > And node 0 (the master summing node) hangs on receiving, while another random node hangs indefinitely on sending.  Below are the back traces:
> > > >
> > > > (gdb) bt
> > > > #0  0x00007f0c5f109cd3 in epoll_wait () from /lib/libc.so.6
> > > > #1  0x00007f0c6052db53 in epoll_dispatch (base=0x2310bf0, arg=0x22f91f0, tv=0x7fff90f623e0) at epoll.c:215
> > > > #2  0x00007f0c6053ae58 in opal_event_base_loop (base=0x2310bf0, flags=2) at event.c:838
> > > > #3  0x00007f0c6053ac27 in opal_event_loop (flags=2) at event.c:766
> > > > #4  0x00007f0c604ebb5a in opal_progress () at runtime/opal_progress.c:189
> > > > #5  0x00007f0c59b79cb1 in opal_condition_wait (c=0x7f0c608003a0, m=0x7f0c60800400) at ../../../../opal/threads/condition.h:99
> > > > #6  0x00007f0c59b79dff in ompi_request_wait_completion (req=0x2538d80) at ../../../../ompi/request/request.h:377
> > > > #7  0x00007f0c59b7a8d7 in mca_pml_ob1_recv (addr=0x7fff90f626a0, count=1, datatype=0x612600, src=45, tag=100, comm=0x7f0c607f2b40,
> > > >     status=0x7fff90f62668) at pml_ob1_irecv.c:104
> > > > #8  0x00007f0c60425dbc in PMPI_Recv (buf=0x7fff90f626a0, count=1, type=0x612600, source=45, tag=100, comm=0x7f0c607f2b40, status=0x7fff90f62668)
> > > >     at precv.c:78
> > > > #9  0x000000000040ae14 in MPI::Comm::Recv (this=0x612800, buf=0x7fff90f626a0, count=1, datatype=..., source=45, tag=100, status=...)
> > > >     at /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
> > > > #10 0x0000000000409a27 in main (argc=1, argv=0x7fff90f628c8)
> > > >     at /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:30
> > > > (gdb)
> > > >
> > > > and for sender is:
> > > >
> > > > (gdb) bt
> > > > #0  0x00007fedb919fcd3 in epoll_wait () from /lib/libc.so.6
> > > > #1  0x00007fedba5e0a93 in epoll_dispatch (base=0x2171880, arg=0x216c6e0, tv=0x7ffffa8a4130) at epoll.c:215
> > > > #2  0x00007fedba5edde0 in opal_event_base_loop (base=0x2171880, flags=2) at event.c:838
> > > > #3  0x00007fedba5edbaf in opal_event_loop (flags=2) at event.c:766
> > > > #4  0x00007fedba59c43a in opal_progress () at runtime/opal_progress.c:189
> > > > #5  0x00007fedb2796f97 in opal_condition_wait (c=0x7fedba8ba6e0, m=0x7fedba8ba740)
> > > >     at ../../../../opal/threads/condition.h:99
> > > > #6  0x00007fedb279742e in ompi_request_wait_completion (req=0x2392d80) at ../../../../ompi/request/request.h:377
> > > > #7  0x00007fedb2798e0c in mca_pml_ob1_send (buf=0x23b6210, count=100, datatype=0x612600, dst=0, tag=100,
> > > >     sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x7fedba8ace80) at pml_ob1_isend.c:125
> > > > #8  0x00007fedba4c9a08 in PMPI_Send (buf=0x23b6210, count=100, type=0x612600, dest=0, tag=100, comm=0x7fedba8ace80)
> > > >     at psend.c:75
> > > > #9  0x000000000040ae52 in MPI::Comm::Send (this=0x612800, buf=0x23b6210, count=100, datatype=..., dest=0, tag=100)
> > > >     at /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
> > > > #10 0x0000000000409bec in main (argc=1, argv=0x7ffffa8a4658)
> > > >     at /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:42
> > > > (gdb)
> > > >
> > > > The "deadlock" appears to be a machine-dependent race condition; different machines fail with different combinations of nodes / machine.
> > > >
> > > > Below is my command line for reference:
> > > >
> > > > $ ../openmpi_devel/bin/mpirun -x PATH -hostfile hostfiles/hostfile.wgsgX -npernode 11 -mca btl tcp,sm,self -mca orte_base_help_aggregate 0 -mca opal_debug_locks 1  ./bin/mpi_test
> > > >
> > > > The problem does not exist in release 1.4.2 or earlier.  We are testing unreleased code for potential knem benefits, but can fall back to 1.4.2 if necessary.
> > > >
> > > > My apologies in advance if I've missed something basic, thanks for any help :)
> > > >
> > > > regards,
> > > > John
> > > > <test.cpp>_______________________________________________
> > > > users mailing list
> > > > users_at_[hidden]
> > > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> > >
> > >
> > > _______________________________________________
> > > users mailing list
> > > users_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> > >
> > > <mpi_test.cpp>_______________________________________________
> > > users mailing list
> > > users_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> > --
> > Jeff Squyres
> > jsquyres_at_[hidden]
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> >
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
------------------------------
_______________________________________________
users mailing list
users_at_[hidden]
http://www.open-mpi.org/mailman/listinfo.cgi/users
End of users Digest, Vol 1655, Issue 3
**************************************