
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] application hangs with multiple dup
From: Thomas Ropars (tropars_at_[hidden])
Date: 2009-09-09 11:44:01


Ashley Pittman wrote:
> On Tue, 2009-09-08 at 15:00 +0200, Thomas Ropars wrote:
>
>> Hi,
>>
>> I'm working on r21949 of the trunk.
>>
>> When I run this simple program, which calls MPI_Comm_dup twice, on a
>> single node with 4 processes, the processes sometimes hang in the second dup.
>>
>
> I can't reproduce this. How often does it fail? I've run it in a loop
> hundreds of times here without a single hang.
>
It happens once every 4 or 5 runs. It also happens when the processes
are on different nodes.
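Some background on where the second dup is stuck: MPI_Comm_dup must agree on a new context ID across all ranks, and the padb output shows every rank inside ompi_comm_nextcid, which calls an allreduce. The sketch below is a simplified single-process model of that agreement, not Open MPI's code. It assumes the loop works roughly as follows: each rank proposes its lowest free ID, the proposals are combined with a MAX reduction, then each rank reports whether it can accept the winner, and those flags are combined with a MIN reduction. The names NRANKS, MAX_CID, lowest_free and agree_on_cid are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical model sizes: NRANKS ranks, each with its own table of
 * context IDs already in use (true = taken). Not Open MPI's code. */
#define NRANKS 4
#define MAX_CID 16

/* Lowest locally free ID at or above start, or -1 if none is free. */
static int lowest_free(const bool used[MAX_CID], int start)
{
    for (int i = start; i < MAX_CID; i++)
        if (!used[i])
            return i;
    return -1;
}

/* Simulates the assumed agreement loop. The first reduction is a MAX over
 * each rank's proposal, and the second is a MIN over "can I accept it?"
 * flags. Returns the agreed ID and marks it used on every rank, or
 * returns -1 if no common ID exists. */
int agree_on_cid(bool used[NRANKS][MAX_CID])
{
    int start = 0;
    while (start < MAX_CID) {
        /* Round 1: every rank proposes its lowest free ID; MAX-reduce. */
        int proposal = -1;
        for (int r = 0; r < NRANKS; r++) {
            int local = lowest_free(used[r], start);
            if (local < 0)
                return -1;              /* some rank has no free ID left */
            if (local > proposal)
                proposal = local;
        }
        /* Round 2: every rank checks the winner is free locally; MIN-reduce. */
        bool all_ok = true;
        for (int r = 0; r < NRANKS; r++)
            if (used[r][proposal])
                all_ok = false;
        if (all_ok) {
            for (int r = 0; r < NRANKS; r++)
                used[r][proposal] = true;
            return proposal;
        }
        start = proposal + 1;           /* retry above the rejected ID */
    }
    return -1;
}
```

For example, if ranks 0 to 2 already hold ID 0 and rank 3 holds IDs 0 and 1, the first proposal is 2 and every rank accepts it. In the real implementation each of the two reductions is a collective, so every rank has to take part in the same sequence of allreduce calls for the loop to make progress.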

Here is the output I get from padb -axt:

main() at ?:?
  PMPI_Comm_dup() at pcomm_dup.c:62
    ompi_comm_dup() at communicator/comm.c:661
      -----------------
      [0,2] (2 processes)
      -----------------
      ompi_comm_nextcid() at communicator/comm_cid.c:264
        ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
          ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
            ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
              ompi_request_default_wait_all() at request/req_wait.c:262
                opal_condition_wait() at ../opal/threads/condition.h:99
      -----------------
      [1,3] (2 processes)
      -----------------
      ompi_comm_nextcid() at communicator/comm_cid.c:245
        ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
          ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
            ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
              ompi_request_default_wait_all() at request/req_wait.c:262
                opal_condition_wait() at ../opal/threads/condition.h:99
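All four ranks are blocked inside ompi_coll_tuned_allreduce_intra_recursivedoubling, which implements a recursive-doubling allreduce. The following rough single-process model illustrates that communication pattern. It is not Open MPI's implementation, and it assumes a sum operation and a power-of-two rank count. In round k, rank r exchanges partial results with rank r XOR 2^k. With 4 ranks, rank 0 pairs with rank 1 and then with rank 2. A rank cannot leave a round until its partner's contribution arrives. The names MAX_P and recursive_doubling_sum are hypothetical.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_P 64   /* hypothetical upper bound on simulated ranks */

/* Single-process model of a recursive-doubling allreduce (sum) over p
 * ranks, p a power of two. vals[r] holds rank r's contribution on entry
 * and the fully reduced result on return. Returns false if p is
 * unsupported (not a power of two, or out of range). */
bool recursive_doubling_sum(long vals[], int p)
{
    if (p <= 0 || p > MAX_P || (p & (p - 1)) != 0)
        return false;
    long next[MAX_P];
    for (int mask = 1; mask < p; mask <<= 1) {
        /* Every rank r combines its value with partner r ^ mask. In a real
         * run this is a send/receive pair, so a partner that never posts
         * the matching operation leaves rank r waiting in this round. */
        for (int r = 0; r < p; r++)
            next[r] = vals[r] + vals[r ^ mask];
        memcpy(vals, next, (size_t)p * sizeof vals[0]);
    }
    return true;
}
```

After log2(p) rounds every rank holds the full sum. For example, {1, 2, 3, 4} becomes 10 on all four ranks. Under this pattern, the first-round pairs (0,1) and (2,3) each span the two groups in the trace: ranks 0 and 2 are at comm_cid.c:264, while ranks 1 and 3 are at comm_cid.c:245.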

Thomas
> Off-topic, I know, but this is exactly the type of problem that padb is
> designed to help with. If you could get it to hang, run "padb -axt" in
> another window on the same node, and send along the output, I'm sure it
> would help.
>
> Ashley,