Ashley Pittman wrote:
> On Tue, 2009-09-08 at 15:00 +0200, Thomas Ropars wrote:
>
>> Hi,
>>
>> I'm working on r21949 of the trunk.
>>
>> When I run this simple program, which calls MPI_Comm_dup twice, on a single
>> node with 4 processes, the processes occasionally hang in the second dup.
>>
>
> I can't reproduce this. How often does it fail? I've run it in a loop
> hundreds of times here and not had one hang.
>
It happens once every 4 or 5 runs. And it also happens if the processes
are on different nodes.
Here is the output I get from padb -axt:
main() at ?:?
PMPI_Comm_dup() at pcomm_dup.c:62
ompi_comm_dup() at communicator/comm.c:661
-----------------
[0,2] (2 processes)
-----------------
ompi_comm_nextcid() at communicator/comm_cid.c:264
ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
ompi_request_default_wait_all() at request/req_wait.c:262
opal_condition_wait() at ../opal/threads/condition.h:99
-----------------
[1,3] (2 processes)
-----------------
ompi_comm_nextcid() at communicator/comm_cid.c:245
ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
ompi_request_default_wait_all() at request/req_wait.c:262
opal_condition_wait() at ../opal/threads/condition.h:99
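
For readers unfamiliar with the recursive-doubling allreduce that both groups are waiting in, here is a minimal sketch of the exchange pattern. It is a plain C simulation with no MPI calls, not the Open MPI code; the names rd_partner and rd_allreduce_sum are hypothetical, and it assumes a power-of-two number of ranks. In each round a rank exchanges its partial result with the rank whose number differs in one bit, and the distance doubles every round.

```c
#include <assert.h>

/* Partner of `rank` in the round with exchange distance `dist`
 * (dist = 1, 2, 4, ...). Hypothetical helper, not an Open MPI function. */
static int rd_partner(int rank, int dist)
{
    assert(rank >= 0 && dist > 0);
    return rank ^ dist;
}

/* Simulate a sum allreduce over n ranks (n a power of two, at most 64).
 * vals[i] is rank i's contribution on entry and the global sum on return.
 * Returns 0 on success, -1 if n is not supported by this sketch. */
static int rd_allreduce_sum(int *vals, int n)
{
    int tmp[64];

    if (n <= 0 || n > 64 || (n & (n - 1)) != 0)
        return -1;

    for (int dist = 1; dist < n; dist <<= 1) {
        /* Each rank "receives" its partner's current partial sum ... */
        for (int r = 0; r < n; r++)
            tmp[r] = vals[rd_partner(r, dist)];
        /* ... and combines it with its own before the next round. */
        for (int r = 0; r < n; r++)
            vals[r] += tmp[r];
    }
    return 0;
}
```

With 4 ranks this gives two rounds: ranks 0-1 and 2-3 exchange first, then 0-2 and 1-3. In the real implementation each exchange is a send/receive pair, so a rank sits in the wait call until its partner for the current round has posted the matching message.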
Thomas
> Off-topic I know but this is exactly the type of problem that padb is
> designed to help with, if you could get it to hang and then run "padb
> -axt" in another window on the same node and send along the output I'm
> sure it would be of help.
>
> Ashley,
>
>