
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] application hangs with multiple dup
From: Thomas Ropars (tropars_at_[hidden])
Date: 2009-09-10 08:14:10


Edgar Gabriel wrote:
> Two short questions: do you have any Open MPI MCA parameters set in a
> file or at runtime?
No
> And second, is there any difference if you disable the hierarch coll
> module (which does some additional communication as well)? e.g.
>
> mpirun --mca coll ^hierarch -np 4 ./mytest
No, there is no difference.

I don't know if it helps, but: I first ran into the problem when
launching bt.A.4 and sp.A.4 of the NAS Parallel Benchmarks (version 3.3).
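
As far as this thread describes it, the test is just repeated
communicator duplication. A minimal sketch of such a reproducer
(hypothetical; this is not the actual mytest source, which is not part
of this thread) would be run with "mpirun -np 4" and look roughly like:

/* Hypothetical minimal reproducer: duplicate MPI_COMM_WORLD twice.
 * The hang discussed in this thread is seen in the second MPI_Comm_dup(). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm dup1, dup2;

    MPI_Init(&argc, &argv);
    MPI_Comm_dup(MPI_COMM_WORLD, &dup1);   /* first dup completes */
    MPI_Comm_dup(MPI_COMM_WORLD, &dup2);   /* second dup is where the hang appears */
    printf("both dups completed\n");
    MPI_Comm_free(&dup2);
    MPI_Comm_free(&dup1);
    MPI_Finalize();
    return 0;
}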

Thomas
>
> Thanks
> Edgar
>
> Thomas Ropars wrote:
>> Ashley Pittman wrote:
>>> On Wed, 2009-09-09 at 17:44 +0200, Thomas Ropars wrote:
>>>
>>> Thank you. I think you missed the top three lines of the output, but
>>> that doesn't matter.
>>>
>>>
>>>> main() at ?:?
>>>> PMPI_Comm_dup() at pcomm_dup.c:62
>>>> ompi_comm_dup() at communicator/comm.c:661
>>>> -----------------
>>>> [0,2] (2 processes)
>>>> -----------------
>>>> ompi_comm_nextcid() at communicator/comm_cid.c:264
>>>> ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
>>>> ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
>>>> ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
>>>> ompi_request_default_wait_all() at request/req_wait.c:262
>>>> opal_condition_wait() at ../opal/threads/condition.h:99
>>>> -----------------
>>>> [1,3] (2 processes)
>>>> -----------------
>>>> ompi_comm_nextcid() at communicator/comm_cid.c:245
>>>> ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
>>>> ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
>>>> ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
>>>> ompi_request_default_wait_all() at request/req_wait.c:262
>>>> opal_condition_wait() at ../opal/threads/condition.h:99
>>>>
>>>
>>> Lines 264 and 245 of comm_cid.c are both inside a for loop which calls
>>> allreduce() twice per iteration until a certain condition is met. As
>>> such, it's hard to tell from this trace whether processes [0,2] are
>>> "ahead" or [1,3] are "behind". Either way you look at it, however, the
>>> allreduce() should not deadlock like that, so from the trace it is as
>>> likely to be a bug in the allreduce as in ompi_comm_nextcid().
>>>
>>> I assume all four processes are actually in the same call to comm_dup;
>>> re-compiling your program with -g and re-running padb would confirm
>>> this, as it would show the line numbers.
>>>
>> Yes, they are all in the second call to comm_dup.
>>
>> Thomas
>>> Ashley,
>>>
>>>
>>
>
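
For anyone following the trace: the loop Ashley refers to at
comm_cid.c:245/264 performs, per iteration, one allreduce to agree on a
candidate context ID and a second allreduce to check that every process
can actually use it, retrying until agreement is reached. A rough,
self-contained sketch of that shape (illustrative only, not the actual
Open MPI code; the function and variable names here are made up):

#include <mpi.h>

/* Illustrative sketch of the two-allreduce-per-iteration structure in
 * ompi_comm_nextcid() (communicator/comm_cid.c); this is why the hang
 * can show up at either of two line numbers in that file. */
int next_cid_sketch(MPI_Comm comm, int start)
{
    int candidate = start;

    for (;;) {
        int agreed;

        /* First allreduce: agree on a common candidate ID (take the max). */
        MPI_Allreduce(&candidate, &agreed, 1, MPI_INT, MPI_MAX, comm);

        /* Placeholder local check; the real code tests whether the ID is
         * already used by another communicator on this process. */
        int ok_local = 1;
        int ok_all;

        /* Second allreduce: verify that every process accepts the ID. */
        MPI_Allreduce(&ok_local, &ok_all, 1, MPI_INT, MPI_MIN, comm);

        if (ok_all) {
            return agreed;       /* everyone agreed: this is the new CID */
        }
        candidate = agreed + 1;  /* otherwise retry with a higher candidate */
    }
}

Since every iteration involves the whole communicator, a correct
allreduce should not leave [0,2] and [1,3] blocked at different points,
which is Ashley's point about the bug being as likely in the allreduce
as in ompi_comm_nextcid() itself.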