
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] application hangs with multiple dup
From: Edgar Gabriel (gabriel_at_[hidden])
Date: 2009-09-10 06:50:38


Two short questions: do you have any Open MPI MCA parameters set in a
file or at runtime? And second, is there any difference if you disable
the hierarch coll module (which performs additional communication as well), e.g.

mpirun --mca coll ^hierarch -np 4 ./mytest
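For reference, file-based MCA settings usually live in $HOME/.openmpi/mca-params.conf (or the system-wide openmpi-mca-params.conf); a hypothetical example of the kind of entry to check for:

```
# $HOME/.openmpi/mca-params.conf -- example entries only; any line here
# applies to every mpirun, so a stray coll/btl setting can change behavior
coll = ^hierarch
btl = tcp,self
```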

Thanks
Edgar

Thomas Ropars wrote:
> Ashley Pittman wrote:
>> On Wed, 2009-09-09 at 17:44 +0200, Thomas Ropars wrote:
>>
>> Thank you. I think you missed the top three lines of the output but
>> that doesn't matter.
>>
>>
>>> main() at ?:?
>>> PMPI_Comm_dup() at pcomm_dup.c:62
>>> ompi_comm_dup() at communicator/comm.c:661
>>> -----------------
>>> [0,2] (2 processes)
>>> -----------------
>>> ompi_comm_nextcid() at communicator/comm_cid.c:264
>>> ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
>>> ompi_coll_tuned_allreduce_intra_dec_fixed() at
>>> coll_tuned_decision_fixed.c:61
>>> ompi_coll_tuned_allreduce_intra_recursivedoubling() at
>>> coll_tuned_allreduce.c:223
>>> ompi_request_default_wait_all() at request/req_wait.c:262
>>> opal_condition_wait() at ../opal/threads/condition.h:99
>>> -----------------
>>> [1,3] (2 processes)
>>> -----------------
>>> ompi_comm_nextcid() at communicator/comm_cid.c:245
>>> ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
>>> ompi_coll_tuned_allreduce_intra_dec_fixed() at
>>> coll_tuned_decision_fixed.c:61
>>> ompi_coll_tuned_allreduce_intra_recursivedoubling() at
>>> coll_tuned_allreduce.c:223
>>> ompi_request_default_wait_all() at request/req_wait.c:262
>>> opal_condition_wait() at ../opal/threads/condition.h:99
>>>
>>
>> Lines 264 and 245 of comm_cid.c are both in a for loop which calls
>> allreduce() twice per iteration until a certain condition is met. As
>> such, it's hard to tell from this trace whether processes [0,2] are
>> "ahead" or [1,3] are "behind". Either way you look at it, however, the
>> allreduce() should not deadlock like that, so from the trace it's as
>> likely to be a bug in the reduce as in ompi_comm_nextcid().
>>
>> I assume all four processes are actually in the same call to comm_dup;
>> re-compiling your program with -g and re-running padb would confirm
>> this, as it would then show the line numbers.
>>
> Yes they are all in the second call to comm_dup.
>
> Thomas
>> Ashley,
>>
>>
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab      http://pstl.cs.uh.edu
Department of Computer Science          University of Houston
Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335