Open MPI Development Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI devel] application hangs with multiple dup
From: Edgar Gabriel (gabriel_at_[hidden])
Date: 2009-09-16 18:01:46


just wanted to give a heads-up that I *think* I know what the problem
is. I should have a fix (with a description) either later today or
tomorrow morning...

Thanks
Edgar

Edgar Gabriel wrote:
> so I can confirm that I can reproduce the hang, and we (George, Rainer
> and I) have looked into it and are continuing to dig.
>
> I hate to say it, but it looked to us as if messages were 'lost'
> (the sender clearly called send, but the data is not in any of the queues
> on the receiver side), which seems to be consistent with two other bug
> reports currently being discussed on the mailing list. I could reproduce
> the hang with both sm and tcp, so it's probably not a btl issue but
> something higher up.
>
> Thanks
> Edgar
>
> Thomas Ropars wrote:
>> Edgar Gabriel wrote:
>>> Two short questions: do you have any Open MPI MCA parameters set in a
>>> file or at runtime?
>> No
>>> And second, is there any difference if you disable the hierarch coll
>>> module (which performs additional communication of its own)? E.g.:
>>>
>>> mpirun --mca coll ^hierarch -np 4 ./mytest
>> No, there is no difference.
>>
>> I don't know if it helps, but I first hit the problem when
>> launching bt.A.4 and sp.A.4 from the NAS Parallel Benchmarks (version 3.3).
>>
>> Thomas
>>>
>>> Thanks
>>> Edgar
>>>
>>> Thomas Ropars wrote:
>>>> Ashley Pittman wrote:
>>>>> On Wed, 2009-09-09 at 17:44 +0200, Thomas Ropars wrote:
>>>>>
>>>>> Thank you. I think you missed the top three lines of the output but
>>>>> that doesn't matter.
>>>>>
>>>>>
>>>>>> main() at ?:?
>>>>>> PMPI_Comm_dup() at pcomm_dup.c:62
>>>>>> ompi_comm_dup() at communicator/comm.c:661
>>>>>> -----------------
>>>>>> [0,2] (2 processes)
>>>>>> -----------------
>>>>>> ompi_comm_nextcid() at communicator/comm_cid.c:264
>>>>>> ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
>>>>>> ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
>>>>>> ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
>>>>>> ompi_request_default_wait_all() at request/req_wait.c:262
>>>>>> opal_condition_wait() at ../opal/threads/condition.h:99
>>>>>> -----------------
>>>>>> [1,3] (2 processes)
>>>>>> -----------------
>>>>>> ompi_comm_nextcid() at communicator/comm_cid.c:245
>>>>>> ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
>>>>>> ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
>>>>>> ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
>>>>>> ompi_request_default_wait_all() at request/req_wait.c:262
>>>>>> opal_condition_wait() at ../opal/threads/condition.h:99
>>>>>>
>>>>>
>>>>> Lines 264 and 245 of comm_cid.c are both inside a for loop that calls
>>>>> allreduce() twice per iteration until a certain condition is met. As
>>>>> such, it's hard to tell from this trace whether processes [0,2] are
>>>>> "ahead" or [1,3] are "behind". Either way, the allreduce() should not
>>>>> deadlock like that, so from the trace the bug is as likely to be in
>>>>> the allreduce as in ompi_comm_nextcid().
>>>>>
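The loop described above can be sketched as a small single-process model. This is only an illustration of the general shape (two reductions per iteration, repeated until every rank accepts the same context ID), not the Open MPI source: the function names are hypothetical, and the choice of MAX for the first reduction and MIN for the second is an assumption made for the sketch.

```python
def allreduce(values, op):
    """Model an allreduce: every rank receives op() applied to all contributions."""
    result = op(values)
    return [result] * len(values)


def lowest_free_cid(used, start):
    """Return the smallest CID >= start that this rank has not used yet."""
    cid = start
    while cid in used:
        cid += 1
    return cid


def agree_on_cid(used_per_rank, max_rounds=100):
    """Return (cid, rounds): a CID free on every rank, and the rounds it took.

    used_per_rank is a list of sets, one per simulated rank (hypothetical model).
    """
    start = 0
    for rounds in range(1, max_rounds + 1):
        # First reduction (assumed MAX): each rank proposes its lowest free CID.
        proposals = [lowest_free_cid(used, start) for used in used_per_rank]
        candidate = allreduce(proposals, max)[0]

        # Second reduction (assumed MIN): each rank votes 1 if the candidate
        # is free locally, 0 otherwise; a single 0 forces another round.
        votes = [0 if candidate in used else 1 for used in used_per_rank]
        agreed = allreduce(votes, min)[0]

        if agreed == 1:
            for used in used_per_rank:
                used.add(candidate)
            return candidate, rounds
        start = candidate + 1
    raise RuntimeError("no agreement reached")
```

In this model every rank takes part in both reductions of every round, so each rank alternates between the two call sites. A trace that shows two ranks waiting at one call site and two at the other therefore does not say which pair has moved on and which is still in the previous reduction.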
>>>>> I assume all four processes are actually in the same call to comm_dup.
>>>>> Re-compiling your program with -g and re-running padb would confirm
>>>>> this, as it would show the line numbers.
>>>>>
>>>> Yes, they are all in the second call to comm_dup.
>>>>
>>>> Thomas
>>>>> Ashley,
>>>>>
>>>>>
>>>>
>>>> _______________________________________________
>>>> devel mailing list
>>>> devel_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>

-- 
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab      http://pstl.cs.uh.edu
Department of Computer Science          University of Houston
Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335