
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] application hangs with multiple dup
From: Thomas Ropars (tropars_at_[hidden])
Date: 2009-09-15 16:26:17


Hi,

Any news about that bug?
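
For reference, the pattern in question is roughly the following. This is
only a sketch of the "multiple dup" case from the subject line, not the
original test program, and the file and variable names are illustrative:

/* Sketch only: two successive MPI_Comm_dup calls, run on 4 ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm dup1, dup2;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Comm_dup(MPI_COMM_WORLD, &dup1);  /* first dup completes */
    MPI_Comm_dup(MPI_COMM_WORLD, &dup2);  /* second dup is where the hang shows up */

    printf("rank %d: both dups completed\n", rank);

    MPI_Comm_free(&dup2);
    MPI_Comm_free(&dup1);
    MPI_Finalize();
    return 0;
}

When it hangs, all four processes sit inside the second MPI_Comm_dup, as
in the padb trace further down the thread.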

Thomas

Edgar Gabriel wrote:
> So I can confirm that I can reproduce the hang, and we (George, Rainer
> and I) have looked into it and are continuing to dig.
>
> I hate to say it, but it looked to us as if messages were 'lost' (the
> sender clearly called send, but the data is not in any of the queues
> on the receiver side), which seems to be consistent with two other bug
> reports currently being discussed on the mailing list. I could
> reproduce the hang with both sm and tcp, so it's probably not a btl
> issue but something higher up.
>
> Thanks
> Edgar
>
> Thomas Ropars wrote:
>> Edgar Gabriel wrote:
>>> Two short questions: do you have any Open MPI MCA parameters set in
>>> a file or at runtime?
>> No
>>> And second, is there any difference if you disable the hierarch coll
>>> module (which does some additional communication of its own)? E.g.:
>>>
>>> mpirun --mca coll ^hierarch -np 4 ./mytest
>> No, there is no difference.
>>
>> I don't know if it helps, but I first had the problem when launching
>> bt.A.4 and sp.A.4 of the NAS Parallel Benchmarks (version 3.3).
>>
>> Thomas
>>>
>>> Thanks
>>> Edgar
>>>
>>> Thomas Ropars wrote:
>>>> Ashley Pittman wrote:
>>>>> On Wed, 2009-09-09 at 17:44 +0200, Thomas Ropars wrote:
>>>>>
>>>>> Thank you. I think you missed the top three lines of the output but
>>>>> that doesn't matter.
>>>>>
>>>>>
>>>>>> main() at ?:?
>>>>>> PMPI_Comm_dup() at pcomm_dup.c:62
>>>>>> ompi_comm_dup() at communicator/comm.c:661
>>>>>> -----------------
>>>>>> [0,2] (2 processes)
>>>>>> -----------------
>>>>>> ompi_comm_nextcid() at communicator/comm_cid.c:264
>>>>>> ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
>>>>>> ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
>>>>>> ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
>>>>>> ompi_request_default_wait_all() at request/req_wait.c:262
>>>>>> opal_condition_wait() at ../opal/threads/condition.h:99
>>>>>> -----------------
>>>>>> [1,3] (2 processes)
>>>>>> -----------------
>>>>>> ompi_comm_nextcid() at communicator/comm_cid.c:245
>>>>>> ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
>>>>>> ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
>>>>>> ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
>>>>>> ompi_request_default_wait_all() at request/req_wait.c:262
>>>>>> opal_condition_wait() at ../opal/threads/condition.h:99
>>>>>>
>>>>>
>>>>> Lines 264 and 245 of comm_cid.c are both inside a for loop which
>>>>> calls allreduce() twice per iteration until a certain condition is
>>>>> met. As such it's hard to tell from this trace whether processes
>>>>> [0,2] are "ahead" or [1,3] are "behind". Either way you look at it,
>>>>> however, the allreduce() should not deadlock like that, so from the
>>>>> trace it's as likely to be a bug in the reduce as in
>>>>> ompi_comm_nextcid().
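>>>>>
>>>>> To make that shape concrete, the structure is roughly the following
>>>>> (a sketch of the loop described above, not the actual comm_cid.c
>>>>> code; the helper names are made up):
>>>>>
>>>>> while (!done) {
>>>>>     do_allreduce(comm);       /* first collective of the iteration  */
>>>>>     do_allreduce(comm);       /* second collective of the iteration */
>>>>>     done = condition_met();   /* loop exits only once this holds    */
>>>>> }
>>>>>
>>>>> A snapshot alone cannot tell which of the two collectives, or which
>>>>> iteration, each pair of processes is currently in.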
>>>>>
>>>>> I assume all four processes are actually in the same call to
>>>>> comm_dup; re-compiling your program with -g and re-running padb
>>>>> would confirm this, as it would show the line numbers.
>>>>>
>>>> Yes, they are all in the second call to comm_dup.
>>>>
>>>> Thomas
>>>>> Ashley,
>>>>>
>>>>>
>>>>
>>>
>>
>