Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] application hangs with multiple dup
From: Ashley Pittman (ashley_at_[hidden])
Date: 2009-09-09 13:37:35


On Wed, 2009-09-09 at 17:44 +0200, Thomas Ropars wrote:

Thank you. I think you missed the top three lines of the output, but
that doesn't matter.

> main() at ?:?
> PMPI_Comm_dup() at pcomm_dup.c:62
> ompi_comm_dup() at communicator/comm.c:661
> -----------------
> [0,2] (2 processes)
> -----------------
> ompi_comm_nextcid() at communicator/comm_cid.c:264
> ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
> ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
> ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
> ompi_request_default_wait_all() at request/req_wait.c:262
> opal_condition_wait() at ../opal/threads/condition.h:99
> -----------------
> [1,3] (2 processes)
> -----------------
> ompi_comm_nextcid() at communicator/comm_cid.c:245
> ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
> ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
> ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
> ompi_request_default_wait_all() at request/req_wait.c:262
> opal_condition_wait() at ../opal/threads/condition.h:99

Lines 264 and 245 of comm_cid.c are both inside a for loop which calls
allreduce() twice per iteration until a certain condition is met. As
such it's hard to tell from this trace whether processes [0,2] are
"ahead" or [1,3] are "behind". Either way you look at it, though, the
allreduce() should not deadlock like that, so from the trace it's as
likely to be a bug in the reduce as it is in ompi_comm_nextcid().
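
For anyone who hasn't looked at that code, the pattern in question is
roughly the one in the toy program below. This is only a sketch of the
structure from memory, not the actual comm_cid.c code, and
cid_is_free() and the rest of the scaffolding are made up purely for
illustration; it just shows the two allreduce() calls per iteration
that the two sets of stacks above are sitting in.

/*
 * Sketch of the context-id negotiation pattern (illustrative only,
 * not the actual comm_cid.c code).  Build with e.g.: mpicc -g cid.c
 */
#include <mpi.h>
#include <stdio.h>

/* Stand-in for "is this context id still free on this process?".
 * Made up so that, with two or more ranks, the loop needs two passes. */
static int cid_is_free(int cid, int rank)
{
    if (rank == 0 && cid == 1) return 0;
    if (rank == 1 && cid == 0) return 0;
    return 1;
}

int main(int argc, char **argv)
{
    int rank, start = 0, done = 0, nextcid = -1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    while (!done) {
        int local = start;
        while (!cid_is_free(local, rank))   /* lowest id free locally */
            local++;

        /* allreduce #1: agree on a common candidate context id */
        MPI_Allreduce(&local, &nextcid, 1, MPI_INT, MPI_MAX,
                      MPI_COMM_WORLD);

        /* allreduce #2: did every process accept that candidate? */
        int flag = cid_is_free(nextcid, rank), allagree;
        MPI_Allreduce(&flag, &allagree, 1, MPI_INT, MPI_MIN,
                      MPI_COMM_WORLD);

        if (allagree)
            done = 1;                 /* everyone takes nextcid */
        else
            start = nextcid + 1;      /* retry from a higher id */
    }

    printf("rank %d agreed on cid %d\n", rank, nextcid);
    MPI_Finalize();
    return 0;
}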

I assume all four processes are actually in the same call to comm_dup;
re-compiling your program with -g and re-running padb would confirm
this, as it would show the line numbers.

Ashley,

-- 
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk