Thanks! I would definitely be interested and will look at the tool.
Meantime, I have filed a bunch of data on this in ticket #1944, so perhaps
you might take a glance at that and offer some thoughts?
Will be back after I look at the tool.
On Wed, Jun 10, 2009 at 8:51 AM, Ashley Pittman <ashley_at_[hidden]> wrote:
> If I may say this is exactly the type of problem the tool I have been
> working on recently aims to help with and I'd be happy to help you
> through it.
> Firstly I'd say that of the three collectives you mention, MPI_Allgather
> exhibits a many-to-many, MPI_Reduce a many-to-one and MPI_Bcast a
> one-to-many communication pattern. The scenario of a root process
> falling behind and getting swamped in comms is a plausible one for
> MPI_Reduce only but doesn't hold water for the other two. You also
> don't mention whether the loop is over a single collective or whether
> you have a loop calling a number of different collectives each iteration.
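To make the MPI_Reduce case concrete, here is a minimal sketch (an
illustration only; the iteration count and the usleep() delay on rank 0
are assumptions, not details from the hanging application) of a loop in
which the root lags behind while every other rank keeps feeding it
contributions:

    /* Illustration: a tight loop over MPI_Reduce, the one collective of
     * the three where a root that falls behind can get swamped by
     * contributions from every other rank. */
    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank, i;
        double local, sum;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        local = (double)rank;

        for (i = 0; i < 1000; i++) {
            if (rank == 0)
                usleep(1000); /* rank 0 (the root) artificially lags */
            /* every rank sends a contribution; only the root receives
             * the reduced result */
            MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0,
                       MPI_COMM_WORLD);
        }

        if (rank == 0)
            printf("sum = %f\n", sum);
        MPI_Finalize();
        return 0;
    }
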
> padb, the tool I've been working on, has the ability to look at parallel
> jobs and report on the state of collective comms, and should help you
> narrow in on erroneous processes and those simply blocked waiting for
> comms. I'd recommend using it to look at maybe four or five instances
> where the application has hung and to look for any common features
> between them.
> Let me know if you are willing to try this route and I'll talk you
> through it. The code is downloadable from http://padb.pittman.org.uk
> and if you want the full collective functionality you'll need to patch
> Open MPI with the patch from
> Ashley Pittman, Bath, UK.
> Padb - A parallel job inspection tool for cluster computing