Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Hang in collectives involving shared memory
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-06-10 11:07:25


Hi Ashley

Thanks! I would definitely be interested and will look at the tool.
Meantime, I have filed a bunch of data on this in ticket #1944, so perhaps
you might take a glance at that and offer some thoughts?

https://svn.open-mpi.org/trac/ompi/ticket/1944

Will be back after I look at the tool.

Thanks again
Ralph

On Wed, Jun 10, 2009 at 8:51 AM, Ashley Pittman <ashley_at_[hidden]> wrote:

>
> Ralph,
>
> If I may say so, this is exactly the type of problem that the tool I
> have been working on recently aims to help with, and I'd be happy to
> help you through it.
>
> Firstly, I'd say that of the three collectives you mention
> (MPI_Allgather, MPI_Reduce and MPI_Bcast), one exhibits a many-to-many,
> one a many-to-one and the last a one-to-many communication pattern. The
> scenario of a root process falling behind and getting swamped in comms
> is plausible for MPI_Reduce only; it doesn't hold water for the other
> two. You also don't mention whether the loop is over a single collective
> or whether you have a loop calling a number of different collectives
> each iteration.
>
> padb, the tool I've been working on, can look at parallel jobs and
> report on the state of collective comms. It should help you tell
> erroneous processes apart from those that are simply blocked waiting
> for comms. I'd recommend using it to look at maybe four or five
> instances where the application has hung and look for any common
> features between them.
>
> Let me know if you are willing to try this route and I'll talk you
> through it. The code is downloadable from http://padb.pittman.org.uk
> and if you want the full collective functionality you'll need to patch
> Open MPI with the patch from http://padb.pittman.org.uk/extensions.html
>
> Ashley
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
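
The point about communication patterns in the quoted message can be illustrated with a toy model. The sketch below is not how Open MPI implements these collectives (real implementations typically use trees, shared-memory segments and other optimisations); it assumes a naive flat algorithm in which every logical transfer is one point-to-point message, and the function name `inbound_messages` is hypothetical. It only counts how many messages each rank would receive per call, which shows why a slow root could be swamped in the many-to-one case but not in the other two.

```python
from collections import Counter

PATTERNS = ("many-to-many", "many-to-one", "one-to-many")


def inbound_messages(pattern, nprocs, root=0):
    """Count messages received per rank for one collective call.

    Assumes a flat model: one point-to-point message per logical
    transfer, no trees, no shared memory. Illustration only.
    """
    if pattern not in PATTERNS:
        raise ValueError(f"unknown pattern: {pattern}")
    received = Counter({rank: 0 for rank in range(nprocs)})
    for src in range(nprocs):
        for dst in range(nprocs):
            if src == dst:
                continue  # a rank does not send to itself
            if pattern == "many-to-many":
                # e.g. MPI_Allgather: everyone sends to everyone
                received[dst] += 1
            elif pattern == "many-to-one" and dst == root:
                # e.g. MPI_Reduce: everyone sends to the root
                received[dst] += 1
            elif pattern == "one-to-many" and src == root:
                # e.g. MPI_Bcast: the root sends to everyone
                received[dst] += 1
    return dict(received)


if __name__ == "__main__":
    for name in PATTERNS:
        print(name, inbound_messages(name, nprocs=4))
```

In this model only the many-to-one pattern concentrates all inbound traffic on a single rank; in the one-to-many case the root receives nothing, and in the many-to-many case the load is spread evenly.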