Open MPI logo

Open MPI Development Mailing List Archives

  |   Home   |   Support   |   FAQ   |  

This web mail archive is frozen.

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Click here to be taken to the new web archives of this list; it includes all the mails that are in this frozen archive plus all new mails that have been sent to the list since it was migrated to the new archives.

Subject: Re: [OMPI devel] Hang in collectives involving shared memory
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-06-10 11:07:25


Hi Ashley

Thanks! I would definitely be interested and will look at the tool.
Meantime, I have filed a bunch of data on this in ticket #1944, so perhaps
you might take a glance at that and offer some thoughts?

https://svn.open-mpi.org/trac/ompi/ticket/1944

Will be back after I look at the tool.

Thanks again
Ralph

On Wed, Jun 10, 2009 at 8:51 AM, Ashley Pittman <ashley_at_[hidden]>wrote:

>
> Ralph,
>
> If I may say this is exactly the type of problem the tool I have been
> working on recently aims to help with and I'd be happy to help you
> through it.
>
> Firstly I'd say of the three collectives you mention, MPI_Allgather,
> MPI_Reduce and MPI_Bcast one exhibit a many-to-many, one a many-to-one
> and the last a many-to-one communication pattern. The scenario of a
> root process falling behind and getting swamped in comms is a plausible
> one for MPI_Reduce only but doesn't hold water with the other two. You
> also don't mention if the loop is over a single collective or if you
> have loop calling a number of different collectives each iteration.
>
> padb, the tool I've been working on has the ability to look at parallel
> jobs and report on the state of collective comms and should help narrow
> you down on erroneous processes and those simply blocked waiting for
> comms. I'd recommend using it to look at maybe four or five instances
> where the application has hung and look for any common features between
> them.
>
> Let me know if you are willing to try this route and I'll talk, the code
> is downloadable from http://padb.pittman.org.uk and if you want the full
> collective functionality you'll need to patch openmp with the patch from
> http://padb.pittman.org.uk/extensions.html
>
> Ashley,
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>