On Tue, 2009-06-16 at 13:39 -0600, Bryan Lally wrote:
> Ashley Pittman wrote:
> > Whilst the fact that it appears to only happen on your machine implies
> > it's not a general problem with OpenMPI the fact that it happens in the
> > same location/rep count every time does swing the blame back the other
> > way.
> This sounds a _lot_ like the problem I was seeing, my initial message is
> appended here. If it's the same thing, then it's not only on the big
> machines here that Ralph was talking about, but on very vanilla Fedora 7
> and 9 boxes.
> I was able to hang Ralph's reproducer on an 8 core Dell, Fedora 9,
> kernel 2.6.27(.4-78.2.53.fc9.x86_64).
> I don't think it's just the one machine and its configuration.
Interesting. In Ralph's case, the hangs I've seen are where the
application calls Bcast but the MPI library calls Barrier below this (it
does this every 1000 collectives, apparently). It could be that any call
to Barrier at this point would hang, or it could be something special
about the subverted call that is causing the problem.
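To make clear what I mean by a "subverted" call, here is a rough sketch of the behaviour as I understand it. It is an illustration only, not Open MPI source: the class and function names are made up, the barrier is a stand-in callable, and placing the barrier before the broadcast rather than after it is my assumption.

```python
# Hypothetical sketch: names and structure are illustrative, not Open MPI code.
class SyncWrapper:
    """Wraps a collective and injects a barrier every `interval` calls."""

    def __init__(self, barrier, interval=1000):
        self.barrier = barrier      # callable standing in for the library's barrier
        self.interval = interval    # 1000 is the figure mentioned above
        self.count = 0              # collectives seen since the last barrier

    def bcast(self, real_bcast, *args):
        self.count += 1
        if self.count == self.interval:
            self.count = 0
            self.barrier()          # the application never asked for this call
        return real_bcast(*args)    # assumption: barrier runs before the bcast


barriers = []
sync = SyncWrapper(barrier=lambda: barriers.append("barrier"), interval=1000)
for i in range(2500):
    sync.bcast(lambda x: x, i)      # identity function stands in for the real Bcast
print(len(barriers))                # 2: one at call 1000, one at call 2000
```

If something like this is what happens, a hang at a fixed rep count is what you would expect: the application sees itself stuck in Bcast, but every process is actually sitting in the injected barrier.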
Do you have a stack trace of your hung application to hand? In
particular, when you say "All
processes have made the same call to MPI_Allreduce. The processes are
all in opal_progress, called (with intervening calls) by MPI_Allreduce.",
do the intervening calls include mca_coll_sync_bcast?
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing