On Wed, 2009-06-10 at 09:07 -0600, Ralph Castain wrote:
> Hi Ashley
> Thanks! I would definitely be interested and will look at the tool.
> Meantime, I have filed a bunch of data on this in ticket #1944, so
> perhaps you might take a glance at that and offer some thoughts?
> Will be back after I look at the tool.
Have you made any progress?
Whilst the fact that it appears to happen only on your machine implies
it's not a general problem with Open MPI, the fact that it hangs at the
same location and repetition count every time does swing the blame back
the other way. Perhaps it's some special configure or runtime option you
are setting? One thing that springs to mind is that the NUMA maps could
be exposing a timing problem with the shared-memory calls, although that
doesn't sit well with it always failing on the same iteration.
Can you provide stack traces from when it's hung, and, crucially, are
they the same for every hang? If you change the coll_sync_barrier_before
value so that it hangs on a different repetition, does that change the
stack trace at all?
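For reference, this is roughly what I have in mind (the application name, rank count and the value 100 are placeholders; coll_sync_barrier_before is the MCA parameter from your ticket, and the gdb invocation assumes you can log in to the node and find the hung rank's pid yourself):

```shell
# Re-run with the sync barrier inserted earlier, e.g. before repetition 100,
# to see whether the hang moves:
mpirun -mca coll_sync_barrier_before 100 -np 64 ./my_app

# Once it hangs, attach to a suspect rank on its node and dump all threads:
gdb -p "$PID" -batch -ex "thread apply all bt"
```

Comparing the per-thread backtraces across several hangs (and across different coll_sync_barrier_before values) is what would tell us whether it's always the same code path that's stuck.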
Likewise, once you have applied the collectives patch, is the collective
state the same for every hang, and how does it differ when you change
the coll_sync_barrier_before variable?
It would be useful to see stack traces and collective state from the
three collectives you report as causing problems (MPI_Bcast, MPI_Reduce
and MPI_Allgather) because, as I said before, these three collectives
have radically different communication patterns.
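To give a rough feel for how different those patterns are, here is a toy sketch in plain Python counting communication rounds, assuming the textbook binomial-tree algorithm for broadcast/reduce and a ring for allgather; Open MPI's actual algorithm selection varies with message and communicator size, so treat this purely as illustration:

```python
import math

def bcast_rounds(n):
    # Binomial-tree broadcast: the root's data reaches all n ranks
    # in ceil(log2(n)) rounds, with traffic fanning out from the root.
    return math.ceil(math.log2(n))

def reduce_rounds(n):
    # Tree reduction mirrors the broadcast tree, but data flows
    # toward the root and is combined at each step.
    return math.ceil(math.log2(n))

def allgather_rounds(n):
    # Ring allgather: every rank forwards a piece to its neighbour in
    # every one of n - 1 rounds, so all ranks send concurrently --
    # a much heavier, more symmetric load on the interconnect.
    return n - 1

for n in (4, 16, 64):
    print(n, bcast_rounds(n), reduce_rounds(n), allgather_rounds(n))
# 64 ranks: 6 rounds for bcast/reduce versus 63 for a ring allgather
```

The point being that a timing bug in the shared-memory transport would be expected to show up quite differently under these three traffic shapes, which is why seeing all three sets of traces matters.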
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing