On Mon, 2009-05-18 at 17:05 -0400, Noam Bernstein wrote:
> The code is complicated, the input files are big and lead to long
> times, so I don't think I'll be able to make a simple test case.
> I attached to the hanging processes (all 8 of them) with gdb
> during the hang. The stack trace is below. Nodes seem to spend most of
> their time in the btl_openib_component_progress(), and occasionally in
> mca_pml_ob1_progress(). I.e. not completely stuck, but not making
> progress.
Can you confirm that *all* processes are in PMPI_Allreduce at some
point? The collectives commonly get blamed for a lot of hangs, and
they're not always the correct place to look.
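One way to check this (a sketch only; the PIDs, file names, and the
sample backtrace contents below are hypothetical stand-ins for what
you'd get by attaching gdb to each rank with something like
`gdb -batch -p <pid> -ex "thread apply all bt" > bt.<pid>.txt`) is to
save one backtrace per process and count how many are actually inside
the collective:

```shell
# Work in a scratch directory; these two files stand in for real
# per-rank gdb backtrace dumps (bt.<pid>.txt).
mkdir -p /tmp/btdemo && cd /tmp/btdemo
printf '#3  PMPI_Allreduce ()\n' > bt.101.txt
printf '#3  mca_pml_ob1_progress ()\n' > bt.102.txt

# Count how many ranks are sitting in the collective itself:
grep -l PMPI_Allreduce bt.*.txt | wc -l
```

If the count is less than the total number of ranks, the hang is
likely elsewhere (e.g. one rank never entered the Allreduce), and the
collective is just where everyone else is waiting.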
> P.S. I get a similar hang with MVAPICH, in a nearby but different part
> of the
> code (on an MPI_Bcast, specifically), increasing my tendency to believe
> that it's OFED's fault. But maybe the stack trace will suggest
> where it might be stuck, and therefore perhaps an mca flag to try?
This strikes me as a filesystem problem more than MPI per se. Again,
with MVAPICH, are all your processes in MPI_Bcast, or just some of them?