This is my first post to the openmpi developers list. I think I've run
across a race condition in your latest release. Since my demonstrator
is somewhat large and cumbersome, I'd like to know if you already know
about this issue before we start the process of providing code and details.
Basics: Open MPI 1.3.2, Fedora 9, two quad-core x86_64 CPUs in one machine.
Symptoms: our code hangs, always in the same vicinity, usually at the
same place, 10-25% of the time. Sometimes more often, sometimes less.
Our code has run reliably with many MPI implementations for years. We
haven't added anything recently that is a likely culprit. While we have
our own issues, this doesn't feel like one of ours.
We see that there is new code in the shared memory transport between
1.3.1 and 1.3.2. Our code doesn't hang with 1.3.1 (nor 1.2.9), only with 1.3.2.
If we switch to tcp for transport (with mpirun --mca btl tcp,self ...)
we don't see any hangs. Running using --mca btl sm,self results in hangs.
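
For anyone who wants to compare hang rates between the two transports, a
rough sketch of a trial loop follows. It assumes GNU timeout(1) is
available and treats any run that exceeds a time limit as a hang. The
function name count_hangs, the executable ./our_app, the trial count, and
the time limit are placeholders for illustration, not details of our
actual setup.

```shell
#!/bin/sh
# count_hangs TRIALS LIMIT_SECONDS COMMAND [ARGS...]
# Runs COMMAND TRIALS times. A run that exceeds LIMIT_SECONDS is stopped
# by timeout(1), which then exits with status 124; each such run is
# counted as a hang. Prints the number of hangs.
count_hangs() {
    trials=$1
    limit=$2
    shift 2
    hangs=0
    i=0
    while [ "$i" -lt "$trials" ]; do
        timeout "$limit" "$@" >/dev/null 2>&1
        if [ $? -eq 124 ]; then
            hangs=$((hangs + 1))
        fi
        i=$((i + 1))
    done
    echo "$hangs"
}

# Hypothetical usage (./our_app, 100 trials, and 600 s are placeholders):
#   count_hangs 100 600 mpirun -np 4 --mca btl sm,self  ./our_app
#   count_hangs 100 600 mpirun -np 4 --mca btl tcp,self ./our_app
```

The time limit should be comfortably longer than a normal run, so that a
slow but healthy run is not miscounted as a hang.
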
If we sprinkle a few calls (3) to MPI_Barrier in the vicinity of the
problem, we no longer see hangs.
We demonstrate this with 4 processes. When we attach a debugger to the
hung processes, we see that the hang results from an MPI_Allreduce. All
processes have made the same call to MPI_Allreduce. The processes are
all in opal_progress, called (with intervening calls) by MPI_Allreduce.
My question is, have you seen anything like this before? If not, what
do we do next?
Bryan Lally, lally_at_[hidden]
Los Alamos National Laboratory
Los Alamos, New Mexico