Bryan Lally wrote:
> I think I've run across a race condition in your latest release.
> Since my demonstrator is somewhat large and cumbersome, I'd like to
> know if you already know about this issue before we start the process
> of providing code and details.
>
> Basics: openmpi 1.3.2, Fedora 9, 2 x86_64 quad-core cpus in one machine.
> Symptoms: our code hangs, always in the same vicinity, usually at the
> same place, 10-25% of the time. Sometimes more often, sometimes less.
> Our code has run reliably with many MPI implementations for years. We
> haven't added anything recently that is a likely culprit. While we
> have our own issues, this doesn't feel like one of ours.
> We see that there is new code in the shared memory transport between
> 1.3.1 and 1.3.2. Our code doesn't hang with 1.3.1 (nor 1.2.9). Only
> with 1.3.2.
> If we switch to tcp for transport (with mpirun --mca btl tcp,self ...)
> we don't see any hangs. Running using --mca btl sm,self results in
> hangs.
> If we sprinkle a few calls (3) to MPI_Barrier in the vicinity of the
> problem, we no longer see hangs.
>
> We demonstrate this with 4 processes. When we attach a debugger to
> the hung processes, we see that the hang results from an
> MPI_Allreduce. All processes have made the same call to
> MPI_Allreduce. The processes are all in opal_progress, called (with
> intervening calls) by MPI_Allreduce.
>
> My question is, have you seen anything like this before? If not, what
> do we do next?

Another user reports something somewhat similar at
http://www.open-mpi.org/community/lists/users/2009/04/9154.php . That
problem seems to be associated with GCC 4.4.0. What compiler are you using?

In some test runs, we see some MPI_Allreduce hangs, but only after about
40K trials (rather than 10-25% of the time).

So, it may be that others have seen what you are seeing, but we don't (I
don't) currently understand what's going on.
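
In case it helps narrow things down, here is a rough sketch of a harness that reruns a job many times under each transport and counts the runs that fail to finish in time, so the sm and tcp hang rates can be compared directly. The `--mca btl sm,self` and `--mca btl tcp,self` selections are the ones from your report; the function names, the trial count, the timeout, and the application path are all placeholders I made up, not anything from Open MPI itself.

```python
import subprocess


def build_cmd(btl, nprocs, app_argv):
    """Build an mpirun command line that restricts Open MPI to one transport."""
    # "--mca btl <name>,self" is the selection syntax quoted in the report.
    return ["mpirun", "-np", str(nprocs), "--mca", "btl", f"{btl},self", *app_argv]


def count_hangs(btl, nprocs, app_argv, trials, timeout_s):
    """Run the job `trials` times and count runs that exceed `timeout_s` seconds.

    Assumption: a run that has not finished within `timeout_s` is treated as
    hung. subprocess.run kills mpirun on timeout, but leftover ranks may still
    need to be cleaned up by hand.
    """
    hangs = 0
    for _ in range(trials):
        try:
            subprocess.run(
                build_cmd(btl, nprocs, app_argv),
                timeout=timeout_s,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            hangs += 1
    return hangs
```

Calling, say, `count_hangs("sm", 4, ["./your_app"], trials=100, timeout_s=60)` and then the same with `"tcp"` would give two counts to compare; `./your_app` stands in for your real binary, and the timeout should be well above a normal run time.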