
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] possible bug in 1.3.2 sm transport
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2009-05-11 20:09:01

Bryan Lally wrote:

> I think I've run across a race condition in your latest release.
> Since my demonstrator is somewhat large and cumbersome, I'd like to
> know if you already know about this issue before we start the process
> of providing code and details.
> Basics: openmpi 1.3.2, Fedora 9, 2 x86_64 quad-core cpus in one machine.
> Symptoms: our code hangs, always in the same vicinity, usually at the
> same place, 10-25% of the time. Sometimes more often, sometimes less.
> Our code has run reliably with many MPI implementations for years. We
> haven't added anything recently that is a likely culprit. While we
> have our own issues, this doesn't feel like one of ours.
> We see that there is new code in the shared memory transport between
> 1.3.1 and 1.3.2. Our code doesn't hang with 1.3.1 (nor 1.2.9). Only
> with 1.3.2.
> If we switch to tcp for transport (with mpirun --mca btl tcp,self ...)
> we don't see any hangs. Running using --mca btl sm,self results in
> hangs.
> If we sprinkle a few calls (3) to MPI_Barrier in the vicinity of the
> problem, we no longer see hangs.
> We demonstrate this with 4 processes. When we attach a debugger to
> the hung processes, we see that the hang results from an
> MPI_Allreduce. All processes have made the same call to
> MPI_Allreduce. The processes are all in opal_progress, called (with
> intervening calls) by MPI_Allreduce.
> My question is, have you seen anything like this before? If not, what
> do we do next?
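For reference, the transport isolation described above maps onto mpirun's BTL selection parameters. A sketch of the two invocations, where ./our_code stands in for the (hypothetical) application binary:

```shell
# Shared-memory transport between ranks on the same node:
# with Open MPI 1.3.2 this hangs 10-25% of the time.
mpirun --mca btl sm,self -np 4 ./our_code

# Same run forced onto TCP instead of shared memory: no hangs observed.
mpirun --mca btl tcp,self -np 4 ./our_code
```

Since only the BTL differs between the two runs, this cleanly implicates the sm transport rather than the application code.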

Another user reports something somewhat similar at . That
problem seems to be associated with GCC 4.4.0. What compiler are you using?

In some of our own test runs, we see occasional MPI_Allreduce hangs, but
only after roughly 40,000 trials (rather than 10-25% of the time).

So it may be that others have seen what you are seeing, but we don't (I
don't, at least) yet understand what's going on.